picotron

In the spirit of NanoGPT, we created Picotron: a minimalist, hackable repository for pre-training Llama-like models with 4D parallelism (data, tensor, pipeline, and context parallelism). It is designed with simplicity and education in mind, making it an excellent tool for learning and experimentation.
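
For intuition, below is a minimal sketch of how a 4D process grid can be set up with PyTorch's DeviceMesh. It is an illustration with assumed dimensions, not necessarily how picotron builds its process groups.

    # Illustration only (not picotron's actual code): carve the ranks launched by
    # torchrun into a 4D (dp, cp, pp, tp) grid with PyTorch's DeviceMesh.
    # Run under torchrun with 16 processes.
    from torch.distributed.device_mesh import init_device_mesh

    # Example: 16 ranks = 2 (data) x 2 (context) x 2 (pipeline) x 2 (tensor)
    mesh = init_device_mesh("cuda", (2, 2, 2, 2), mesh_dim_names=("dp", "cp", "pp", "tp"))

    dp_group = mesh["dp"].get_group()  # ranks that average gradients (data parallelism)
    tp_group = mesh["tp"].get_group()  # ranks that shard individual matmuls (tensor parallelism)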

  • The code itself is simple and readable: train.py, model.py and [data|tensor|pipeline|context]_parallel.py are all under 300 lines of code.

  • Performance is not yet optimal but is under active development: we observed 38% MFU on a LLaMA-2-7B model with 64 H100 GPUs and nearly 50% MFU on SmolLM-1.7B with 8 H100 GPUs. More thorough benchmarks will come soon (see the rough MFU sketch below).
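
For reference, MFU (model FLOPs utilization) is the ratio of the model FLOPs a run actually achieves to the hardware's theoretical peak. The sketch below is not taken from this repository: it uses the common ~6 FLOPs per parameter per token approximation, an assumed ~989 TFLOPS dense BF16 peak per H100, and a hypothetical throughput number.

    # Rough MFU estimate (illustration; not picotron's extract_metrics.py).
    def estimate_mfu(num_params, tokens_per_sec, num_gpus, peak_flops_per_gpu=989e12):
        # ~6 FLOPs per parameter per token for a forward + backward pass
        achieved_flops_per_sec = 6 * num_params * tokens_per_sec
        return achieved_flops_per_sec / (num_gpus * peak_flops_per_gpu)

    # Hypothetical example: a 7B-parameter model at 400k tokens/s across 64 H100s
    print(f"MFU ~ {estimate_mfu(7e9, 400_000, 64):.1%}")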

Tutorial videos

Install

pip install -e .

Quick start

  • Get a Hugging Face token here to download models from the Hugging Face Hub

  • GPU

    # Create a config file (JSON format), written under tmp/ by default (see the config sanity-check sketch at the end of this quick start)
    python create_config.py --out_dir tmp --exp_name llama-1B --dp 8 --model_name HuggingFaceTB/SmolLM-1.7B --num_hidden_layers 15  --grad_acc_steps 32 --mbs 4 --seq_len 1024 --hf_token <HF_TOKEN>
    
    # Locally
    torchrun --nproc_per_node 8 train.py --config tmp/llama-1B/config.json 
    
    # 3D Parallelism
    python create_config.py --out_dir tmp --dp 4 --tp 2 --pp 2 --pp_engine 1f1b --exp_name llama-7B --model_name meta-llama/Llama-2-7b-hf  --grad_acc_steps 32 --mbs 4 --seq_len 1024 --hf_token <HF_TOKEN>
    
    # Slurm
    python submit_slurm_jobs.py --inp_dir tmp/llama-7B --qos high --hf_token <HF_TOKEN>
    
  • CPU (expect it to be slow)

    # 3D Parallelism on CPU
    python create_config.py --out_dir tmp --exp_name llama-1B-cpu --dp 2 --tp 2 --pp 2 --pp_engine 1f1b --model_name HuggingFaceTB/SmolLM-1.7B --num_hidden_layers 5  --grad_acc_steps 2 --mbs 4 --seq_len 128 --hf_token <HF_TOKEN> --use_cpu
    
    # Locally
    torchrun --nproc_per_node 8 train.py --config tmp/llama-1B-cpu/config.json
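
As a quick sanity check before launching, you can load the generated JSON and verify that the parallelism degrees multiply to the number of processes you pass to torchrun. This is a sketch only: the field names ("dp", "tp", "pp") are assumptions and may not match the exact schema written by create_config.py.

    # Sanity-check sketch (field names are assumptions, not the actual schema).
    import json

    with open("tmp/llama-1B/config.json") as f:
        cfg = json.load(f)

    print(json.dumps(cfg, indent=2))  # inspect what create_config.py wrote

    # dp * tp * pp should match --nproc_per_node (times the number of nodes)
    dp, tp, pp = cfg.get("dp", 1), cfg.get("tp", 1), cfg.get("pp", 1)
    print("world size needed:", dp * tp * pp)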
    

Acknowledgements