Commit Graph

146 Commits

Author SHA1 Message Date
Ferdinand Mom
099621fd94
Merge pull request #5 from huggingface/add-grad-acc-pp
Add gradient accumulation to PP + fix DP integration with PP (1f1b)
2024-11-04 21:14:04 +01:00
ferdinand.mom
a191212dda Merge branch 'main' into add-grad-acc-pp 2024-11-04 18:42:40 +00:00
Ferdinand Mom
ccf2a0a4ac
Merge pull request #7 from huggingface/refactoring
Refactoring
2024-11-04 19:39:34 +01:00
ferdinand.mom
7f11b912aa some fix 2024-11-04 16:57:00 +00:00
ferdinand.mom
bdaf0d1a1c some cleaning in train 2024-11-04 16:54:49 +00:00
ferdinand.mom
b33a5c8e5d better api when applying parallelism to train 2024-11-04 16:52:08 +00:00
ferdinand.mom
77e85fe490 split/merge into different files tp and dp 2024-11-04 16:26:11 +00:00
ferdinand.mom
db926026a6 folder refactoring + split cp & pp communications near implementation details 2024-11-04 16:10:47 +00:00
ferdinand.mom
1a000975be move model to picotron folder 2024-11-04 15:37:37 +00:00
ferdinand.mom
243f088170 separate dataloader from utils to data.py 2024-11-04 15:36:01 +00:00
ferdinand.mom
a5706858e0 move utils to picotron 2024-11-04 15:33:39 +00:00
ferdinand.mom
8af19d0caa picotron top level folder 2024-11-04 15:29:26 +00:00
ferdinand.mom
e7b4722160 remove unnecessary files 2024-11-04 15:27:53 +00:00
ferdinand.mom
a90a4d1f2e Merge branch 'main' into add-grad-acc-pp 2024-11-04 15:07:10 +00:00
ferdinand.mom
41f49bb15f rename to grad_steps 2024-11-04 15:06:29 +00:00
ferdinand.mom
37be871710 Merge branch 'main' into add-grad-acc-pp 2024-11-04 15:02:43 +00:00
ferdinand.mom
0bfc06506a small changes unrelated to dp+pp sync grad fix 2024-11-04 15:00:43 +00:00
Ferdinand Mom
cecdafe515
Merge branch 'main' into add-grad-acc-pp 2024-11-04 15:56:31 +01:00
Ferdinand Mom
cce11da2cb
Merge pull request #6 from huggingface/pr1
various fix
2024-11-04 15:49:01 +01:00
ferdinand.mom
90868144a7 some dp renaming 2024-11-04 14:48:12 +00:00
ferdinand.mom
814e2a96ad fix multi-node training by using global rank instead of local rank for dist.init_process_group 2024-11-04 14:48:03 +00:00
ferdinand.mom
a44f905254 set num workers to 1 for now to avoid os memory error 2024-11-04 14:39:52 +00:00
ferdinand.mom
e19f74b715 add option for HF token 2024-11-04 14:39:12 +00:00
ferdinand.mom
7bfdf5f7d1 add fuse adam 2024-11-04 14:35:36 +00:00
ferdinand.mom
9d4f0ee4ff fix requirements to avoid drop in throughput 2024-11-04 14:33:07 +00:00
ferdinand.mom
519b506b2b add option to switch between pp engine 2024-11-04 14:32:44 +00:00
ferdinand.mom
7c381a61eb lower timeout in train 2024-11-04 14:28:01 +00:00
ferdinand.mom
4e1a6f8cdd set num worker to 1 otherwise OS memory error 2024-11-04 14:27:50 +00:00
ferdinand.mom
8e36bbe032 fix multi-node training by using global rank instead of local rank to init process_group 2024-11-03 00:14:14 +00:00
ferdinand.mom
b60b94b45f fix requirements to avoid drop in throughput 2024-11-02 02:18:49 +00:00
ferdinand.mom
8741c9b167 change distributed option to pass to multi-node 2024-11-02 02:18:49 +00:00
ferdinand.mom
bd6b8a0972 add hf token + fix multi-node training with torchrun args 2024-11-02 02:18:40 +00:00
ferdinand.mom
486c1763a6 add fuse adam 2024-11-02 01:38:14 +00:00
ferdinand.mom
7996a318c1 fix DP integration within PP (1f1b) 2024-11-01 20:08:48 +00:00
ferdinand.mom
2bafa3117d renaming + add option to switch between pp_engine (afab or 1f1b) 2024-11-01 20:07:57 +00:00
ferdinand.mom
f6c9a39d17 fix splitting input twice for context parallel (done in dataloader) 2024-10-30 15:43:42 +00:00
ferdinand.mom
363dbd5c05 need to update max position embedding when seq_len is greater (for rope) 2024-10-30 15:12:06 +00:00
ferdinand.mom
508d57f948 don't hardcode path 2024-10-30 14:58:41 +00:00
ferdinand.mom
f74bff79e0 cleaning 2024-10-30 14:58:41 +00:00
ferdinand.mom
2d198659e2 add slurm support 2024-10-30 14:58:41 +00:00
ferdinand.mom
fdf2df8344 add wandb
eaezaeea
2024-10-30 14:58:41 +00:00
ferdinand.mom
3c635092f9 add assert in TensorParallel for num_attention_heads and key_values_heads 2024-10-30 14:58:41 +00:00
ferdinand.mom
1dbe034d57 better config creation 2024-10-30 14:58:41 +00:00
ferdinand.mom
402aa4ccfc small change 2024-10-30 14:58:41 +00:00
zzhhjjj
f1f6915ba1 1f1b fix 2024-10-30 14:58:41 +00:00
zzhhjjj
c7a3fb016a disable grad sync in afab 2024-10-30 14:58:40 +00:00
ferdinand.mom
47c00be8c7 breaking: add slurm stuff 2024-10-29 15:44:35 +00:00
ferdinand.mom
987a7c5c99 add todo ring attention 2024-10-29 14:18:07 +00:00
ferdinand.mom
46af5b0425 some fixes 2024-10-29 14:17:42 +00:00
zzhhjjj
b7f3e253be add context parallel 2024-10-29 13:42:38 +00:00