Commit Graph

82 Commits

Author SHA1 Message Date
Ferdinand Mom
ccf2a0a4ac Merge pull request #7 from huggingface/refactoring (Refactoring) 2024-11-04 19:39:34 +01:00
ferdinand.mom
7f11b912aa some fix 2024-11-04 16:57:00 +00:00
ferdinand.mom
bdaf0d1a1c some cleaning in train 2024-11-04 16:54:49 +00:00
ferdinand.mom
b33a5c8e5d better api when applying parallelism to train 2024-11-04 16:52:08 +00:00
ferdinand.mom
77e85fe490 split/merge into different files tp and dp 2024-11-04 16:26:11 +00:00
ferdinand.mom
db926026a6 folder refactoring + split cp & pp communications near implementation details 2024-11-04 16:10:47 +00:00
ferdinand.mom
1a000975be move model to picotron folder 2024-11-04 15:37:37 +00:00
ferdinand.mom
243f088170 separate dataloader from utils to data.py 2024-11-04 15:36:01 +00:00
ferdinand.mom
a5706858e0 move utils to picotron 2024-11-04 15:33:39 +00:00
ferdinand.mom
8af19d0caa picotron top level folder 2024-11-04 15:29:26 +00:00
ferdinand.mom
e7b4722160 remove unecessary files 2024-11-04 15:27:53 +00:00
ferdinand.mom
41f49bb15f rename to grad_steps 2024-11-04 15:06:29 +00:00
ferdinand.mom
0bfc06506a small changes unrelated to dp+pp sync grad fix 2024-11-04 15:00:43 +00:00
Ferdinand Mom
cce11da2cb Merge pull request #6 from huggingface/pr1 (various fix) 2024-11-04 15:49:01 +01:00
ferdinand.mom
90868144a7 some dp renaming 2024-11-04 14:48:12 +00:00
ferdinand.mom
814e2a96ad fix multi-node training by using global rank instead of local rank for dist.init_process_group 2024-11-04 14:48:03 +00:00
ferdinand.mom
a44f905254 set num workers to 1 for now to avoid os memory error 2024-11-04 14:39:52 +00:00
ferdinand.mom
e19f74b715 add option for HF token 2024-11-04 14:39:12 +00:00
ferdinand.mom
7bfdf5f7d1 add fuse adam 2024-11-04 14:35:36 +00:00
ferdinand.mom
9d4f0ee4ff fix requirements to avoid drop in throughput 2024-11-04 14:33:07 +00:00
ferdinand.mom
519b506b2b add option to switch between pp engine 2024-11-04 14:32:44 +00:00
ferdinand.mom
f6c9a39d17 fix spliting input twice for context parallel (done in dataloader) 2024-10-30 15:43:42 +00:00
ferdinand.mom
363dbd5c05 need to update max position embeding when seq_len is greater (for rope) 2024-10-30 15:12:06 +00:00
ferdinand.mom
508d57f948 dont hardcode path 2024-10-30 14:58:41 +00:00
ferdinand.mom
f74bff79e0 cleaning 2024-10-30 14:58:41 +00:00
ferdinand.mom
2d198659e2 add slurm support 2024-10-30 14:58:41 +00:00
ferdinand.mom
fdf2df8344 add wandb (eaezaeea) 2024-10-30 14:58:41 +00:00
ferdinand.mom
3c635092f9 add assert in TensorParallel for num_attention_heads and key_values_heads 2024-10-30 14:58:41 +00:00
ferdinand.mom
1dbe034d57 better config creation 2024-10-30 14:58:41 +00:00
ferdinand.mom
402aa4ccfc small change 2024-10-30 14:58:41 +00:00
zzhhjjj
f1f6915ba1 1f1b fix 2024-10-30 14:58:41 +00:00
zzhhjjj
c7a3fb016a disable grad sync in afab 2024-10-30 14:58:40 +00:00
ferdinand.mom
47c00be8c7 breaking: add slurm stuff 2024-10-29 15:44:35 +00:00
ferdinand.mom
987a7c5c99 add todo ring attention 2024-10-29 14:18:07 +00:00
ferdinand.mom
46af5b0425 some fixes 2024-10-29 14:17:42 +00:00
zzhhjjj
b7f3e253be add context parallel 2024-10-29 13:42:38 +00:00
zzhhjjj
6220892716 refactor 2024-10-28 20:44:15 +00:00
zzhhjjj
5181c4cd87 typo 2024-10-28 11:02:38 +00:00
zzhhjjj
a17ddb691f gradient accumulation 2024-10-28 07:46:23 +00:00
zzhhjjj
2f8c87f4d1 save/load weights 2024-10-28 05:19:59 +00:00
zzhhjjj
928ada77b8 process group order 2024-10-27 04:56:54 +00:00
zzhhjjj
762127afcd some logs,will clean later 2024-10-27 02:22:36 +00:00
zzhhjjj
e5cfb5240e match TP loss 2024-10-27 02:22:05 +00:00
zzhhjjj
51b5683dd3 match tp+pp loss 2024-10-27 02:20:18 +00:00
zzhhjjj
63307c79a1 add some logs, refactor dataloader 2024-10-23 00:38:27 +00:00
zzhhjjj
ec1e1e5ccf support bf16, all reduce loss 2024-10-22 23:38:44 +00:00
zzhhjjj
a6d79b07b5 add cuda kernels 2024-10-22 22:38:29 +00:00
zzhhjjj
9a7904d5d6 revert some change 2024-10-22 19:50:23 +00:00
ferdinand.mom
9d53e9afa6 use global pgm for ddp 2024-10-18 15:51:26 +00:00
ferdinand.mom
2b2781a374 made Tensor Parallel API compliant 2024-10-18 15:51:26 +00:00