Commit Graph

58 Commits

Author SHA1 Message Date
ferdinand.mom
f74bff79e0 cleaning 2024-10-30 14:58:41 +00:00
ferdinand.mom
2d198659e2 add slurm support 2024-10-30 14:58:41 +00:00
ferdinand.mom
fdf2df8344 add wandb 2024-10-30 14:58:41 +00:00
ferdinand.mom
3c635092f9 add assert in TensorParallel for num_attention_heads and key_values_heads 2024-10-30 14:58:41 +00:00
ferdinand.mom
1dbe034d57 better config creation 2024-10-30 14:58:41 +00:00
ferdinand.mom
402aa4ccfc small change 2024-10-30 14:58:41 +00:00
zzhhjjj
f1f6915ba1 1f1b fix 2024-10-30 14:58:41 +00:00
zzhhjjj
c7a3fb016a disable grad sync in afab 2024-10-30 14:58:40 +00:00
ferdinand.mom
47c00be8c7 breaking: add slurm stuff 2024-10-29 15:44:35 +00:00
ferdinand.mom
987a7c5c99 add todo ring attention 2024-10-29 14:18:07 +00:00
ferdinand.mom
46af5b0425 some fixes 2024-10-29 14:17:42 +00:00
zzhhjjj
b7f3e253be add context parallel 2024-10-29 13:42:38 +00:00
zzhhjjj
6220892716 refactor 2024-10-28 20:44:15 +00:00
zzhhjjj
5181c4cd87 typo 2024-10-28 11:02:38 +00:00
zzhhjjj
a17ddb691f gradient accumulation 2024-10-28 07:46:23 +00:00
zzhhjjj
2f8c87f4d1 save/load weights 2024-10-28 05:19:59 +00:00
zzhhjjj
928ada77b8 process group order 2024-10-27 04:56:54 +00:00
zzhhjjj
762127afcd some logs, will clean later 2024-10-27 02:22:36 +00:00
zzhhjjj
e5cfb5240e match TP loss 2024-10-27 02:22:05 +00:00
zzhhjjj
51b5683dd3 match tp+pp loss 2024-10-27 02:20:18 +00:00
zzhhjjj
63307c79a1 add some logs, refactor dataloader 2024-10-23 00:38:27 +00:00
zzhhjjj
ec1e1e5ccf support bf16, all reduce loss 2024-10-22 23:38:44 +00:00
zzhhjjj
a6d79b07b5 add cuda kernels 2024-10-22 22:38:29 +00:00
zzhhjjj
9a7904d5d6 revert some change 2024-10-22 19:50:23 +00:00
ferdinand.mom
9d53e9afa6 use global pgm for ddp 2024-10-18 15:51:26 +00:00
ferdinand.mom
2b2781a374 made Tensor Parallel API compliant 2024-10-18 15:51:26 +00:00
ferdinand.mom
abd1edf9f9 all_reduce loss across pp/dp ranks + base_parallel 2024-10-18 15:51:17 +00:00
ferdinand.mom
1ebd3de5be Merge DDP + TP from @zzhhjjj 2024-10-18 15:05:01 +00:00
ferdinand.mom
83ddda2ce8 leave out CP integration at the very end 2024-10-18 14:59:39 +00:00
ferdinand.mom
d0d6d8994f use global pgm for ddp 2024-10-18 14:59:26 +00:00
ferdinand.mom
134d48b658 remove merged qkv 2024-10-18 14:59:04 +00:00
ferdinand.mom
0b1d02a402 various fix (modeling, dataloader, cpu load) 2024-10-18 14:33:46 +00:00
zzhhjjj
7377238741 tensor parallel, will clean later 2024-10-18 05:13:44 +00:00
zzhhjjj
54ad77e055 Merge branch 'main' into ddp-merge 2024-10-16 19:13:48 +00:00
zzhhjjj
24ff8d05fd add DDP 2024-10-16 16:48:55 +00:00
zzhhjjj
5139a32211 repo structure change 2024-10-16 16:44:39 +00:00
zzhhjjj
1aba6079e8 model file change. This requires some change on PP 2024-10-16 16:41:12 +00:00
zzhhjjj
6080c2f26b dataloader 2024-10-16 15:58:35 +00:00
ferdinand.mom
81726dfffe accelerate dataset mapping 2024-10-15 13:32:44 +00:00
ferdinand.mom
1ca7365506 stitch cp dp cp together 2024-10-15 13:06:17 +00:00
ferdinand.mom
ffea3d2ad1 add context parallel for training 2024-10-15 12:43:28 +00:00
ferdinand.mom
1e229cae88 renaming 2024-10-14 09:26:31 +00:00
ferdinand.mom
3095ff4d4f refactor organisation 2024-10-10 15:12:14 +00:00
ferdinand.mom
47581d29e9 make new modeling compatible with training 2024-10-10 15:08:23 +00:00
ferdinand.mom
770800b978 add new modeling 2024-10-10 14:57:17 +00:00
ferdinand.mom
8c155f47ce all reduce gradient across DP & CP ranks 2024-09-26 14:00:06 +00:00
ferdinand.mom
31b5fb9efc ugly ass display of grid (to be changed) 2024-09-26 13:45:53 +00:00
ferdinand.mom
b8065de7aa support CPU training through gloo backend 2024-09-26 10:27:20 +00:00
ferdinand.mom
6f6bc1945a add wandb support 2024-09-25 14:19:16 +00:00
ferdinand.mom
cfbf6c170e every rank has now the loss 2024-09-25 14:12:31 +00:00