Commit Graph

68 Commits

Author | SHA1 | Message | Date
ferdinand.mom | 81726dfffe | accelerate dataset mapping | 2024-10-15 13:32:44 +00:00
ferdinand.mom | 1ca7365506 | stitch cp dp cp together | 2024-10-15 13:06:17 +00:00
ferdinand.mom | ffea3d2ad1 | add context parallel for training | 2024-10-15 12:43:28 +00:00
ferdinand.mom | 1e229cae88 | renaming | 2024-10-14 09:26:31 +00:00
ferdinand.mom | 3095ff4d4f | refactor organisation | 2024-10-10 15:12:14 +00:00
ferdinand.mom | 47581d29e9 | make new modeling compatible with training | 2024-10-10 15:08:23 +00:00
ferdinand.mom | 8c155f47ce | all reduce gradient across DP & CP ranks | 2024-09-26 14:00:06 +00:00
ferdinand.mom | 31b5fb9efc | ugly ass display of grid (to be changed) | 2024-09-26 13:45:53 +00:00
ferdinand.mom | b8065de7aa | support CPU training through gloo backend | 2024-09-26 10:27:20 +00:00
ferdinand.mom | 6f6bc1945a | add wandb support | 2024-09-25 14:19:16 +00:00
ferdinand.mom | cfbf6c170e | every rank has now the loss | 2024-09-25 14:12:31 +00:00
ferdinand.mom | b2e276d3b8 | rename parallel_context to process_group_manager | 2024-09-25 13:33:20 +00:00
ferdinand.mom | 9e9ef8236e | refactor to decouple pp training with normal training | 2024-09-25 13:17:05 +00:00
ferdinand.mom | e2c0747fe3 | add naive DP | 2024-09-25 12:36:22 +00:00
ferdinand.mom | 7ba1383ebb | fixing socket bug by using dist.new_subgroups_by_enumeration instead | 2024-09-24 13:43:22 +00:00
ferdinand.mom | 7a57407c54 | breaking: socketStartConnect: Connect to <ip address> failed : Software caused connection abort | 2024-09-23 21:14:48 +00:00
ferdinand.mom | bce75fd508 | enhance parallel context to handle 3D | 2024-09-23 10:28:01 +00:00
ferdinand.mom | c36d415b47 | add training and generate for pp | 2024-09-19 14:06:46 +00:00