ferdinand.mom | 6f6bc1945a | add wandb support | 2024-09-25 14:19:16 +00:00
ferdinand.mom | cfbf6c170e | every rank has now the loss | 2024-09-25 14:12:31 +00:00
ferdinand.mom | b2e276d3b8 | rename parallel_context to process_group_manager | 2024-09-25 13:33:20 +00:00
ferdinand.mom | 9e9ef8236e | refactor to decouple pp training with normal training | 2024-09-25 13:17:05 +00:00
ferdinand.mom | e2c0747fe3 | add naive DP | 2024-09-25 12:36:22 +00:00
ferdinand.mom | 7ba1383ebb | fixing socket bug by using dist.new_subgroups_by_enumeration instead | 2024-09-24 13:43:22 +00:00
ferdinand.mom | 7a57407c54 | breaking: socketStartConnect: Connect to <ip address> failed : Software caused connection abort | 2024-09-23 21:14:48 +00:00
ferdinand.mom | bce75fd508 | enhance parallel context to handle 3D | 2024-09-23 10:28:01 +00:00
ferdinand.mom | c36d415b47 | add training and generate for pp | 2024-09-19 14:06:46 +00:00