ferdinand.mom | 3095ff4d4f | refactor organisation | 2024-10-10 15:12:14 +00:00
ferdinand.mom | 47581d29e9 | make new modeling compatible with training | 2024-10-10 15:08:23 +00:00
ferdinand.mom | 8c155f47ce | all reduce gradient across DP & CP ranks | 2024-09-26 14:00:06 +00:00
ferdinand.mom | 31b5fb9efc | ugly ass display of grid (to be changed) | 2024-09-26 13:45:53 +00:00
ferdinand.mom | b8065de7aa | support CPU training through gloo backend | 2024-09-26 10:27:20 +00:00
ferdinand.mom | 6f6bc1945a | add wandb support | 2024-09-25 14:19:16 +00:00
ferdinand.mom | cfbf6c170e | every rank has now the loss | 2024-09-25 14:12:31 +00:00
ferdinand.mom | b2e276d3b8 | rename parallel_context to process_group_manager | 2024-09-25 13:33:20 +00:00
ferdinand.mom | 9e9ef8236e | refactor to decouple pp training with normal training | 2024-09-25 13:17:05 +00:00
ferdinand.mom | e2c0747fe3 | add naive DP | 2024-09-25 12:36:22 +00:00
ferdinand.mom | 7ba1383ebb | fixing socket bug by using dist.new_subgroups_by_enumeration instead | 2024-09-24 13:43:22 +00:00
ferdinand.mom | 7a57407c54 | breaking: socketStartConnect: Connect to <ip address> failed : Software caused connection abort | 2024-09-23 21:14:48 +00:00
ferdinand.mom | bce75fd508 | enhance parallel context to handle 3D | 2024-09-23 10:28:01 +00:00
ferdinand.mom | c36d415b47 | add training and generate for pp | 2024-09-19 14:06:46 +00:00