ferdinand.mom | 4e1a6f8cdd | set num worker to 1 otherwise OS memory error | 2024-11-04 14:27:50 +00:00
ferdinand.mom | 8e36bbe032 | fix multi-node training by using global rank instead of local rank to init process_group | 2024-11-03 00:14:14 +00:00
ferdinand.mom | 8741c9b167 | change distributed option to pass to multi-node | 2024-11-02 02:18:49 +00:00
ferdinand.mom | bd6b8a0972 | add hf token + fix multi-node training with torchrun args | 2024-11-02 02:18:40 +00:00
ferdinand.mom | 486c1763a6 | add fuse adam | 2024-11-02 01:38:14 +00:00
ferdinand.mom | f74bff79e0 | cleaning | 2024-10-30 14:58:41 +00:00