Commit Graph

6 Commits

Author         SHA1        Date                        Message
ferdinand.mom  4e1a6f8cdd  2024-11-04 14:27:50 +00:00  set num worker to 1 otherwise OS memory error
ferdinand.mom  8e36bbe032  2024-11-03 00:14:14 +00:00  fix multi-node training by using global rank instead of local rank to init process_group
ferdinand.mom  8741c9b167  2024-11-02 02:18:49 +00:00  change distributed option to pass to multi-node
ferdinand.mom  bd6b8a0972  2024-11-02 02:18:40 +00:00  add hf token + fix multi-node training with torchrun args
ferdinand.mom  486c1763a6  2024-11-02 01:38:14 +00:00  add fuse adam
ferdinand.mom  f74bff79e0  2024-10-30 14:58:41 +00:00  cleaning
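
Note on 8e36bbe032: the message refers to the difference between a process's global rank and its local rank in a multi-node job. The sketch below is not the commit's actual diff. It is a minimal illustration that assumes the job is launched with torchrun, which exports `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` to each worker. The `RankInfo` class and `resolve_ranks` helper are hypothetical names introduced only for this example.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class RankInfo:
    global_rank: int  # unique across every process in the job
    local_rank: int   # unique only among processes on the same node
    world_size: int   # total number of processes in the job


def resolve_ranks(env: Mapping[str, str] = os.environ) -> RankInfo:
    """Read the rank variables that torchrun exports for each worker (hypothetical helper)."""
    return RankInfo(
        global_rank=int(env.get("RANK", "0")),
        local_rank=int(env.get("LOCAL_RANK", "0")),
        world_size=int(env.get("WORLD_SIZE", "1")),
    )


# Assumed layout: 2 nodes x 4 GPUs; this is the second process on the second node.
info = resolve_ranks({"RANK": "5", "LOCAL_RANK": "1", "WORLD_SIZE": "8"})

# The global rank is the value to pass as `rank` to
# torch.distributed.init_process_group, together with `world_size`.
# The local rank only selects the GPU on the current node,
# for example via torch.cuda.set_device(info.local_rank).
print(info)
```

On a single node the two ranks coincide, which is likely why passing the local rank can appear to work until a second node is added: processes on different nodes would then register with the same rank.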