Commit Graph

15 Commits

Author SHA1 Message Date
ferdinand.mom
7daefd31ee Merge branch 'main' into loading_big_model 2024-12-18 17:02:48 +00:00
ferdinand.mom
b647f58289 fix stuff to make it CPU compliant 2024-12-18 16:50:36 +00:00
ferdinand.mom
b49ddac4b4 add tqdm when subprocess 2024-12-18 16:07:34 +00:00
ferdinand.mom
f1053e3cbe revert to use huggingface cli + hf_transfers (this will not create snapshots/blob folder etc through CLI use) 2024-12-18 15:51:04 +00:00
ferdinand.mom
0360ec0d2a use hf_transfer, which improves download time by 3x 2024-12-18 14:51:14 +00:00
ferdinand.mom
75cd0d77f9 download safetensors at config creation time. If we do it in training, barrier() may time out while waiting for download 2024-12-17 15:46:16 +00:00
ferdinand.mom
b57b8277d1 breaking: add new version of initi meta device but memory leaks 2024-12-17 15:46:16 +00:00
ferdinand.mom
859650a2c0 breaking: refactor loading big model to only download safetensors files 2024-12-17 15:46:09 +00:00
ferdinand.mom
00ddbd9d2e raise Exception when not enough layers to distribute in rank + rename variable 2024-12-03 13:17:52 +00:00
ferdinand.mom
32d8daa880 can now load big model through safetensors (sharded and single file) 2024-12-01 19:39:16 +00:00
ferdinand.mom
41f49bb15f rename to grad_steps 2024-11-04 15:06:29 +00:00
ferdinand.mom
0bfc06506a small changes unrelated to dp+pp sync grad fix 2024-11-04 15:00:43 +00:00
ferdinand.mom
7bfdf5f7d1 add fuse adam 2024-11-04 14:35:36 +00:00
ferdinand.mom
519b506b2b add option to switch between pp engine 2024-11-04 14:32:44 +00:00
ferdinand.mom
f74bff79e0 cleaning 2024-10-30 14:58:41 +00:00