ferdinand.mom
|
7daefd31ee
|
Merge branch 'main' into loading_big_model
|
2024-12-18 17:02:48 +00:00 |
|
ferdinand.mom
|
b647f58289
|
fix stuff to make it CPU compliant
|
2024-12-18 16:50:36 +00:00 |
|
ferdinand.mom
|
b49ddac4b4
|
add tqdm when subprocess
|
2024-12-18 16:07:34 +00:00 |
|
ferdinand.mom
|
f1053e3cbe
|
revert to using huggingface cli + hf_transfer (using the CLI will not create snapshots/blob folders etc.)
|
2024-12-18 15:51:04 +00:00 |
|
ferdinand.mom
|
0360ec0d2a
|
use hf_transfer, which improves download time by 3x
|
2024-12-18 14:51:14 +00:00 |
|
ferdinand.mom
|
75cd0d77f9
|
download safetensors at config creation time. If we do it in training, barrier() may time out while waiting for the download
|
2024-12-17 15:46:16 +00:00 |
|
ferdinand.mom
|
b57b8277d1
|
breaking: add new version of init meta device, but it leaks memory
|
2024-12-17 15:46:16 +00:00 |
|
ferdinand.mom
|
859650a2c0
|
breaking: refactor loading big model to only download safetensors files
|
2024-12-17 15:46:09 +00:00 |
|
ferdinand.mom
|
00ddbd9d2e
|
raise Exception when there are not enough layers to distribute across ranks + rename variable
|
2024-12-03 13:17:52 +00:00 |
|
ferdinand.mom
|
32d8daa880
|
can now load big model through safetensors (sharded and single file)
|
2024-12-01 19:39:16 +00:00 |
|
ferdinand.mom
|
41f49bb15f
|
rename to grad_steps
|
2024-11-04 15:06:29 +00:00 |
|
ferdinand.mom
|
0bfc06506a
|
small changes unrelated to dp+pp sync grad fix
|
2024-11-04 15:00:43 +00:00 |
|
ferdinand.mom
|
7bfdf5f7d1
|
add fused adam
|
2024-11-04 14:35:36 +00:00 |
|
ferdinand.mom
|
519b506b2b
|
add option to switch between pp engines
|
2024-11-04 14:32:44 +00:00 |
|
ferdinand.mom
|
f74bff79e0
|
cleaning
|
2024-10-30 14:58:41 +00:00 |
|