Commit Graph

146 Commits

Author SHA1 Message Date
Ferdinand Mom
df3ae8a5f0
Update README.md 2024-12-20 13:56:34 +01:00
Ferdinand Mom
bf03420686
Update README.md 2024-12-19 09:45:01 +01:00
Ferdinand Mom
164ab81e27
Update README.md 2024-12-19 09:24:14 +01:00
Haojun Zhao
78ba56ce80
Merge pull request #11 from eliebak/patch-1
Update LICENSE
2024-12-19 03:19:53 -05:00
elie
009bb0b2a8
Update LICENSE 2024-12-19 09:14:13 +01:00
Ferdinand Mom
59b841e3cb
Create LICENSE 2024-12-19 08:57:47 +01:00
zzhhjjj
7ef2344cd4 refactor 2024-12-19 07:05:16 +00:00
zzhhjjj
d855aead9e readme 2024-12-19 06:31:03 +00:00
zzhhjjj
e82c719f31 Update Readme 2024-12-19 06:04:04 +00:00
zzhhjjj
a727b986cb refactor 2024-12-19 05:48:29 +00:00
Haojun Zhao
439e23fdba
Merge pull request #10 from huggingface/loading_big_model
Loading big model
2024-12-19 00:29:16 -05:00
ferdinand.mom
7daefd31ee Merge branch 'main' into loading_big_model 2024-12-18 17:02:48 +00:00
ferdinand.mom
b647f58289 fix stuff to make it CPU compliant 2024-12-18 16:50:36 +00:00
ferdinand.mom
b49ddac4b4 add tqdm when subprocess 2024-12-18 16:07:34 +00:00
zzhhjjj
fc3b50b033 small change for dataloader arguments 2024-12-18 15:55:55 +00:00
ferdinand.mom
f1053e3cbe revert to using huggingface cli + hf_transfer (this will not create snapshots/blob folder etc through CLI use) 2024-12-18 15:51:04 +00:00
ferdinand.mom
0360ec0d2a use hf_transfer which improves download time by 3 2024-12-18 14:51:14 +00:00
ferdinand.mom
86c9b91d02 Revert to @zzhhjjj class naming as it is more expressive 2024-12-17 15:55:18 +00:00
ferdinand.mom
75cd0d77f9 download safetensors at config creation time. If we do it in training, barrier() may time out while waiting for download 2024-12-17 15:46:16 +00:00
ferdinand.mom
b57b8277d1 breaking: add new version of init meta device but memory leaks 2024-12-17 15:46:16 +00:00
ferdinand.mom
859650a2c0 breaking: refactor loading big model to only download safetensors files 2024-12-17 15:46:09 +00:00
ferdinand.mom
43f39ff9ec Merge remote-tracking branch 'origin/main' into loading_big_model 2024-12-17 15:45:38 +00:00
ferdinand.mom
b0ea5066ad small changes 2024-12-17 05:01:35 +00:00
Haojun Zhao
55efb321f9
Merge pull request #9 from huggingface/async_tp
Async tp
2024-12-14 07:24:35 -05:00
zzhhjjj
481eeb8377 stop iteration fix. recreate a new dataloader 2024-12-13 13:29:11 +00:00
ferdinand.mom
b390a0101e add mfu parsing 2024-12-04 13:08:28 +00:00
ferdinand.mom
aaa4a083e9 remove clone() in tp communications as torch.compile will optimize this out anyway 2024-12-03 16:26:41 +00:00
ferdinand.mom
09dfd1676f broadcast tokenizer to every rank as well 2024-12-03 14:20:44 +00:00
ferdinand.mom
00ddbd9d2e raise Exception when not enough layers to distribute in rank + rename variable 2024-12-03 13:17:52 +00:00
ferdinand.mom
b80091e8ec set 1f1b by default 2024-12-03 10:10:12 +00:00
ferdinand.mom
86a0fc5e3d avoid OS memory error with num_workers > 1 2024-12-02 18:33:35 +00:00
ferdinand.mom
75939867d9 small fix on world_size with pgm 2024-12-02 18:12:02 +00:00
ferdinand.mom
b6267c768e fix issue with too many ranks reading to HF library 2024-12-02 15:36:47 +00:00
Ferdinand Mom
52a9779345
Merge branch 'main' into loading_big_model 2024-12-01 16:34:43 -04:00
ferdinand.mom
a84a9d5942 now handle TP + PP meta device 2024-12-01 20:26:40 +00:00
ferdinand.mom
bccee5d037 clean long line of hyperparameters 2024-12-01 20:00:05 +00:00
ferdinand.mom
804f43c97e more consistent naming 2024-12-01 19:45:11 +00:00
ferdinand.mom
32d8daa880 can now load big model through safetensors (sharded and single file) 2024-12-01 19:39:16 +00:00
ferdinand.mom
012aad3167 fix extract metrics 2024-12-01 03:43:04 +00:00
ferdinand.mom
3c6c1e3af1 add reset parameters for initialize_model_with_materialized_weights 2024-12-01 03:43:04 +00:00
ferdinand.mom
270c469531 refactor tensor parallel 2024-12-01 03:43:04 +00:00
ferdinand.mom
daea1fed3f refactor checkpoint 2024-12-01 03:43:00 +00:00
ferdinand.mom
5045be87e0 wip: load big model with meta device 2024-11-29 16:38:42 +00:00
zzhhjjj
069a17237f update wandb_log + set async default to true 2024-11-21 17:48:26 +00:00
zzhhjjj
ca1fcec87f Merge branch 'main' into async_tp 2024-11-20 02:00:33 +00:00
zzhhjjj
ebef9a36e3 remove redundancy 2024-11-20 01:58:44 +00:00
zzhhjjj
2d6d9fb6b1 async TP + test 2024-11-20 01:55:02 +00:00
zzhhjjj
16d85cdb3a mfu ref/typo 2024-11-18 17:57:02 +00:00
Ferdinand Mom
a2ce795837
Merge pull request #8 from huggingface/add_mfu
add mfu, get number of parameters
2024-11-18 13:45:32 -04:00
zzhhjjj
191f7425e1 add mfu, get number of parameters 2024-11-18 17:36:51 +00:00