Ferdinand Mom
|
df3ae8a5f0
|
Update README.md
|
2024-12-20 13:56:34 +01:00 |
|
Ferdinand Mom
|
bf03420686
|
Update README.md
|
2024-12-19 09:45:01 +01:00 |
|
Ferdinand Mom
|
164ab81e27
|
Update README.md
|
2024-12-19 09:24:14 +01:00 |
|
Haojun Zhao
|
78ba56ce80
|
Merge pull request #11 from eliebak/patch-1
Update LICENSE
|
2024-12-19 03:19:53 -05:00 |
|
elie
|
009bb0b2a8
|
Update LICENSE
|
2024-12-19 09:14:13 +01:00 |
|
Ferdinand Mom
|
59b841e3cb
|
Create LICENSE
|
2024-12-19 08:57:47 +01:00 |
|
zzhhjjj
|
7ef2344cd4
|
refactor
|
2024-12-19 07:05:16 +00:00 |
|
zzhhjjj
|
d855aead9e
|
readme
|
2024-12-19 06:31:03 +00:00 |
|
zzhhjjj
|
e82c719f31
|
Update Readme
|
2024-12-19 06:04:04 +00:00 |
|
zzhhjjj
|
a727b986cb
|
refactor
|
2024-12-19 05:48:29 +00:00 |
|
Haojun Zhao
|
439e23fdba
|
Merge pull request #10 from huggingface/loading_big_model
Loading big model
|
2024-12-19 00:29:16 -05:00 |
|
ferdinand.mom
|
7daefd31ee
|
Merge branch 'main' into loading_big_model
|
2024-12-18 17:02:48 +00:00 |
|
ferdinand.mom
|
b647f58289
|
fix stuff to make it CPU compliant
|
2024-12-18 16:50:36 +00:00 |
|
ferdinand.mom
|
b49ddac4b4
|
add tqdm when subprocess
|
2024-12-18 16:07:34 +00:00 |
|
zzhhjjj
|
fc3b50b033
|
small change for dataloader arguments
|
2024-12-18 15:55:55 +00:00 |
|
ferdinand.mom
|
f1053e3cbe
|
revert to using huggingface cli + hf_transfer (this will not create snapshots/blob folder etc through CLI use)
|
2024-12-18 15:51:04 +00:00 |
|
ferdinand.mom
|
0360ec0d2a
|
use hf_transfer, which improves download time by 3x
|
2024-12-18 14:51:14 +00:00 |
|
ferdinand.mom
|
86c9b91d02
|
Revert to @zzhhjjj class naming as it is more expressive
|
2024-12-17 15:55:18 +00:00 |
|
ferdinand.mom
|
75cd0d77f9
|
download safetensors at config creation time. If we do it in training, barrier() may time out while waiting for the download
|
2024-12-17 15:46:16 +00:00 |
|
ferdinand.mom
|
b57b8277d1
|
breaking: add new version of init meta device, but it leaks memory
|
2024-12-17 15:46:16 +00:00 |
|
ferdinand.mom
|
859650a2c0
|
breaking: refactor loading big model to only download safetensors files
|
2024-12-17 15:46:09 +00:00 |
|
ferdinand.mom
|
43f39ff9ec
|
Merge remote-tracking branch 'origin/main' into loading_big_model
|
2024-12-17 15:45:38 +00:00 |
|
ferdinand.mom
|
b0ea5066ad
|
small changes
|
2024-12-17 05:01:35 +00:00 |
|
Haojun Zhao
|
55efb321f9
|
Merge pull request #9 from huggingface/async_tp
Async tp
|
2024-12-14 07:24:35 -05:00 |
|
zzhhjjj
|
481eeb8377
|
stop iteration fix. recreate a new dataloader
|
2024-12-13 13:29:11 +00:00 |
|
ferdinand.mom
|
b390a0101e
|
add mfu parsing
|
2024-12-04 13:08:28 +00:00 |
|
ferdinand.mom
|
aaa4a083e9
|
remove clone() in tp communications as torch.compile will optimize this out anyway
|
2024-12-03 16:26:41 +00:00 |
|
ferdinand.mom
|
09dfd1676f
|
broadcast tokenizer to every rank as well
|
2024-12-03 14:20:44 +00:00 |
|
ferdinand.mom
|
00ddbd9d2e
|
raise Exception when not enough layers to distribute across ranks + rename variable
|
2024-12-03 13:17:52 +00:00 |
|
ferdinand.mom
|
b80091e8ec
|
set 1f1b by default
|
2024-12-03 10:10:12 +00:00 |
|
ferdinand.mom
|
86a0fc5e3d
|
avoid OS memory error with num_workers > 1
|
2024-12-02 18:33:35 +00:00 |
|
ferdinand.mom
|
75939867d9
|
small fix on world_size with pgm
|
2024-12-02 18:12:02 +00:00 |
|
ferdinand.mom
|
b6267c768e
|
fix issue with too many ranks reading from HF library
|
2024-12-02 15:36:47 +00:00 |
|
Ferdinand Mom
|
52a9779345
|
Merge branch 'main' into loading_big_model
|
2024-12-01 16:34:43 -04:00 |
|
ferdinand.mom
|
a84a9d5942
|
now handle TP + PP meta device
|
2024-12-01 20:26:40 +00:00 |
|
ferdinand.mom
|
bccee5d037
|
clean long line of hyperparameters
|
2024-12-01 20:00:05 +00:00 |
|
ferdinand.mom
|
804f43c97e
|
more consistent naming
|
2024-12-01 19:45:11 +00:00 |
|
ferdinand.mom
|
32d8daa880
|
can now load big model through safetensors (sharded and single file)
|
2024-12-01 19:39:16 +00:00 |
|
ferdinand.mom
|
012aad3167
|
fix extract metrics
|
2024-12-01 03:43:04 +00:00 |
|
ferdinand.mom
|
3c6c1e3af1
|
add reset parameters for initialize_model_with_materialized_weights
|
2024-12-01 03:43:04 +00:00 |
|
ferdinand.mom
|
270c469531
|
refactor tensor parallel
|
2024-12-01 03:43:04 +00:00 |
|
ferdinand.mom
|
daea1fed3f
|
refactor checkpoint
|
2024-12-01 03:43:00 +00:00 |
|
ferdinand.mom
|
5045be87e0
|
wip: load big model with meta device
|
2024-11-29 16:38:42 +00:00 |
|
zzhhjjj
|
069a17237f
|
update wandb_log + set async default to true
|
2024-11-21 17:48:26 +00:00 |
|
zzhhjjj
|
ca1fcec87f
|
Merge branch 'main' into async_tp
|
2024-11-20 02:00:33 +00:00 |
|
zzhhjjj
|
ebef9a36e3
|
remove redundancy
|
2024-11-20 01:58:44 +00:00 |
|
zzhhjjj
|
2d6d9fb6b1
|
async TP + test
|
2024-11-20 01:55:02 +00:00 |
|
zzhhjjj
|
16d85cdb3a
|
mfu ref/typo
|
2024-11-18 17:57:02 +00:00 |
|
Ferdinand Mom
|
a2ce795837
|
Merge pull request #8 from huggingface/add_mfu
add mfu, get number of parameters
|
2024-11-18 13:45:32 -04:00 |
|
zzhhjjj
|
191f7425e1
|
add mfu, get number of parameters
|
2024-11-18 17:36:51 +00:00 |
|