ferdinand.mom
|
43f39ff9ec
|
Merge remote-tracking branch 'origin/main' into loading_big_model
|
2024-12-17 15:45:38 +00:00 |
|
ferdinand.mom
|
b0ea5066ad
|
small changes
|
2024-12-17 05:01:35 +00:00 |
|
Haojun Zhao
|
55efb321f9
|
Merge pull request #9 from huggingface/async_tp
Async tp
|
2024-12-14 07:24:35 -05:00 |
|
zzhhjjj
|
481eeb8377
|
stop iteration fix. recreate a new dataloader
|
2024-12-13 13:29:11 +00:00 |
|
ferdinand.mom
|
b390a0101e
|
add mfu parsing
|
2024-12-04 13:08:28 +00:00 |
|
ferdinand.mom
|
aaa4a083e9
|
remove clone() in tp communications as torch.compile will optimize this out anyway
|
2024-12-03 16:26:41 +00:00 |
|
ferdinand.mom
|
09dfd1676f
|
broadcast tokenizer to every rank as well
|
2024-12-03 14:20:44 +00:00 |
|
ferdinand.mom
|
00ddbd9d2e
|
raise Exception when not enough layers to distribute in rank + rename variable
|
2024-12-03 13:17:52 +00:00 |
|
ferdinand.mom
|
b80091e8ec
|
set 1f1b by default
|
2024-12-03 10:10:12 +00:00 |
|
ferdinand.mom
|
86a0fc5e3d
|
avoid OS memory error with num_workers > 1
|
2024-12-02 18:33:35 +00:00 |
|
ferdinand.mom
|
75939867d9
|
small fix on world_size with pgm
|
2024-12-02 18:12:02 +00:00 |
|
ferdinand.mom
|
b6267c768e
|
fix issue with too many ranks reading to HF library
|
2024-12-02 15:36:47 +00:00 |
|
Ferdinand Mom
|
52a9779345
|
Merge branch 'main' into loading_big_model
|
2024-12-01 16:34:43 -04:00 |
|
ferdinand.mom
|
a84a9d5942
|
now handle TP + PP meta device
|
2024-12-01 20:26:40 +00:00 |
|
ferdinand.mom
|
bccee5d037
|
clean long line of hyperparameters
|
2024-12-01 20:00:05 +00:00 |
|
ferdinand.mom
|
804f43c97e
|
more consistent naming
|
2024-12-01 19:45:11 +00:00 |
|
ferdinand.mom
|
32d8daa880
|
can now load big model through safetensors (sharded and single file)
|
2024-12-01 19:39:16 +00:00 |
|
ferdinand.mom
|
012aad3167
|
fix extract metrics
|
2024-12-01 03:43:04 +00:00 |
|
ferdinand.mom
|
3c6c1e3af1
|
add reset parameters for initialize_model_with_materialized_weights
|
2024-12-01 03:43:04 +00:00 |
|
ferdinand.mom
|
270c469531
|
refactor tensor parallel
|
2024-12-01 03:43:04 +00:00 |
|
ferdinand.mom
|
daea1fed3f
|
refactor checkpoint
|
2024-12-01 03:43:00 +00:00 |
|
ferdinand.mom
|
5045be87e0
|
wip: load big model with meta device
|
2024-11-29 16:38:42 +00:00 |
|
zzhhjjj
|
069a17237f
|
update wandb_log + set async default to true
|
2024-11-21 17:48:26 +00:00 |
|
zzhhjjj
|
ca1fcec87f
|
Merge branch 'main' into async_tp
|
2024-11-20 02:00:33 +00:00 |
|
zzhhjjj
|
ebef9a36e3
|
remove redundancy
|
2024-11-20 01:58:44 +00:00 |
|
zzhhjjj
|
2d6d9fb6b1
|
async TP + test
|
2024-11-20 01:55:02 +00:00 |
|
zzhhjjj
|
16d85cdb3a
|
mfu ref/typo
|
2024-11-18 17:57:02 +00:00 |
|
Ferdinand Mom
|
a2ce795837
|
Merge pull request #8 from huggingface/add_mfu
add mfu, get number of parameters
|
2024-11-18 13:45:32 -04:00 |
|
zzhhjjj
|
191f7425e1
|
add mfu, get number of parameters
|
2024-11-18 17:36:51 +00:00 |
|
Ferdinand Mom
|
099621fd94
|
Merge pull request #5 from huggingface/add-grad-acc-pp
Add gradient accumulation to PP + fix DP integration with PP (1f1b)
|
2024-11-04 21:14:04 +01:00 |
|
ferdinand.mom
|
a191212dda
|
Merge branch 'main' into add-grad-acc-pp
|
2024-11-04 18:42:40 +00:00 |
|
Ferdinand Mom
|
ccf2a0a4ac
|
Merge pull request #7 from huggingface/refactoring
Refactoring
|
2024-11-04 19:39:34 +01:00 |
|
ferdinand.mom
|
7f11b912aa
|
some fix
|
2024-11-04 16:57:00 +00:00 |
|
ferdinand.mom
|
bdaf0d1a1c
|
some cleaning in train
|
2024-11-04 16:54:49 +00:00 |
|
ferdinand.mom
|
b33a5c8e5d
|
better api when applying parallelism to train
|
2024-11-04 16:52:08 +00:00 |
|
ferdinand.mom
|
77e85fe490
|
split/merge into different files tp and dp
|
2024-11-04 16:26:11 +00:00 |
|
ferdinand.mom
|
db926026a6
|
folder refactoring + split cp & pp communications near implementation details
|
2024-11-04 16:10:47 +00:00 |
|
ferdinand.mom
|
1a000975be
|
move model to picotron folder
|
2024-11-04 15:37:37 +00:00 |
|
ferdinand.mom
|
243f088170
|
separate dataloader from utils to data.py
|
2024-11-04 15:36:01 +00:00 |
|
ferdinand.mom
|
a5706858e0
|
move utils to picotron
|
2024-11-04 15:33:39 +00:00 |
|
ferdinand.mom
|
8af19d0caa
|
picotron top level folder
|
2024-11-04 15:29:26 +00:00 |
|
ferdinand.mom
|
e7b4722160
|
remove unnecessary files
|
2024-11-04 15:27:53 +00:00 |
|
ferdinand.mom
|
a90a4d1f2e
|
Merge branch 'main' into add-grad-acc-pp
|
2024-11-04 15:07:10 +00:00 |
|
ferdinand.mom
|
41f49bb15f
|
rename to grad_steps
|
2024-11-04 15:06:29 +00:00 |
|
ferdinand.mom
|
37be871710
|
Merge branch 'main' into add-grad-acc-pp
|
2024-11-04 15:02:43 +00:00 |
|
ferdinand.mom
|
0bfc06506a
|
small changes unrelated to dp+pp sync grad fix
|
2024-11-04 15:00:43 +00:00 |
|
Ferdinand Mom
|
cecdafe515
|
Merge branch 'main' into add-grad-acc-pp
|
2024-11-04 15:56:31 +01:00 |
|
Ferdinand Mom
|
cce11da2cb
|
Merge pull request #6 from huggingface/pr1
various fix
|
2024-11-04 15:49:01 +01:00 |
|
ferdinand.mom
|
90868144a7
|
some dp renaming
|
2024-11-04 14:48:12 +00:00 |
|
ferdinand.mom
|
814e2a96ad
|
fix multi-node training by using global rank instead of local rank for dist.init_process_group
|
2024-11-04 14:48:03 +00:00 |
|