Ferdinand Mom
|
099621fd94
|
Merge pull request #5 from huggingface/add-grad-acc-pp
Add gradient accumulation to PP + fix DP integration with PP (1f1b)
|
2024-11-04 21:14:04 +01:00 |
|
ferdinand.mom
|
a191212dda
|
Merge branch 'main' into add-grad-acc-pp
|
2024-11-04 18:42:40 +00:00 |
|
Ferdinand Mom
|
ccf2a0a4ac
|
Merge pull request #7 from huggingface/refactoring
Refactoring
|
2024-11-04 19:39:34 +01:00 |
|
ferdinand.mom
|
7f11b912aa
|
some fix
|
2024-11-04 16:57:00 +00:00 |
|
ferdinand.mom
|
bdaf0d1a1c
|
some cleaning in train
|
2024-11-04 16:54:49 +00:00 |
|
ferdinand.mom
|
b33a5c8e5d
|
better api when applying parallelism to train
|
2024-11-04 16:52:08 +00:00 |
|
ferdinand.mom
|
77e85fe490
|
split/merge into different files tp and dp
|
2024-11-04 16:26:11 +00:00 |
|
ferdinand.mom
|
db926026a6
|
folder refactoring + split cp & pp communications near implementation details
|
2024-11-04 16:10:47 +00:00 |
|
ferdinand.mom
|
1a000975be
|
move model to picotron folder
|
2024-11-04 15:37:37 +00:00 |
|
ferdinand.mom
|
243f088170
|
separate dataloader from utils to data.py
|
2024-11-04 15:36:01 +00:00 |
|
ferdinand.mom
|
a5706858e0
|
move utils to picotron
|
2024-11-04 15:33:39 +00:00 |
|
ferdinand.mom
|
8af19d0caa
|
picotron top level folder
|
2024-11-04 15:29:26 +00:00 |
|
ferdinand.mom
|
e7b4722160
|
remove unnecessary files
|
2024-11-04 15:27:53 +00:00 |
|
ferdinand.mom
|
a90a4d1f2e
|
Merge branch 'main' into add-grad-acc-pp
|
2024-11-04 15:07:10 +00:00 |
|
ferdinand.mom
|
41f49bb15f
|
rename to grad_steps
|
2024-11-04 15:06:29 +00:00 |
|
ferdinand.mom
|
37be871710
|
Merge branch 'main' into add-grad-acc-pp
|
2024-11-04 15:02:43 +00:00 |
|
ferdinand.mom
|
0bfc06506a
|
small changes unrelated to dp+pp sync grad fix
|
2024-11-04 15:00:43 +00:00 |
|
Ferdinand Mom
|
cecdafe515
|
Merge branch 'main' into add-grad-acc-pp
|
2024-11-04 15:56:31 +01:00 |
|
Ferdinand Mom
|
cce11da2cb
|
Merge pull request #6 from huggingface/pr1
various fix
|
2024-11-04 15:49:01 +01:00 |
|
ferdinand.mom
|
90868144a7
|
some dp renaming
|
2024-11-04 14:48:12 +00:00 |
|
ferdinand.mom
|
814e2a96ad
|
fix multi-node training by using global rank instead of local rank for dist.init_process_group
|
2024-11-04 14:48:03 +00:00 |
|
ferdinand.mom
|
a44f905254
|
set num workers to 1 for now to avoid os memory error
|
2024-11-04 14:39:52 +00:00 |
|
ferdinand.mom
|
e19f74b715
|
add option for HF token
|
2024-11-04 14:39:12 +00:00 |
|
ferdinand.mom
|
7bfdf5f7d1
|
add fuse adam
|
2024-11-04 14:35:36 +00:00 |
|
ferdinand.mom
|
9d4f0ee4ff
|
fix requirements to avoid drop in throughput
|
2024-11-04 14:33:07 +00:00 |
|
ferdinand.mom
|
519b506b2b
|
add option to switch between pp engine
|
2024-11-04 14:32:44 +00:00 |
|
ferdinand.mom
|
7c381a61eb
|
lower timeout in train
|
2024-11-04 14:28:01 +00:00 |
|
ferdinand.mom
|
4e1a6f8cdd
|
set num worker to 1 otherwise OS memory error
|
2024-11-04 14:27:50 +00:00 |
|
ferdinand.mom
|
8e36bbe032
|
fix multi-node training by using global rank instead of local rank to init process_group
|
2024-11-03 00:14:14 +00:00 |
|
ferdinand.mom
|
b60b94b45f
|
fix requirements to avoid drop in throughput
|
2024-11-02 02:18:49 +00:00 |
|
ferdinand.mom
|
8741c9b167
|
change distributed option to pass to multi-node
|
2024-11-02 02:18:49 +00:00 |
|
ferdinand.mom
|
bd6b8a0972
|
add hf token + fix multi-node training with torchrun args
|
2024-11-02 02:18:40 +00:00 |
|
ferdinand.mom
|
486c1763a6
|
add fuse adam
|
2024-11-02 01:38:14 +00:00 |
|
ferdinand.mom
|
7996a318c1
|
fix DP integration within PP (1f1b)
|
2024-11-01 20:08:48 +00:00 |
|
ferdinand.mom
|
2bafa3117d
|
renaming + add option to switch between pp_engine (afab or 1f1b)
|
2024-11-01 20:07:57 +00:00 |
|
ferdinand.mom
|
f6c9a39d17
|
fix splitting input twice for context parallel (done in dataloader)
|
2024-10-30 15:43:42 +00:00 |
|
ferdinand.mom
|
363dbd5c05
|
need to update max position embedding when seq_len is greater (for rope)
|
2024-10-30 15:12:06 +00:00 |
|
ferdinand.mom
|
508d57f948
|
don't hardcode path
|
2024-10-30 14:58:41 +00:00 |
|
ferdinand.mom
|
f74bff79e0
|
cleaning
|
2024-10-30 14:58:41 +00:00 |
|
ferdinand.mom
|
2d198659e2
|
add slurm support
|
2024-10-30 14:58:41 +00:00 |
|
ferdinand.mom
|
fdf2df8344
|
add wandb
|
2024-10-30 14:58:41 +00:00 |
|
ferdinand.mom
|
3c635092f9
|
add assert in TensorParallel for num_attention_heads and key_values_heads
|
2024-10-30 14:58:41 +00:00 |
|
ferdinand.mom
|
1dbe034d57
|
better config creation
|
2024-10-30 14:58:41 +00:00 |
|
ferdinand.mom
|
402aa4ccfc
|
small change
|
2024-10-30 14:58:41 +00:00 |
|
zzhhjjj
|
f1f6915ba1
|
1f1b fix
|
2024-10-30 14:58:41 +00:00 |
|
zzhhjjj
|
c7a3fb016a
|
disable grad sync in afab
|
2024-10-30 14:58:40 +00:00 |
|
ferdinand.mom
|
47c00be8c7
|
breaking: add slurm stuff
|
2024-10-29 15:44:35 +00:00 |
|
ferdinand.mom
|
987a7c5c99
|
add todo ring attention
|
2024-10-29 14:18:07 +00:00 |
|
ferdinand.mom
|
46af5b0425
|
some fixes
|
2024-10-29 14:17:42 +00:00 |
|
zzhhjjj
|
b7f3e253be
|
add context parallel
|
2024-10-29 13:42:38 +00:00 |
|