ferdinand.mom
|
daea1fed3f
|
refactor checkpoint
|
2024-12-01 03:43:00 +00:00 |
|
ferdinand.mom
|
5045be87e0
|
wip: load big model with meta device
|
2024-11-29 16:38:42 +00:00 |
|
ferdinand.mom
|
7f11b912aa
|
some fix
|
2024-11-04 16:57:00 +00:00 |
|
ferdinand.mom
|
bdaf0d1a1c
|
some cleaning in train
|
2024-11-04 16:54:49 +00:00 |
|
ferdinand.mom
|
b33a5c8e5d
|
better api when applying parallelism to train
|
2024-11-04 16:52:08 +00:00 |
|
ferdinand.mom
|
77e85fe490
|
split/merge into different files tp and dp
|
2024-11-04 16:26:11 +00:00 |
|
ferdinand.mom
|
db926026a6
|
folder refactoring + split cp & pp communications near implementation details
|
2024-11-04 16:10:47 +00:00 |
|
ferdinand.mom
|
243f088170
|
separate dataloader from utils to data.py
|
2024-11-04 15:36:01 +00:00 |
|
ferdinand.mom
|
a5706858e0
|
move utils to picotron
|
2024-11-04 15:33:39 +00:00 |
|
ferdinand.mom
|
8af19d0caa
|
picotron top level folder
|
2024-11-04 15:29:26 +00:00 |
|
ferdinand.mom
|
41f49bb15f
|
rename to grad_steps
|
2024-11-04 15:06:29 +00:00 |
|
ferdinand.mom
|
0bfc06506a
|
small changes unrelated to dp+pp sync grad fix
|
2024-11-04 15:00:43 +00:00 |
|
ferdinand.mom
|
814e2a96ad
|
fix multi-node training by using global rank instead of local rank for dist.init_process_group
|
2024-11-04 14:48:03 +00:00 |
|
ferdinand.mom
|
7bfdf5f7d1
|
add fused adam
|
2024-11-04 14:35:36 +00:00 |
|
ferdinand.mom
|
519b506b2b
|
add option to switch between pp engine
|
2024-11-04 14:32:44 +00:00 |
|
ferdinand.mom
|
f6c9a39d17
|
fix splitting input twice for context parallel (already done in dataloader)
|
2024-10-30 15:43:42 +00:00 |
|
ferdinand.mom
|
363dbd5c05
|
need to update max position embedding when seq_len is greater (for rope)
|
2024-10-30 15:12:06 +00:00 |
|
ferdinand.mom
|
fdf2df8344
|
add wandb
|
2024-10-30 14:58:41 +00:00 |
|
ferdinand.mom
|
1dbe034d57
|
better config creation
|
2024-10-30 14:58:41 +00:00 |
|
zzhhjjj
|
c7a3fb016a
|
disable grad sync in afab
|
2024-10-30 14:58:40 +00:00 |
|
ferdinand.mom
|
47c00be8c7
|
breaking: add slurm stuff
|
2024-10-29 15:44:35 +00:00 |
|
ferdinand.mom
|
46af5b0425
|
some fixes
|
2024-10-29 14:17:42 +00:00 |
|
zzhhjjj
|
b7f3e253be
|
add context parallel
|
2024-10-29 13:42:38 +00:00 |
|
zzhhjjj
|
6220892716
|
refactor
|
2024-10-28 20:44:15 +00:00 |
|
zzhhjjj
|
5181c4cd87
|
typo
|
2024-10-28 11:02:38 +00:00 |
|
zzhhjjj
|
a17ddb691f
|
gradient accumulation
|
2024-10-28 07:46:23 +00:00 |
|
zzhhjjj
|
2f8c87f4d1
|
save/load weights
|
2024-10-28 05:19:59 +00:00 |
|
zzhhjjj
|
762127afcd
|
some logs, will clean later
|
2024-10-27 02:22:36 +00:00 |
|
zzhhjjj
|
63307c79a1
|
add some logs, refactor dataloader
|
2024-10-23 00:38:27 +00:00 |
|
zzhhjjj
|
ec1e1e5ccf
|
support bf16, all reduce loss
|
2024-10-22 23:38:44 +00:00 |
|
zzhhjjj
|
a6d79b07b5
|
add cuda kernels
|
2024-10-22 22:38:29 +00:00 |
|
zzhhjjj
|
9a7904d5d6
|
revert some change
|
2024-10-22 19:50:23 +00:00 |
|
ferdinand.mom
|
2b2781a374
|
made Tensor Parallel API compliant
|
2024-10-18 15:51:26 +00:00 |
|
ferdinand.mom
|
abd1edf9f9
|
all_reduce loss across pp/dp ranks + base_parallel
|
2024-10-18 15:51:17 +00:00 |
|
ferdinand.mom
|
1ebd3de5be
|
Merge DDP + TP from @zzhhjjj
|
2024-10-18 15:05:01 +00:00 |
|
ferdinand.mom
|
83ddda2ce8
|
leave out CP integration at the very end
|
2024-10-18 14:59:39 +00:00 |
|
ferdinand.mom
|
0b1d02a402
|
various fix (modeling, dataloader, cpu load)
|
2024-10-18 14:33:46 +00:00 |
|
zzhhjjj
|
7377238741
|
tensor parallel, will clean later
|
2024-10-18 05:13:44 +00:00 |
|
zzhhjjj
|
54ad77e055
|
Merge branch 'main' into ddp-merge
|
2024-10-16 19:13:48 +00:00 |
|
zzhhjjj
|
24ff8d05fd
|
add DDP
|
2024-10-16 16:48:55 +00:00 |
|
zzhhjjj
|
6080c2f26b
|
dataloader
|
2024-10-16 15:58:35 +00:00 |
|
ferdinand.mom
|
81726dfffe
|
accelerate dataset mapping
|
2024-10-15 13:32:44 +00:00 |
|
ferdinand.mom
|
1ca7365506
|
stitch cp dp cp together
|
2024-10-15 13:06:17 +00:00 |
|
ferdinand.mom
|
ffea3d2ad1
|
add context parallel for training
|
2024-10-15 12:43:28 +00:00 |
|
ferdinand.mom
|
1e229cae88
|
renaming
|
2024-10-14 09:26:31 +00:00 |
|
ferdinand.mom
|
3095ff4d4f
|
refactor organisation
|
2024-10-10 15:12:14 +00:00 |
|
ferdinand.mom
|
47581d29e9
|
make new modeling compatible with training
|
2024-10-10 15:08:23 +00:00 |
|
ferdinand.mom
|
8c155f47ce
|
all reduce gradient across DP & CP ranks
|
2024-09-26 14:00:06 +00:00 |
|
ferdinand.mom
|
31b5fb9efc
|
ugly ass display of grid (to be changed)
|
2024-09-26 13:45:53 +00:00 |
|
ferdinand.mom
|
b8065de7aa
|
support CPU training through gloo backend
|
2024-09-26 10:27:20 +00:00 |
|