ferdinand.mom
|
b647f58289
|
fix stuff to make it CPU compliant
|
2024-12-18 16:50:36 +00:00 |
|
ferdinand.mom
|
75cd0d77f9
|
download safetensors at config creation time. If we do it in training, barrier() may time out while waiting for download
|
2024-12-17 15:46:16 +00:00 |
|
ferdinand.mom
|
b57b8277d1
|
breaking: add new version of init meta device, but it leaks memory
|
2024-12-17 15:46:16 +00:00 |
|
ferdinand.mom
|
859650a2c0
|
breaking: refactor loading big model to only download safetensors files
|
2024-12-17 15:46:09 +00:00 |
|
ferdinand.mom
|
43f39ff9ec
|
Merge remote-tracking branch 'origin/main' into loading_big_model
|
2024-12-17 15:45:38 +00:00 |
|
ferdinand.mom
|
09dfd1676f
|
broadcast tokenizer to every rank as well
|
2024-12-03 14:20:44 +00:00 |
|
ferdinand.mom
|
00ddbd9d2e
|
raise Exception when not enough layers to distribute across ranks + rename variable
|
2024-12-03 13:17:52 +00:00 |
|
ferdinand.mom
|
75939867d9
|
small fix on world_size with pgm
|
2024-12-02 18:12:02 +00:00 |
|
ferdinand.mom
|
b6267c768e
|
fix issue with too many ranks reading from HF library
|
2024-12-02 15:36:47 +00:00 |
|
Ferdinand Mom
|
52a9779345
|
Merge branch 'main' into loading_big_model
|
2024-12-01 16:34:43 -04:00 |
|
ferdinand.mom
|
a84a9d5942
|
now handle TP + PP meta device
|
2024-12-01 20:26:40 +00:00 |
|
ferdinand.mom
|
bccee5d037
|
clean long line of hyperparameters
|
2024-12-01 20:00:05 +00:00 |
|
ferdinand.mom
|
804f43c97e
|
more consistent naming
|
2024-12-01 19:45:11 +00:00 |
|
ferdinand.mom
|
32d8daa880
|
can now load big model through safetensors (sharded and single file)
|
2024-12-01 19:39:16 +00:00 |
|
ferdinand.mom
|
3c6c1e3af1
|
add reset parameters for initialize_model_with_materialized_weights
|
2024-12-01 03:43:04 +00:00 |
|
ferdinand.mom
|
daea1fed3f
|
refactor checkpoint
|
2024-12-01 03:43:00 +00:00 |
|
ferdinand.mom
|
5045be87e0
|
wip: load big model with meta device
|
2024-11-29 16:38:42 +00:00 |
|
zzhhjjj
|
069a17237f
|
update wandb_log + set async default to true
|
2024-11-21 17:48:26 +00:00 |
|
zzhhjjj
|
191f7425e1
|
add mfu, get number of parameters
|
2024-11-18 17:36:51 +00:00 |
|
ferdinand.mom
|
7f11b912aa
|
some fix
|
2024-11-04 16:57:00 +00:00 |
|
ferdinand.mom
|
bdaf0d1a1c
|
some cleaning in train
|
2024-11-04 16:54:49 +00:00 |
|
ferdinand.mom
|
b33a5c8e5d
|
better api when applying parallelism to train
|
2024-11-04 16:52:08 +00:00 |
|
ferdinand.mom
|
77e85fe490
|
split/merge into different files tp and dp
|
2024-11-04 16:26:11 +00:00 |
|
ferdinand.mom
|
db926026a6
|
folder refactoring + split cp & pp communications near implementation details
|
2024-11-04 16:10:47 +00:00 |
|
ferdinand.mom
|
243f088170
|
separate dataloader from utils to data.py
|
2024-11-04 15:36:01 +00:00 |
|
ferdinand.mom
|
a5706858e0
|
move utils to picotron
|
2024-11-04 15:33:39 +00:00 |
|
ferdinand.mom
|
8af19d0caa
|
picotron top level folder
|
2024-11-04 15:29:26 +00:00 |
|
ferdinand.mom
|
41f49bb15f
|
rename to grad_steps
|
2024-11-04 15:06:29 +00:00 |
|
ferdinand.mom
|
0bfc06506a
|
small changes unrelated to dp+pp sync grad fix
|
2024-11-04 15:00:43 +00:00 |
|
ferdinand.mom
|
814e2a96ad
|
fix multi-node training by using global rank instead of local rank for dist.init_process_group
|
2024-11-04 14:48:03 +00:00 |
|
ferdinand.mom
|
7bfdf5f7d1
|
add fused adam
|
2024-11-04 14:35:36 +00:00 |
|
ferdinand.mom
|
519b506b2b
|
add option to switch between pp engine
|
2024-11-04 14:32:44 +00:00 |
|
ferdinand.mom
|
f6c9a39d17
|
fix splitting input twice for context parallel (done in dataloader)
|
2024-10-30 15:43:42 +00:00 |
|
ferdinand.mom
|
363dbd5c05
|
need to update max position embedding when seq_len is greater (for rope)
|
2024-10-30 15:12:06 +00:00 |
|
ferdinand.mom
|
fdf2df8344
|
add wandb
|
2024-10-30 14:58:41 +00:00 |
|
ferdinand.mom
|
1dbe034d57
|
better config creation
|
2024-10-30 14:58:41 +00:00 |
|
zzhhjjj
|
c7a3fb016a
|
disable grad sync in afab
|
2024-10-30 14:58:40 +00:00 |
|
ferdinand.mom
|
47c00be8c7
|
breaking: add slurm stuff
|
2024-10-29 15:44:35 +00:00 |
|
ferdinand.mom
|
46af5b0425
|
some fixes
|
2024-10-29 14:17:42 +00:00 |
|
zzhhjjj
|
b7f3e253be
|
add context parallel
|
2024-10-29 13:42:38 +00:00 |
|
zzhhjjj
|
6220892716
|
refactor
|
2024-10-28 20:44:15 +00:00 |
|
zzhhjjj
|
5181c4cd87
|
typo
|
2024-10-28 11:02:38 +00:00 |
|
zzhhjjj
|
a17ddb691f
|
gradient accumulation
|
2024-10-28 07:46:23 +00:00 |
|
zzhhjjj
|
2f8c87f4d1
|
save/load weights
|
2024-10-28 05:19:59 +00:00 |
|
zzhhjjj
|
762127afcd
|
some logs, will clean later
|
2024-10-27 02:22:36 +00:00 |
|
zzhhjjj
|
63307c79a1
|
add some logs, refactor dataloader
|
2024-10-23 00:38:27 +00:00 |
|
zzhhjjj
|
ec1e1e5ccf
|
support bf16, all reduce loss
|
2024-10-22 23:38:44 +00:00 |
|
zzhhjjj
|
a6d79b07b5
|
add cuda kernels
|
2024-10-22 22:38:29 +00:00 |
|
zzhhjjj
|
9a7904d5d6
|
revert some change
|
2024-10-22 19:50:23 +00:00 |
|
ferdinand.mom
|
2b2781a374
|
made Tensor Parallel API compliant
|
2024-10-18 15:51:26 +00:00 |
|