Commit Graph

76 Commits

Author SHA1 Message Date
ferdinand.mom
b647f58289 fix stuff to make it CPU compliant 2024-12-18 16:50:36 +00:00
ferdinand.mom
75cd0d77f9 download safetensors at config creation time. If we do it during training, barrier() may timeout while waiting for download 2024-12-17 15:46:16 +00:00
ferdinand.mom
b57b8277d1 breaking: add new version of init meta device but memory leaks 2024-12-17 15:46:16 +00:00
ferdinand.mom
859650a2c0 breaking: refactor loading big model to only download safetensors files 2024-12-17 15:46:09 +00:00
ferdinand.mom
43f39ff9ec Merge remote-tracking branch 'origin/main' into loading_big_model 2024-12-17 15:45:38 +00:00
ferdinand.mom
09dfd1676f broadcast tokenizer to every rank as well 2024-12-03 14:20:44 +00:00
ferdinand.mom
00ddbd9d2e raise Exception when not enough layers to distribute across ranks + rename variable 2024-12-03 13:17:52 +00:00
ferdinand.mom
75939867d9 small fix on world_size with pgm 2024-12-02 18:12:02 +00:00
ferdinand.mom
b6267c768e fix issue with too many ranks reading from HF library 2024-12-02 15:36:47 +00:00
Ferdinand Mom
52a9779345 Merge branch 'main' into loading_big_model 2024-12-01 16:34:43 -04:00
ferdinand.mom
a84a9d5942 now handle TP + PP meta device 2024-12-01 20:26:40 +00:00
ferdinand.mom
bccee5d037 clean long line of hyperparameters 2024-12-01 20:00:05 +00:00
ferdinand.mom
804f43c97e more consistent naming 2024-12-01 19:45:11 +00:00
ferdinand.mom
32d8daa880 can now load big model through safetensors (sharded and single file) 2024-12-01 19:39:16 +00:00
ferdinand.mom
3c6c1e3af1 add reset parameters for initialize_model_with_materialized_weights 2024-12-01 03:43:04 +00:00
ferdinand.mom
daea1fed3f refactor checkpoint 2024-12-01 03:43:00 +00:00
ferdinand.mom
5045be87e0 wip: load big model with meta device 2024-11-29 16:38:42 +00:00
zzhhjjj
069a17237f update wandb_log + set async default to true 2024-11-21 17:48:26 +00:00
zzhhjjj
191f7425e1 add mfu, get number of parameters 2024-11-18 17:36:51 +00:00
ferdinand.mom
7f11b912aa some fix 2024-11-04 16:57:00 +00:00
ferdinand.mom
bdaf0d1a1c some cleaning in train 2024-11-04 16:54:49 +00:00
ferdinand.mom
b33a5c8e5d better api when applying parallelism to train 2024-11-04 16:52:08 +00:00
ferdinand.mom
77e85fe490 split/merge into different files tp and dp 2024-11-04 16:26:11 +00:00
ferdinand.mom
db926026a6 folder refactoring + split cp & pp communications near implementation details 2024-11-04 16:10:47 +00:00
ferdinand.mom
243f088170 separate dataloader from utils to data.py 2024-11-04 15:36:01 +00:00
ferdinand.mom
a5706858e0 move utils to picotron 2024-11-04 15:33:39 +00:00
ferdinand.mom
8af19d0caa picotron top level folder 2024-11-04 15:29:26 +00:00
ferdinand.mom
41f49bb15f rename to grad_steps 2024-11-04 15:06:29 +00:00
ferdinand.mom
0bfc06506a small changes unrelated to dp+pp sync grad fix 2024-11-04 15:00:43 +00:00
ferdinand.mom
814e2a96ad fix multi-node training by using global rank instead of local rank for dist.init_process_group 2024-11-04 14:48:03 +00:00
ferdinand.mom
7bfdf5f7d1 add fuse adam 2024-11-04 14:35:36 +00:00
ferdinand.mom
519b506b2b add option to switch between pp engine 2024-11-04 14:32:44 +00:00
ferdinand.mom
f6c9a39d17 fix splitting input twice for context parallel (done in dataloader) 2024-10-30 15:43:42 +00:00
ferdinand.mom
363dbd5c05 need to update max position embedding when seq_len is greater (for rope) 2024-10-30 15:12:06 +00:00
ferdinand.mom
fdf2df8344 add wandb 2024-10-30 14:58:41 +00:00
ferdinand.mom
1dbe034d57 better config creation 2024-10-30 14:58:41 +00:00
zzhhjjj
c7a3fb016a disable grad sync in afab 2024-10-30 14:58:40 +00:00
ferdinand.mom
47c00be8c7 breaking: add slurm stuff 2024-10-29 15:44:35 +00:00
ferdinand.mom
46af5b0425 some fixes 2024-10-29 14:17:42 +00:00
zzhhjjj
b7f3e253be add context parallel 2024-10-29 13:42:38 +00:00
zzhhjjj
6220892716 refactor 2024-10-28 20:44:15 +00:00
zzhhjjj
5181c4cd87 typo 2024-10-28 11:02:38 +00:00
zzhhjjj
a17ddb691f gradient accumulation 2024-10-28 07:46:23 +00:00
zzhhjjj
2f8c87f4d1 save/load weights 2024-10-28 05:19:59 +00:00
zzhhjjj
762127afcd some logs, will clean later 2024-10-27 02:22:36 +00:00
zzhhjjj
63307c79a1 add some logs, refactor dataloader 2024-10-23 00:38:27 +00:00
zzhhjjj
ec1e1e5ccf support bf16, all reduce loss 2024-10-22 23:38:44 +00:00
zzhhjjj
a6d79b07b5 add cuda kernels 2024-10-22 22:38:29 +00:00
zzhhjjj
9a7904d5d6 revert some change 2024-10-22 19:50:23 +00:00
ferdinand.mom
2b2781a374 made Tensor Parallel API compliant 2024-10-18 15:51:26 +00:00