Commit Graph

  • df3ae8a5f0 Update README.md (main) Ferdinand Mom 2024-12-20 13:56:34 +0100
  • bf03420686 Update README.md Ferdinand Mom 2024-12-19 09:45:01 +0100
  • 164ab81e27 Update README.md Ferdinand Mom 2024-12-19 09:24:14 +0100
  • 78ba56ce80 Merge pull request #11 from eliebak/patch-1 Haojun Zhao 2024-12-19 03:19:53 -0500
  • 009bb0b2a8 Update LICENSE elie 2024-12-19 09:14:13 +0100
  • 59b841e3cb Create LICENSE Ferdinand Mom 2024-12-19 08:57:47 +0100
  • 7ef2344cd4 refactor zzhhjjj 2024-12-19 07:05:16 +0000
  • d855aead9e readme zzhhjjj 2024-12-19 06:31:03 +0000
  • e82c719f31 Update Readme zzhhjjj 2024-12-19 06:04:04 +0000
  • a727b986cb refactor zzhhjjj 2024-12-19 05:48:29 +0000
  • 439e23fdba Merge pull request #10 from huggingface/loading_big_model Haojun Zhao 2024-12-19 00:29:16 -0500
  • 7daefd31ee Merge branch 'main' into loading_big_model ferdinand.mom 2024-12-18 17:02:48 +0000
  • b647f58289 fix stuff to make it CPU compliant ferdinand.mom 2024-12-18 16:50:36 +0000
  • b49ddac4b4 add tqdm when subprocess ferdinand.mom 2024-12-18 16:07:34 +0000
  • fc3b50b033 small change for dataloader arguments zzhhjjj 2024-12-18 15:55:55 +0000
  • f1053e3cbe revert to using the huggingface CLI + hf_transfer (downloading through the CLI does not create the snapshots/blobs folder structure) ferdinand.mom 2024-12-18 15:51:04 +0000
  • 0360ec0d2a use hf_transfer, which improves download time by ~3x ferdinand.mom 2024-12-18 14:51:14 +0000
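
For context, hf_transfer is the optional Rust-based download backend for huggingface_hub. A minimal sketch of enabling it (the repo id is a placeholder, not necessarily what picotron downloads):

```python
# Sketch: enable hf_transfer (requires `pip install hf_transfer`).
# The env var must be set before huggingface_hub is imported.
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

from huggingface_hub import snapshot_download

# Fetch only the weight shards and configs of a placeholder repo.
snapshot_download(
    repo_id="HuggingFaceTB/SmolLM-1.7B",  # hypothetical example repo
    allow_patterns=["*.safetensors", "*.json"],
)
```
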
  • 86c9b91d02 Revert to @zzhhjjj class naming as it is more expressive ferdinand.mom 2024-12-17 15:55:18 +0000
  • 75cd0d77f9 download safetensors at config-creation time; if we do it during training, barrier() may time out while waiting for the download ferdinand.mom 2024-12-17 15:41:00 +0000
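
The timeout issue above is the classic one: if every rank sits in dist.barrier() while rank 0 is still downloading, the default collective timeout can expire. A minimal sketch of the prefetch pattern, assuming a shared HF cache across ranks (the helper name is ours, not picotron's):

```python
import torch.distributed as dist
from huggingface_hub import snapshot_download

def prefetch_safetensors(repo_id: str) -> str:
    # Rank 0 warms the shared cache; everyone else waits at the barrier.
    if dist.get_rank() == 0:
        snapshot_download(repo_id, allow_patterns=["*.safetensors"])
    dist.barrier()  # done at config time, long before training collectives run
    # The cache is now warm, so this resolves locally on every rank.
    return snapshot_download(repo_id, allow_patterns=["*.safetensors"])
```
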
  • b57b8277d1 breaking: add new version of init on meta device, but it leaks memory ferdinand.mom 2024-12-17 12:52:44 +0000
  • 859650a2c0 breaking: refactor loading big model to only download safetensors files ferdinand.mom 2024-12-17 09:13:44 +0000
  • 43f39ff9ec Merge remote-tracking branch 'origin/main' into loading_big_model ferdinand.mom 2024-12-17 05:30:26 +0000
  • b0ea5066ad small changes ferdinand.mom 2024-12-17 05:01:35 +0000
  • 55efb321f9 Merge pull request #9 from huggingface/async_tp Haojun Zhao 2024-12-14 07:24:35 -0500
  • 481eeb8377 StopIteration fix: recreate a new dataloader zzhhjjj 2024-12-13 13:29:11 +0000
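
A common shape for that fix, sketched here with illustrative names: catch the exhausted iterator and rebuild it so the training loop keeps stepping across epoch boundaries.

```python
def next_batch(dataloader, iterator):
    # Returns (iterator, batch); rebuilds the iterator when the epoch ends.
    try:
        return iterator, next(iterator)
    except StopIteration:
        iterator = iter(dataloader)  # recreate for a new epoch
        return iterator, next(iterator)
```
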
  • b390a0101e add mfu parsing ferdinand.mom 2024-12-04 13:08:28 +0000
  • aaa4a083e9 remove clone() in tp communications as torch.compile will optimize this out anyway ferdinand.mom 2024-12-03 16:26:41 +0000
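
The clone() in question typically lives in a Megatron-style communication op like the generic reconstruction below (not picotron's exact code): forward is the identity, backward all-reduces the gradient, and returning the input uncopied is what torch.compile would reduce it to anyway.

```python
import torch
import torch.distributed as dist

class CopyToModelParallelRegion(torch.autograd.Function):
    """Identity in forward, all-reduce of the gradient in backward."""

    @staticmethod
    def forward(ctx, x):
        return x  # previously `x.clone()`; the copy buys nothing here

    @staticmethod
    def backward(ctx, grad_output):
        dist.all_reduce(grad_output)
        return grad_output
```
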
  • 09dfd1676f broadcast tokenizer to every rank as well ferdinand.mom 2024-12-03 14:20:44 +0000
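
One way to do this, sketched under the assumption that only rank 0 has Hub access (the helper name is ours): load the tokenizer once, then ship the pickled object with dist.broadcast_object_list.

```python
import torch.distributed as dist
from transformers import AutoTokenizer

def load_tokenizer_everywhere(name: str):
    # Only rank 0 touches the Hub; the object is broadcast to the rest.
    holder = [AutoTokenizer.from_pretrained(name) if dist.get_rank() == 0 else None]
    dist.broadcast_object_list(holder, src=0)
    return holder[0]
```
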
  • 00ddbd9d2e raise Exception when there are not enough layers to distribute across ranks + rename variable ferdinand.mom 2024-12-03 13:17:52 +0000
  • b80091e8ec set 1f1b by default ferdinand.mom 2024-12-03 10:10:12 +0000
  • 86a0fc5e3d avoid OS memory error with num_workers > 1 ferdinand.mom 2024-12-02 18:33:35 +0000
  • 75939867d9 small fix on world_size with pgm ferdinand.mom 2024-12-02 18:12:02 +0000
  • b6267c768e fix issue with too many ranks reading from the HF library ferdinand.mom 2024-12-02 15:36:47 +0000
  • 52a9779345 Merge branch 'main' into loading_big_model Ferdinand Mom 2024-12-01 16:34:43 -0400
  • a84a9d5942 now handle TP + PP meta device ferdinand.mom 2024-12-01 20:26:40 +0000
  • bccee5d037 clean long line of hyperparameters ferdinand.mom 2024-12-01 20:00:05 +0000
  • 804f43c97e more consistent naming ferdinand.mom 2024-12-01 19:45:11 +0000
  • 32d8daa880 can now load big model through safetensors (sharded and single file) ferdinand.mom 2024-12-01 19:39:16 +0000
  • 012aad3167 fix extract metrics ferdinand.mom 2024-12-01 03:42:14 +0000
  • 3c6c1e3af1 add reset parameters for initialize_model_with_materialized_weights ferdinand.mom 2024-12-01 03:42:04 +0000
  • 270c469531 refactor tensor parallel ferdinand.mom 2024-12-01 03:41:15 +0000
  • daea1fed3f refactor checkpoint ferdinand.mom 2024-12-01 03:40:56 +0000
  • 5045be87e0 wip: load big model with meta device ferdinand.mom 2024-11-29 16:38:42 +0000
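
For reference, the meta-device technique this branch builds toward looks roughly like this (a generic sketch, not the commit's code; the shard filename is a placeholder): construct the module with no real storage, then allocate and fill weights from safetensors files.

```python
import torch
import torch.nn as nn
from safetensors.torch import load_file

with torch.device("meta"):
    model = nn.Linear(8192, 8192)  # parameters exist but own no memory

model = model.to_empty(device="cpu")  # allocate uninitialized storage
# Fill parameters from a (hypothetical) safetensors shard:
state = load_file("model-00001-of-00002.safetensors")
model.load_state_dict(state, strict=False)
```
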
  • 069a17237f update wandb_log + set async default to true zzhhjjj 2024-11-21 17:48:26 +0000
  • ca1fcec87f Merge branch 'main' into async_tp zzhhjjj 2024-11-20 02:00:33 +0000
  • ebef9a36e3 remove redundancy zzhhjjj 2024-11-20 01:58:44 +0000
  • 2d6d9fb6b1 async TP + test zzhhjjj 2024-11-20 01:55:02 +0000
  • 16d85cdb3a mfu ref/typo zzhhjjj 2024-11-18 17:57:02 +0000
  • a2ce795837 Merge pull request #8 from huggingface/add_mfu Ferdinand Mom 2024-11-18 13:45:32 -0400
  • 191f7425e1 add mfu, get number of parameters zzhhjjj 2024-11-18 17:36:51 +0000
  • 099621fd94 Merge pull request #5 from huggingface/add-grad-acc-pp Ferdinand Mom 2024-11-04 21:14:04 +0100
  • a191212dda Merge branch 'main' into add-grad-acc-pp ferdinand.mom 2024-11-04 18:42:40 +0000
  • ccf2a0a4ac Merge pull request #7 from huggingface/refactoring Ferdinand Mom 2024-11-04 19:39:34 +0100
  • 7f11b912aa some fix ferdinand.mom 2024-11-04 16:57:00 +0000
  • bdaf0d1a1c some cleaning in train ferdinand.mom 2024-11-04 16:54:49 +0000
  • b33a5c8e5d better api when applying parallelism to train ferdinand.mom 2024-11-04 16:52:08 +0000
  • 77e85fe490 split/merge into different files tp and dp ferdinand.mom 2024-11-04 16:26:11 +0000
  • db926026a6 folder refactoring + split cp & pp communications near implementation details ferdinand.mom 2024-11-04 16:10:47 +0000
  • 1a000975be move model to picotron folder ferdinand.mom 2024-11-04 15:37:37 +0000
  • 243f088170 separate dataloader from utils to data.py ferdinand.mom 2024-11-04 15:36:01 +0000
  • a5706858e0 move utils to picotron ferdinand.mom 2024-11-04 15:32:20 +0000
  • 8af19d0caa picotron top level folder ferdinand.mom 2024-11-04 15:29:26 +0000
  • e7b4722160 remove unnecessary files ferdinand.mom 2024-11-04 15:27:53 +0000
  • a90a4d1f2e Merge branch 'main' into add-grad-acc-pp ferdinand.mom 2024-11-04 15:07:10 +0000
  • 41f49bb15f rename to grad_steps ferdinand.mom 2024-11-04 15:06:29 +0000
  • 37be871710 Merge branch 'main' into add-grad-acc-pp ferdinand.mom 2024-11-04 15:02:43 +0000
  • 0bfc06506a small changes unrelated to dp+pp sync grad fix ferdinand.mom 2024-11-04 15:00:43 +0000
  • cecdafe515 Merge branch 'main' into add-grad-acc-pp Ferdinand Mom 2024-11-04 15:56:31 +0100
  • cce11da2cb Merge pull request #6 from huggingface/pr1 Ferdinand Mom 2024-11-04 15:49:01 +0100
  • 90868144a7 some dp renaming ferdinand.mom 2024-11-04 14:41:11 +0000
  • 814e2a96ad fix multi-node training by using global rank instead of local rank for dist.init_process_group ferdinand.mom 2024-11-04 14:40:54 +0000
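
The bug class here is worth spelling out: under torchrun, LOCAL_RANK restarts at 0 on every node, so using it as the process-group rank makes ranks collide across nodes. A minimal sketch of the corrected setup:

```python
import os
import torch
import torch.distributed as dist

global_rank = int(os.environ["RANK"])        # unique across all nodes
local_rank = int(os.environ["LOCAL_RANK"])   # unique only within a node
world_size = int(os.environ["WORLD_SIZE"])

dist.init_process_group("nccl", rank=global_rank, world_size=world_size)
torch.cuda.set_device(local_rank)  # LOCAL_RANK still selects the GPU
```
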
  • a44f905254 set num workers to 1 for now to avoid os memory error ferdinand.mom 2024-11-04 14:39:52 +0000
  • e19f74b715 add option for HF token ferdinand.mom 2024-11-04 14:39:12 +0000
  • 7bfdf5f7d1 add fuse adam ferdinand.mom 2024-11-04 14:35:36 +0000
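
"Fuse adam" likely refers to PyTorch's fused optimizer path, which performs the whole parameter update in one multi-tensor CUDA kernel instead of one launch per tensor; a sketch:

```python
import torch
import torch.nn as nn

model = nn.Linear(1024, 1024, device="cuda")
# fused=True requires CUDA tensors; it fuses the AdamW update into a
# single multi-tensor kernel, cutting per-step launch overhead.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, fused=True)
```
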
  • 9d4f0ee4ff fix requirements to avoid drop in throughput ferdinand.mom 2024-11-04 14:33:07 +0000
  • 519b506b2b add option to switch between pp engine ferdinand.mom 2024-11-04 14:32:44 +0000
  • 7c381a61eb lower timeout in train ferdinand.mom 2024-11-04 14:28:01 +0000
  • 4e1a6f8cdd set num_workers to 1, otherwise OS memory error ferdinand.mom 2024-11-04 14:27:50 +0000
  • 8e36bbe032 fix multi-node training by using global rank instead of local rank to init process_group ferdinand.mom 2024-11-03 00:14:14 +0000
  • b60b94b45f fix requirements to avoid drop in throughput ferdinand.mom 2024-11-02 02:17:06 +0000
  • 8741c9b167 change distributed option to pass to multi-node ferdinand.mom 2024-11-02 02:16:52 +0000
  • bd6b8a0972 add hf token + fix multi-node training with torchrun args ferdinand.mom 2024-11-02 01:37:53 +0000
  • 486c1763a6 add fuse adam ferdinand.mom 2024-11-02 01:18:56 +0000
  • 7996a318c1 fix DP integration within PP (1f1b) ferdinand.mom 2024-11-01 20:08:48 +0000
  • 2bafa3117d renaming + add option to switch between pp_engine (afab or 1f1b) ferdinand.mom 2024-10-30 20:44:08 +0000
  • f6c9a39d17 fix splitting input twice for context parallel (done in dataloader) ferdinand.mom 2024-10-30 15:23:29 +0000
  • 363dbd5c05 need to update max position embedding when seq_len is greater (for RoPE) ferdinand.mom 2024-10-30 15:12:06 +0000
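
The reason: RoPE frequency tables are precomputed up to max_position_embeddings, so positions past that bound cannot be indexed. A sketch of the guard, assuming HF-style config field names (picotron's may differ):

```python
def ensure_rope_capacity(config, sequence_length: int) -> None:
    # Grow the RoPE table bound before building the model so every
    # training position has a precomputed rotary frequency.
    config.max_position_embeddings = max(
        config.max_position_embeddings, sequence_length
    )
```
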
  • 508d57f948 don't hardcode path ferdinand.mom 2024-10-30 14:49:48 +0000
  • f74bff79e0 cleaning ferdinand.mom 2024-10-30 14:29:22 +0000
  • 2d198659e2 add slurm support ferdinand.mom 2024-10-30 14:25:18 +0000
  • fdf2df8344 add wandb ferdinand.mom 2024-10-30 14:25:10 +0000
  • 3c635092f9 add assert in TensorParallel for num_attention_heads and key_values_heads ferdinand.mom 2024-10-30 14:04:45 +0000
  • 1dbe034d57 better config creation ferdinand.mom 2024-10-30 13:53:50 +0000
  • 402aa4ccfc small change ferdinand.mom 2024-10-30 12:50:27 +0000
  • f1f6915ba1 1f1b fix zzhhjjj 2024-10-29 21:03:58 +0000
  • c7a3fb016a disable grad sync in afab zzhhjjj 2024-10-29 20:58:04 +0000
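
With all-forward-all-backward (AFAB) scheduling, each micro-batch's backward would normally trigger a gradient all-reduce, but only the last one needs it. A sketch of the usual toggle on a DDP-wrapped model (variable names and compute_loss are illustrative):

```python
# `model` is a torch.nn.parallel.DistributedDataParallel instance.
for i, micro_batch in enumerate(micro_batches):
    # Sync gradients only on the final micro-batch's backward pass.
    model.require_backward_grad_sync = (i == len(micro_batches) - 1)
    loss = compute_loss(model, micro_batch)
    loss.backward()
```
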
  • 47c00be8c7 breaking: add slurm stuff ferdinand.mom 2024-10-29 15:44:35 +0000
  • 987a7c5c99 add todo ring attention ferdinand.mom 2024-10-29 14:08:53 +0000
  • 46af5b0425 some fixes ferdinand.mom 2024-10-29 14:08:08 +0000
  • b7f3e253be add context parallel zzhhjjj 2024-10-29 13:42:38 +0000