Commit Graph

2040 Commits

Each entry lists: Author, SHA1, Commit Message, Date
Thomas Parnell
2f808e69ab
[Bugfix] StatLoggers: cache spec decode metrics when they get collected. (#6645)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2024-07-23 23:05:05 +00:00
Michael Goin
01c16ede6b
[CI] Add smoke test for non-uniform AutoFP8 quantization (#6702) 2024-07-23 22:45:12 +00:00
youkaichao
72fc704803
[build] relax wheel size limit (#6704) 2024-07-23 14:03:49 -07:00
Roger Wang
1bedf210e3
Bump transformers version for Llama 3.1 hotfix and patch Chameleon (#6690) 2024-07-23 13:47:48 -07:00
Travis Johnson
507ef787d8
[Model] Pipeline Parallel Support for DeepSeek v2 (#6519)
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
2024-07-23 12:22:09 -07:00
Yehoshua Cohen
58f53034ad
[Frontend] Add Usage data in each chunk for chat_serving. #6540 (#6652) 2024-07-23 11:41:55 -07:00
Michael Goin
0eb0757bef
[Misc] Add ignored layers for fp8 quantization (#6657) 2024-07-23 14:04:04 -04:00
Simon Mo
38c4b7e863
Bump version to 0.5.3.post1 (#6696) 2024-07-23 10:08:59 -07:00
Woosuk Kwon
a112a84aad
[BugFix] Fix RoPE error in Llama 3.1 (#6693) 2024-07-23 09:46:05 -07:00
Woosuk Kwon
461089a21a
[Bugfix] Fix a log error in chunked prefill (#6694) 2024-07-23 09:27:58 -07:00
youkaichao
71950af726
[doc][distributed] fix doc argument order (#6691) 2024-07-23 08:55:33 -07:00
Woosuk Kwon
cb1362a889
[Docs] Announce llama3.1 support (#6688) 2024-07-23 08:18:15 -07:00
Simon Mo
bb2fc08072
Bump version to v0.5.3 (#6674) 2024-07-23 00:00:08 -07:00
Simon Mo
3eda4ec780
support ignore patterns in model loader (#6673) 2024-07-22 23:59:42 -07:00
Roger Wang
22fa2e35cb
[VLM][Model] Support image input for Chameleon (#6633) 2024-07-22 23:50:48 -07:00
youkaichao
c5201240a4
[misc] only tqdm for first rank (#6672) 2024-07-22 21:57:27 -07:00
Cyrus Leung
97234be0ec
[Misc] Manage HTTP connections in one place (#6600) 2024-07-22 21:32:02 -07:00
youkaichao
c051bfe4eb
[doc][distributed] doc for setting up multi-node environment (#6529)
[doc][distributed] add more doc for setting up multi-node environment (#6529)
2024-07-22 21:22:09 -07:00
Michael Goin
9e0b558a09
[Misc] Support FP8 kv cache scales from compressed-tensors (#6528) 2024-07-23 04:11:50 +00:00
zhaotyer
e519ae097a
add tqdm when loading checkpoint shards (#6569)
Co-authored-by: tianyi.zhao <tianyi.zhao@transwarp.io>
Co-authored-by: youkaichao <youkaichao@126.com>
2024-07-22 20:48:01 -07:00
youkaichao
7c2749a4fd
[misc] add start loading models for users information (#6670) 2024-07-22 20:08:02 -07:00
Woosuk Kwon
729171ae58
[Misc] Enable chunked prefill by default for long context models (#6666) 2024-07-22 20:03:13 -07:00
Cheng Li
c5e8330997
[Bugfix] Fix null modules_to_not_convert in FBGEMM Fp8 quantization (#6665) 2024-07-22 19:25:05 -07:00
Cody Yu
e0c15758b8
[Core] Modulize prepare input and attention metadata builder (#6596) 2024-07-23 00:45:24 +00:00
Woosuk Kwon
bdf5fd1386
[Misc] Remove deprecation warning for beam search (#6659) 2024-07-23 00:21:58 +00:00
youkaichao
5a96ee52a3
[ci][build] add back vim in docker (#6661) 2024-07-22 16:26:29 -07:00
Jiaxin Shan
42c7f66a38
[Core] Support dynamically loading Lora adapter from HuggingFace (#6234)
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2024-07-22 15:42:40 -07:00
Kevin H. Luu
69d5ae38dc
[ci] Use different sccache bucket for CUDA 11.8 wheel build (#6656)
Signed-off-by: kevin <kevin@anyscale.com>
2024-07-22 14:20:41 -07:00
Tyler Michael Smith
fea59c7712
[Bugfix][Kernel] Use int64_t for indices in fp8 quant kernels (#6649) 2024-07-22 14:08:30 -06:00
Cyrus Leung
739b61a348
[Frontend] Refactor prompt processing (#4028)
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-07-22 10:13:53 -07:00
Jae-Won Chung
89c1c6a196
[Bugfix] Fix vocab_size field access in llava_next.py (#6624) 2024-07-22 05:02:51 +00:00
Woosuk Kwon
42de2cefcb
[Misc] Add a wrapper for torch.inference_mode (#6618) 2024-07-21 18:43:11 -07:00
Roger Wang
c9eef37f32
[Model] Initial Support for Chameleon (#5770) 2024-07-21 17:37:51 -07:00
Alexander Matveev
396d92d5e0
[Kernel][Core] Add AWQ support to the Marlin kernel (#6612) 2024-07-21 19:41:42 -04:00
Isotr0py
25e778aa16
[Model] Refactor and decouple phi3v image embedding (#6621) 2024-07-21 16:07:58 -07:00
Woosuk Kwon
b6df37f943
[Misc] Remove abused noqa (#6619) 2024-07-21 23:47:04 +08:00
sroy745
14f91fe67c
[Spec Decode] Disable Log Prob serialization to CPU for spec decoding for both draft and target models. (#6485) 2024-07-20 23:58:58 -07:00
Cyrus Leung
d7f4178dd9
[Frontend] Move chat utils (#6602)
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-07-21 08:38:17 +08:00
Robert Shaw
082ecd80d5
[ Bugfix ] Fix AutoFP8 fp8 marlin (#6609) 2024-07-20 17:25:56 -06:00
Michael Goin
f952bbc8ff
[Misc] Fix input_scale typing in w8a8_utils.py (#6579) 2024-07-20 23:11:13 +00:00
Robert Shaw
9364f74eee
[ Kernel ] Enable fp8-marlin for fbgemm-fp8 models (#6606) 2024-07-20 18:50:10 +00:00
Matt Wong
06d6c5fe9f
[Bugfix][CI/Build][Hardware][AMD] Fix AMD tests, add HF cache, update CK FA, add partially supported model notes (#6543) 2024-07-20 09:39:07 -07:00
Robert Shaw
683e3cb9c4
[ Misc ] fbgemm checkpoints (#6559) 2024-07-20 09:36:57 -07:00
Cyrus Leung
9042d68362
[Misc] Consolidate and optimize logic for building padded tensors (#6541) 2024-07-20 04:17:24 +00:00
Travis Johnson
3f8d42c81f
Pipeline Parallel: Guard for KeyErrors at request abort (#6587)
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
2024-07-19 19:18:19 -07:00
Antoni Baum
7bd82002ae
[Core] Allow specifying custom Executor (#6557) 2024-07-20 01:25:06 +00:00
Varun Sundar Rabindranath
2e26564259
[ Kernel ] FP8 Dynamic Per Token Quant - Add scale_ub (#6593)
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-07-19 18:15:26 -07:00
youkaichao
e81522e879
[build] add ib in image for out-of-the-box infiniband support (#6599)
[build] add ib so that multi-node support with infiniband can be supported out-of-the-box (#6599)
2024-07-19 17:16:57 -07:00
Murali Andoorveedu
45ceb85a0c
[Docs] Update PP docs (#6598) 2024-07-19 16:38:21 -07:00
Robert Shaw
4cc24f01b1
[ Kernel ] Enable Dynamic Per Token fp8 (#6547) 2024-07-19 23:08:15 +00:00