author | commit | subject | date
Rui Qiao | 05308891e2 | [Core] Pipeline parallel with Ray ADAG (#6837) | 2024-08-02 13:55:40 -07:00
    Support pipeline-parallelism with Ray accelerated DAG.
    Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
Lucas Wilkinson | a8d604ca2a | [Misc] Disambiguate quantized types via a new ScalarType (#6396) | 2024-08-02 13:51:58 -07:00
youkaichao | 806949514a | [ci] set timeout for test_oot_registration.py (#7082) | 2024-08-02 10:03:24 -07:00
youkaichao | 252357793d | [ci][distributed] try to fix pp test (#7054) | 2024-08-01 22:03:12 -07:00
Woosuk Kwon | 805a8a75f2 | [Misc] Support attention logits soft-capping with flash-attn (#7022) | 2024-08-01 13:14:37 -07:00
Michael Goin | fb3db61688 | [CI/Build] Remove sparseml requirement from testing (#7037) | 2024-08-01 12:00:51 -07:00
youkaichao | c8a7e93273 | [core][scheduler] simplify and improve scheduler (#6867) | 2024-07-31 23:51:09 -07:00
zifeitong | 3c10591ef2 | [Bugfix] Set SamplingParams.max_tokens for OpenAI requests if not provided by user (#6954) | 2024-07-31 21:13:34 -07:00
Jee Jee Li | 7ecee34321 | [Kernel][RFC] Refactor the punica kernel based on Triton (#5036) | 2024-07-31 17:12:24 -07:00
Michael Goin | 460c1884e3 | [Bugfix] Support cpu offloading with fp8 quantization (#6960) | 2024-07-31 12:47:46 -07:00
Cody Yu | bd70013407 | [MISC] Introduce pipeline parallelism partition strategies (#6920) | 2024-07-31 12:02:17 -07:00
    Co-authored-by: youkaichao <youkaichao@126.com>
Cyrus Leung | daed30c4a9 | [Bugfix] Fix feature size calculation for LLaVA-NeXT (#6982) | 2024-07-31 23:46:17 +08:00
HandH1998 | 6512937de1 | Support W4A8 quantization for vllm (#5218) | 2024-07-31 07:55:21 -06:00
Cyrus Leung | f230cc2ca6 | [Bugfix] Fix broadcasting logic for multi_modal_kwargs (#6836) | 2024-07-31 10:38:45 +08:00
Tyler Michael Smith | d7a299edaa | [Kernel] Remove scaled_fp8_quant kernel padding footgun (#6842) | 2024-07-30 16:37:01 -04:00
Sanger Steel | 052b6f8ca4 | [Bugfix] Fix tensorizer memory profiling bug during testing (#6881) | 2024-07-30 11:48:50 -07:00
Nick Hill | 5cf9254a9c | [BugFix] Fix use of per-request seed with pipeline parallel (#6698) | 2024-07-30 10:40:08 -07:00
Varun Sundar Rabindranath | af647fb8b3 | [Kernel] Tuned int8 kernels for Ada Lovelace (#6848) | 2024-07-29 20:24:58 -06:00
    Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Nick Hill | 9f69d8245a | [Frontend] New allowed_token_ids decoding request parameter (#6753) | 2024-07-29 23:37:27 +00:00
Thomas Parnell | 9a7e2d0534 | [Bugfix] Allow vllm to still work if triton is not installed. (#6786) | 2024-07-29 14:51:27 -07:00
    Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Peng Guanwen | db9e5708a9 | [Core] Reduce unnecessary compute when logprobs=None (#6532) | 2024-07-29 16:47:31 +00:00
Varun Sundar Rabindranath | 766435e660 | [Kernel] Tuned FP8 Kernels for Ada Lovelace (#6677) | 2024-07-29 09:42:35 -06:00
    Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Isotr0py | 7cbd9ec7a9 | [Model] Initialize support for InternVL2 series models (#6514) | 2024-07-29 10:16:30 +00:00
    Co-authored-by: Roger Wang <ywang@roblox.com>
Alexander Matveev | 75acdaa4b6 | [Kernel] Increase precision of GPTQ/AWQ Marlin kernel (#6795) | 2024-07-27 17:52:33 -04:00
Cyrus Leung | 1ad86acf17 | [Model] Initial support for BLIP-2 (#5920) | 2024-07-27 11:53:07 +00:00
    Co-authored-by: ywang96 <ywang@roblox.com>
Joe | 14dbd5a767 | [Model] H2O Danube3-4b (#6451) | 2024-07-26 20:47:50 -07:00
Sanger Steel | 969d032265 | [Bugfix]: Fix Tensorizer test failures (#6835) | 2024-07-26 20:02:25 -07:00
youkaichao | 443c7cf4cf | [ci][distributed] fix flaky tests (#6806) | 2024-07-25 17:44:09 -07:00
Michael Goin | 65b1f121c8 | [Bugfix] Fix kv_cache_dtype=fp8 without scales for FP8 checkpoints (#6761) | 2024-07-25 09:46:15 -07:00
Chang Su | 316a41ac1d | [Bugfix] Fix encoding_format in examples/openai_embedding_client.py (#6755) | 2024-07-24 22:48:07 -07:00
Cody Yu | 309aaef825 | [Bugfix] Fix decode tokens w. CUDA graph (#6757) | 2024-07-24 22:33:56 -07:00
Alphi | 9e169a4c61 | [Model] Adding support for MiniCPM-V (#4087) | 2024-07-24 20:59:30 -07:00
Evan Z. Liu | 5689e256ba | [Frontend] Represent tokens with identifiable strings (#6626) | 2024-07-25 09:51:00 +08:00
Michael Goin | 421e218b37 | [Bugfix] Bump transformers to 4.43.2 (#6752) | 2024-07-24 13:22:16 -07:00
Antoni Baum | 0e63494cf3 | Add fp8 support to reshape_and_cache_flash (#6667) | 2024-07-24 18:36:52 +00:00
Nick Hill | 2cf0df3381 | [Bugfix] Fix speculative decode seeded test (#6743) | 2024-07-24 08:58:31 -07:00
Nick Hill | c882a7f5b3 | [SpecDecoding] Update MLPSpeculator CI tests to use smaller model (#6714) | 2024-07-24 07:34:22 +00:00
William Lin | 5e8ca973eb | [Bugfix] fix flashinfer cudagraph capture for PP (#6708) | 2024-07-24 01:49:44 +00:00
dongmao zhang | 87525fab92 | [bitsandbytes]: support read bnb pre-quantized model (#5753) | 2024-07-23 23:45:09 +00:00
    Co-authored-by: Michael Goin <michael@neuralmagic.com>
Thomas Parnell | 2f808e69ab | [Bugfix] StatLoggers: cache spec decode metrics when they get collected. (#6645) | 2024-07-23 23:05:05 +00:00
    Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Michael Goin | 01c16ede6b | [CI] Add smoke test for non-uniform AutoFP8 quantization (#6702) | 2024-07-23 22:45:12 +00:00
Roger Wang | 1bedf210e3 | Bump transformers version for Llama 3.1 hotfix and patch Chameleon (#6690) | 2024-07-23 13:47:48 -07:00
Yehoshua Cohen | 58f53034ad | [Frontend] Add Usage data in each chunk for chat_serving. #6540 (#6652) | 2024-07-23 11:41:55 -07:00
Roger Wang | 22fa2e35cb | [VLM][Model] Support image input for Chameleon (#6633) | 2024-07-22 23:50:48 -07:00
Cyrus Leung | 97234be0ec | [Misc] Manage HTTP connections in one place (#6600) | 2024-07-22 21:32:02 -07:00
Michael Goin | 9e0b558a09 | [Misc] Support FP8 kv cache scales from compressed-tensors (#6528) | 2024-07-23 04:11:50 +00:00
Jiaxin Shan | 42c7f66a38 | [Core] Support dynamically loading Lora adapter from HuggingFace (#6234) | 2024-07-22 15:42:40 -07:00
    Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
Tyler Michael Smith | fea59c7712 | [Bugfix][Kernel] Use int64_t for indices in fp8 quant kernels (#6649) | 2024-07-22 14:08:30 -06:00
Cyrus Leung | 739b61a348 | [Frontend] Refactor prompt processing (#4028) | 2024-07-22 10:13:53 -07:00
    Co-authored-by: Roger Wang <ywang@roblox.com>
Alexander Matveev | 396d92d5e0 | [Kernel][Core] Add AWQ support to the Marlin kernel (#6612) | 2024-07-21 19:41:42 -04:00
sroy745 | 14f91fe67c | [Spec Decode] Disable Log Prob serialization to CPU for spec decoding for both draft and target models. (#6485) | 2024-07-20 23:58:58 -07:00
Cyrus Leung | d7f4178dd9 | [Frontend] Move chat utils (#6602) | 2024-07-21 08:38:17 +08:00
    Co-authored-by: Roger Wang <ywang@roblox.com>
Matt Wong | 06d6c5fe9f | [Bugfix][CI/Build][Hardware][AMD] Fix AMD tests, add HF cache, update CK FA, add partially supported model notes (#6543) | 2024-07-20 09:39:07 -07:00
Cyrus Leung | 9042d68362 | [Misc] Consolidate and optimize logic for building padded tensors (#6541) | 2024-07-20 04:17:24 +00:00
Antoni Baum | 7bd82002ae | [Core] Allow specifying custom Executor (#6557) | 2024-07-20 01:25:06 +00:00
Varun Sundar Rabindranath | 2e26564259 | [ Kernel ] FP8 Dynamic Per Token Quant - Add scale_ub (#6593) | 2024-07-19 18:15:26 -07:00
    Co-authored-by: Varun Sundar Rabindranth <varun@neuralmagic.com>
Robert Shaw | 4cc24f01b1 | [ Kernel ] Enable Dynamic Per Token fp8 (#6547) | 2024-07-19 23:08:15 +00:00
Thomas Parnell | f0bbfaf917 | [Bugfix] [SpecDecode] AsyncMetricsCollector: update time since last collection (#6578) | 2024-07-19 14:01:03 -07:00
Antoni Baum | 9ed82e7074 | [Misc] Small perf improvements (#6520) | 2024-07-19 12:10:56 -07:00
Thomas Parnell | a5314e8698 | [Model] RowParallelLinear: pass bias to quant_method.apply (#6327) | 2024-07-19 07:15:22 -06:00
    Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Woo-Yeon Lee | a921e86392 | [BUGFIX] Raise an error for no draft token case when draft_tp>1 (#6369) | 2024-07-19 06:01:09 -07:00
Cyrus Leung | 6366efc67b | [Bugfix][Frontend] Fix missing /metrics endpoint (#6463) | 2024-07-19 03:55:13 +00:00
Thomas Parnell | d4201e06d5 | [Bugfix] Make spec. decode respect per-request seed. (#6034) | 2024-07-18 19:22:08 -07:00
    Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
    Co-authored-by: Nick Hill <nickhill@us.ibm.com>
Nick Hill | b5672a112c | [Core] Multiprocessing Pipeline Parallel support (#6130) | 2024-07-18 19:15:52 -07:00
    Co-authored-by: Murali Andoorveedu <muralidhar.andoorveedu@centml.ai>
youkaichao | f53b8f0d05 | [ci][test] add correctness test for cpu offloading (#6549) | 2024-07-18 23:41:06 +00:00
Nick Hill | e2fbaee725 | [BugFix][Frontend] Use LoRA tokenizer in OpenAI APIs (#6227) | 2024-07-18 15:13:30 +08:00
    Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Cody Yu | b5af8c223c | [Model] Pipeline parallel support for Mixtral (#6516) | 2024-07-17 19:26:04 -07:00
Varun Sundar Rabindranath | b5241e41d9 | [ Kernel ] FP8 Dynamic-Per-Token Quant Kernel (#6511) | 2024-07-18 01:38:35 +00:00
    Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Alexander Matveev | e76466dde2 | [Core] draft_model_runner: Implement prepare_inputs on GPU for advance_step (#6338) | 2024-07-17 14:30:28 -07:00
Antoni Baum | 5f0b9933e6 | [Bugfix] Fix Ray Metrics API usage (#6354) | 2024-07-17 19:40:10 +00:00
Cody Yu | 2fa4623d9e | [Core] Refactor _prepare_model_input_tensors - take 2 (#6164) | 2024-07-17 09:37:16 -07:00
Murali Andoorveedu | 5fa6e9876e | [Bugfix] Fix for multinode crash on 4 PP (#6495) | 2024-07-17 08:25:10 +00:00
    Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
Cyrus Leung | 5bf35a91e4 | [Doc][CI/Build] Update docs and tests to use vllm serve (#6431) | 2024-07-17 07:43:21 +00:00
youkaichao | 7f62077af5 | [misc][distributed] improve tests (#6488) | 2024-07-16 17:35:52 -07:00
youkaichao | 09c2eb85dd | [ci][distributed] add pipeline parallel correctness test (#6410) | 2024-07-16 15:44:22 -07:00
Michael Goin | 978aed5300 | [Kernel][Attention] Separate Attention.kv_scale into k_scale and v_scale (#6081) | 2024-07-16 15:31:32 -07:00
Cody Yu | 160e1d8c99 | [Misc] Log spec decode metrics (#6454) | 2024-07-16 20:37:10 +00:00
Cyrus Leung | 38ef94888a | [CI/Build] Remove "boardwalk" image asset (#6460) | 2024-07-16 08:59:36 -07:00
sasha0552 | 7a3d2a5b95 | [Frontend] Support for chat completions input in the tokenize endpoint (#5923) | 2024-07-16 20:18:09 +08:00
Cyrus Leung | d97011512e | [CI/Build] vLLM cache directory for images (#6444) | 2024-07-15 23:12:25 -07:00
Joe | d92b3c5cde | [Bugfix][CI/Build] Test prompt adapters in openai entrypoint tests (#6419) | 2024-07-15 18:54:15 -07:00
Mor Zusman | 9ad32dacd9 | [BugFix][Model] Jamba - Handle aborted requests, Add tests and fix cleanup bug (#6425) | 2024-07-16 01:32:55 +00:00
    Co-authored-by: Mor Zusman <morz@ai21.com>
Thomas Parnell | 4ef95b0f06 | [Bugfix] use float32 precision in samplers/test_logprobs.py for comparing with HF (#6409) | 2024-07-15 13:14:49 -04:00
    Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
youkaichao | 69672f116c | [core][distributed] simplify code to support pipeline parallel (#6406) | 2024-07-14 21:20:51 -07:00
zifeitong | b47008b4d2 | [BugFix] BatchResponseData body should be optional (#6345) | 2024-07-15 04:06:09 +00:00
    Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Ethan Xu | dbfe254eda | [Feature] vLLM CLI (#5090) | 2024-07-14 15:36:43 -07:00
    Co-authored-by: simon-mo <simon.mo@hey.com>
Isotr0py | 540c0368b1 | [Model] Initialize Fuyu-8B support (#3924) | 2024-07-14 05:27:14 +00:00
    Co-authored-by: Roger Wang <ywang@roblox.com>
youkaichao | 41708e5034 | [ci] try to add multi-node tests (#6280) | 2024-07-12 21:51:48 -07:00
    Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
    Co-authored-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
Michael Goin | 111fc6e7ec | [Misc] Add generated git commit hash as vllm.__commit__ (#6386) | 2024-07-12 22:52:15 +00:00
Yihuan Bu | b039cbbce3 | [Misc] add fixture to guided processor tests (#6341) | 2024-07-12 09:55:39 -07:00
Cyrus Leung | 024ad87cdc | [Bugfix] Fix dtype mismatch in PaliGemma (#6367) | 2024-07-12 08:22:18 -07:00
Robert Shaw | aea19f0989 | [ Misc ] Support Models With Bias in compressed-tensors integration (#6356) | 2024-07-12 11:11:29 -04:00
Hongxia Yang | b6c16cf8ff | [ROCm][AMD] unify CUDA_VISIBLE_DEVICES usage in cuda/rocm (#6352) | 2024-07-11 21:30:46 -07:00
Lily Liu | d6ab528997 | [Misc] Remove flashinfer warning, add flashinfer tests to CI (#6351) | 2024-07-12 01:32:06 +00:00
Robert Shaw | 7ed6a4f0e1 | [ BugFix ] Prompt Logprobs Detokenization (#6223) | 2024-07-11 22:02:29 +00:00
    Co-authored-by: Zifei Tong <zifeitong@gmail.com>
xwjiang2010 | 1df43de9bb | [bug fix] Fix llava next feature size calculation. (#6339) | 2024-07-11 17:21:10 +00:00
    Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com>
Robert Shaw | b675069d74 | [ Misc ] Refactor Marlin Python Utilities (#6082) | 2024-07-11 15:40:11 +00:00
    Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
sroy745 | ae151d73be | [Speculative Decoding] Enabling bonus token in speculative decoding for KV cache based models (#5765) | 2024-07-10 16:02:47 -07:00
youkaichao | da78caecfa | [core][distributed] add zmq fallback for broadcasting large objects (#6183) | 2024-07-09 18:49:11 -07:00
Abhinav Goyal | 2416b26e11 | [Speculative Decoding] Medusa Implementation with Top-1 proposer (#4978) | 2024-07-09 18:34:02 -07:00