2546 Commits

Author SHA1 Message Date
Cyrus Leung
855c262a6b
[Frontend] Multimodal support in offline chat (#8098) 2024-09-04 05:22:17 +00:00
Peter Salas
2be8ec6e71
[Model] Add Ultravox support for multiple audio chunks (#7963) 2024-09-04 04:38:21 +00:00
Dipika Sikka
e16fa99a6a
[Misc] Update fbgemmfp8 to use vLLMParameters (#7972)
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-09-03 20:12:41 -06:00
Woosuk Kwon
61f4a93d14
[TPU][Bugfix] Use XLA rank for persistent cache path (#8137) 2024-09-03 18:35:33 -07:00
Nick Hill
d4db9f53c8
[Benchmark] Add --async-engine option to benchmark_throughput.py (#7964) 2024-09-03 20:57:41 -04:00
Dipika Sikka
2188a60c7e
[Misc] Update GPTQ to use vLLMParameters (#7976) 2024-09-03 17:21:44 -04:00
Simon Mo
dc0b6066ab
[CI] Change PR remainder to avoid at-mentions (#8134) 2024-09-03 14:11:42 -07:00
Woosuk Kwon
0af3abe3d3
[TPU][Bugfix] Fix next_token_ids shape (#8128) 2024-09-03 13:29:24 -07:00
Kevin H. Luu
f1575dc99f
[ci] Fix GHA workflow (#8129)
Signed-off-by: kevin <kevin@anyscale.com>
2024-09-03 13:25:09 -07:00
tomeras91
c02638efb3
[CI/Build] make pip install vllm work in macos (for import only) (#8118) 2024-09-03 12:37:08 -07:00
Antoni Baum
652c83b697
[Misc] Raise a more informative exception in add/remove_logger (#7750) 2024-09-03 12:28:25 -07:00
Alexander Matveev
6d646d08a2
[Core] Optimize Async + Multi-step (#8050) 2024-09-03 18:50:29 +00:00
Kevin H. Luu
95a178f861
[CI] Only PR reviewers/committers can trigger CI on PR (#8124)
Signed-off-by: kevin <kevin@anyscale.com>
2024-09-03 11:32:27 -07:00
Cody Yu
bd852f2a8b
[Performance] Enable chunked prefill and prefix caching together (#8120)
Co-authored-by: Tao He <sighingnow@gmail.com>
Co-authored-by: Juelianqvq <Juelianqvq@noreply.github.com>
2024-09-03 10:49:18 -07:00
Isotr0py
ec266536b7
[Bugfix][VLM] Add fallback to SDPA for ViT model running on CPU backend (#8061) 2024-09-03 21:37:52 +08:00
Woosuk Kwon
0fbc6696c2
[Bugfix] Fix single output condition in output processor (#7881) 2024-09-02 20:35:42 -07:00
wang.yuqi
6e36f4fa6c
improve chunked prefill performance
[Bugfix] Fix #7592 vllm 0.5.4 enable_chunked_prefill throughput is slightly lower than 0.5.3~0.5.0. (#7874)
2024-09-02 14:20:12 -07:00
Isotr0py
dd2a6a82e3
[Bugfix] Fix internlm2 tensor parallel inference (#8055) 2024-09-02 23:48:56 +08:00
Isotr0py
4ca65a9763
[Core][Bugfix] Accept GGUF model without .gguf extension (#8056) 2024-09-02 08:43:26 -04:00
Woosuk Kwon
e2b2aa5a0f
[TPU] Align worker index with node boundary (#7932) 2024-09-01 23:09:46 -07:00
Lily Liu
e6a26ed037
[SpecDecode][Kernel] Flashinfer Rejection Sampling (#7244) 2024-09-01 21:23:29 -07:00
Shawn Tan
f8d60145b4
[Model] Add Granite model (#7436)
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2024-09-01 18:37:18 -07:00
Roger Wang
5b86b19954
[Misc] Optional installation of audio related packages (#8063) 2024-09-01 14:46:57 -07:00
Roger Wang
5231f0898e
[Frontend][VLM] Add support for multiple multi-modal items (#8049) 2024-08-31 16:35:53 -07:00
Robert Shaw
8423aef4c8
[BugFix][Core] Multistep Fix Crash on Request Cancellation (#8059) 2024-08-31 19:44:03 +00:00
Nicolò Lucchesi
4f5d8446ed
[Bugfix] Fix ModelScope models in v0.5.5 (#8037) 2024-08-31 00:27:58 -07:00
Cyrus Leung
d05f0a9db2
[Bugfix] Fix import error in Phi-3.5-MoE (#8052) 2024-08-30 22:26:55 -07:00
Pavani Majety
622f8abff8
[Bugfix] bugfix and add model test for flashinfer fp8 kv cache. (#8013) 2024-08-30 22:18:50 -07:00
Wenxiang
1248e8506a
[Model] Adding support for MSFT Phi-3.5-MoE (#7729)
Co-authored-by: Your Name <you@example.com>
Co-authored-by: Zeqi Lin <zelin@microsoft.com>
Co-authored-by: Zeqi Lin <Zeqi.Lin@microsoft.com>
2024-08-30 13:42:57 -06:00
Woosuk Kwon
2684efc467
[TPU][Bugfix] Fix tpu type api (#8035) 2024-08-30 09:01:26 -07:00
Kaunil Dhruv
058344f89a
[Frontend]-config-cli-args (#7737)
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Kaunil Dhruv <kaunil_dhruv@intuit.com>
2024-08-30 08:21:02 -07:00
Cyrus Leung
98cef6a227
[Core] Increase default max_num_batched_tokens for multimodal models (#8028) 2024-08-30 08:20:34 -07:00
Jungho Christopher Cho
f97be32d1d
[VLM][Model] TP support for ViTs (#7186)
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-08-30 08:19:27 -07:00
Cyrus Leung
afd39a4511
[Bugfix] Fix import error in Exaone model (#8034) 2024-08-30 08:03:28 -07:00
Richard Liu
2148441fd3
[TPU] Support single and multi-host TPUs on GKE (#7613) 2024-08-30 00:27:40 -07:00
Yohan Na
dc13e99348
[MODEL] add Exaone model support (#7819) 2024-08-29 23:34:20 -07:00
Avshalom Manevich
34a0e96d46
[Kernel] changing fused moe kernel chunk size default to 32k (#7995) 2024-08-30 04:11:39 +00:00
Woosuk Kwon
80c7b089b1
[TPU] Async output processing for TPU (#8011) 2024-08-29 19:35:29 -07:00
afeldman-nm
428dd1445e
[Core] Logprobs support in Multi-step (#7652) 2024-08-29 19:19:08 -07:00
Cyrus Leung
4abed65c58
[VLM] Disallow overflowing max_model_len for multimodal models (#7998) 2024-08-29 17:49:04 -07:00
Wei-Sheng Chin
0c785d344d
Add more percentiles and latencies (#7759) 2024-08-29 16:48:11 -07:00
chenqianfzh
4664ceaad6
support bitsandbytes 8-bit and FP4 quantized models (#7445) 2024-08-29 19:09:08 -04:00
Harsha vardhan manoj Bikki
257afc37c5
[Neuron] Adding support for context-lenght, token-gen buckets. (#7885)
Co-authored-by: Harsha Bikki <harbikh@amazon.com>
2024-08-29 13:58:14 -07:00
Dipika Sikka
86a677de42
[misc] update tpu int8 to use new vLLM Parameters (#7973) 2024-08-29 16:46:55 -04:00
Isotr0py
d78789ac16
[Bugfix] Fix incorrect vocal embedding shards for GGUF model in tensor parallelism (#7954) 2024-08-29 15:54:49 -04:00
kushanam
c334b1898b
extend cuda graph size for H200 (#7894)
Co-authored-by: youkaichao <youkaichao@126.com>
2024-08-29 12:15:04 -07:00
Pavani Majety
6b3421567d
[Core][Kernels] Enable FP8 KV Cache with Flashinfer backend. + BugFix for kv_cache_dtype=auto (#7985)
Co-authored-by: Simon Mo <simon.mo@hey.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2024-08-29 14:53:11 -04:00
Alexander Matveev
3f60f2244e
[Core] Combine async postprocessor and multi-step (#7921) 2024-08-29 11:18:26 -07:00
Jonas M. Kübler
f205c09854
[Bugfix] Unify rank computation across regular decoding and speculative decoding (#7899) 2024-08-28 22:18:13 -07:00
youkaichao
ef99a78760
Revert "[Core][Kernels] Use FlashInfer backend for FP8 KV Cache when available." (#7982) 2024-08-28 21:27:06 -07:00