| Author | Commit | Message | Date |
| --- | --- | --- | --- |
| Kevin H. Luu | f1575dc99f | [ci] Fix GHA workflow (#8129); Signed-off-by: kevin <kevin@anyscale.com> | 2024-09-03 13:25:09 -07:00 |
| tomeras91 | c02638efb3 | [CI/Build] make pip install vllm work in macos (for import only) (#8118) | 2024-09-03 12:37:08 -07:00 |
| Antoni Baum | 652c83b697 | [Misc] Raise a more informative exception in add/remove_logger (#7750) | 2024-09-03 12:28:25 -07:00 |
| Alexander Matveev | 6d646d08a2 | [Core] Optimize Async + Multi-step (#8050) | 2024-09-03 18:50:29 +00:00 |
| Kevin H. Luu | 95a178f861 | [CI] Only PR reviewers/committers can trigger CI on PR (#8124); Signed-off-by: kevin <kevin@anyscale.com> | 2024-09-03 11:32:27 -07:00 |
| Cody Yu | bd852f2a8b | [Performance] Enable chunked prefill and prefix caching together (#8120); Co-authored-by: Tao He <sighingnow@gmail.com>; Co-authored-by: Juelianqvq <Juelianqvq@noreply.github.com> | 2024-09-03 10:49:18 -07:00 |
| Isotr0py | ec266536b7 | [Bugfix][VLM] Add fallback to SDPA for ViT model running on CPU backend (#8061) | 2024-09-03 21:37:52 +08:00 |
| Woosuk Kwon | 0fbc6696c2 | [Bugfix] Fix single output condition in output processor (#7881) | 2024-09-02 20:35:42 -07:00 |
| wang.yuqi | 6e36f4fa6c | [Bugfix] Fix #7592 vllm 0.5.4 enable_chunked_prefill throughput is slightly lower than 0.5.3~0.5.0. (#7874); improve chunked prefill performance | 2024-09-02 14:20:12 -07:00 |
| Isotr0py | dd2a6a82e3 | [Bugfix] Fix internlm2 tensor parallel inference (#8055) | 2024-09-02 23:48:56 +08:00 |
| Isotr0py | 4ca65a9763 | [Core][Bugfix] Accept GGUF model without .gguf extension (#8056) | 2024-09-02 08:43:26 -04:00 |
| Woosuk Kwon | e2b2aa5a0f | [TPU] Align worker index with node boundary (#7932) | 2024-09-01 23:09:46 -07:00 |
| Lily Liu | e6a26ed037 | [SpecDecode][Kernel] Flashinfer Rejection Sampling (#7244) | 2024-09-01 21:23:29 -07:00 |
| Shawn Tan | f8d60145b4 | [Model] Add Granite model (#7436); Co-authored-by: Nick Hill <nickhill@us.ibm.com> | 2024-09-01 18:37:18 -07:00 |
| Roger Wang | 5b86b19954 | [Misc] Optional installation of audio related packages (#8063) | 2024-09-01 14:46:57 -07:00 |
| Roger Wang | 5231f0898e | [Frontend][VLM] Add support for multiple multi-modal items (#8049) | 2024-08-31 16:35:53 -07:00 |
| Robert Shaw | 8423aef4c8 | [BugFix][Core] Multistep Fix Crash on Request Cancellation (#8059) | 2024-08-31 19:44:03 +00:00 |
| Nicolò Lucchesi | 4f5d8446ed | [Bugfix] Fix ModelScope models in v0.5.5 (#8037) | 2024-08-31 00:27:58 -07:00 |
| Cyrus Leung | d05f0a9db2 | [Bugfix] Fix import error in Phi-3.5-MoE (#8052) | 2024-08-30 22:26:55 -07:00 |
| Pavani Majety | 622f8abff8 | [Bugfix] bugfix and add model test for flashinfer fp8 kv cache. (#8013) | 2024-08-30 22:18:50 -07:00 |
| Wenxiang | 1248e8506a | [Model] Adding support for MSFT Phi-3.5-MoE (#7729); Co-authored-by: Your Name <you@example.com>; Co-authored-by: Zeqi Lin <zelin@microsoft.com>; Co-authored-by: Zeqi Lin <Zeqi.Lin@microsoft.com> | 2024-08-30 13:42:57 -06:00 |
| Woosuk Kwon | 2684efc467 | [TPU][Bugfix] Fix tpu type api (#8035) | 2024-08-30 09:01:26 -07:00 |
| Kaunil Dhruv | 058344f89a | [Frontend]-config-cli-args (#7737); Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>; Co-authored-by: Kaunil Dhruv <kaunil_dhruv@intuit.com> | 2024-08-30 08:21:02 -07:00 |
| Cyrus Leung | 98cef6a227 | [Core] Increase default max_num_batched_tokens for multimodal models (#8028) | 2024-08-30 08:20:34 -07:00 |
| Jungho Christopher Cho | f97be32d1d | [VLM][Model] TP support for ViTs (#7186); Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>; Co-authored-by: Roger Wang <ywang@roblox.com> | 2024-08-30 08:19:27 -07:00 |
| Cyrus Leung | afd39a4511 | [Bugfix] Fix import error in Exaone model (#8034) | 2024-08-30 08:03:28 -07:00 |
| Richard Liu | 2148441fd3 | [TPU] Support single and multi-host TPUs on GKE (#7613) | 2024-08-30 00:27:40 -07:00 |
| Yohan Na | dc13e99348 | [MODEL] add Exaone model support (#7819) | 2024-08-29 23:34:20 -07:00 |
| Avshalom Manevich | 34a0e96d46 | [Kernel] changing fused moe kernel chunk size default to 32k (#7995) | 2024-08-30 04:11:39 +00:00 |
| Woosuk Kwon | 80c7b089b1 | [TPU] Async output processing for TPU (#8011) | 2024-08-29 19:35:29 -07:00 |
| afeldman-nm | 428dd1445e | [Core] Logprobs support in Multi-step (#7652) | 2024-08-29 19:19:08 -07:00 |
| Cyrus Leung | 4abed65c58 | [VLM] Disallow overflowing max_model_len for multimodal models (#7998) | 2024-08-29 17:49:04 -07:00 |
| Wei-Sheng Chin | 0c785d344d | Add more percentiles and latencies (#7759) | 2024-08-29 16:48:11 -07:00 |
| chenqianfzh | 4664ceaad6 | support bitsandbytes 8-bit and FP4 quantized models (#7445) | 2024-08-29 19:09:08 -04:00 |
| Harsha vardhan manoj Bikki | 257afc37c5 | [Neuron] Adding support for context-length, token-gen buckets. (#7885); Co-authored-by: Harsha Bikki <harbikh@amazon.com> | 2024-08-29 13:58:14 -07:00 |
| Dipika Sikka | 86a677de42 | [misc] update tpu int8 to use new vLLM Parameters (#7973) | 2024-08-29 16:46:55 -04:00 |
| Isotr0py | d78789ac16 | [Bugfix] Fix incorrect vocab embedding shards for GGUF model in tensor parallelism (#7954) | 2024-08-29 15:54:49 -04:00 |
| kushanam | c334b1898b | extend cuda graph size for H200 (#7894); Co-authored-by: youkaichao <youkaichao@126.com> | 2024-08-29 12:15:04 -07:00 |
| Pavani Majety | 6b3421567d | [Core][Kernels] Enable FP8 KV Cache with Flashinfer backend. + BugFix for kv_cache_dtype=auto (#7985); Co-authored-by: Simon Mo <simon.mo@hey.com>; Co-authored-by: Cody Yu <hao.yu.cody@gmail.com> | 2024-08-29 14:53:11 -04:00 |
| Alexander Matveev | 3f60f2244e | [Core] Combine async postprocessor and multi-step (#7921) | 2024-08-29 11:18:26 -07:00 |
| Jonas M. Kübler | f205c09854 | [Bugfix] Unify rank computation across regular decoding and speculative decoding (#7899) | 2024-08-28 22:18:13 -07:00 |
| youkaichao | ef99a78760 | Revert "[Core][Kernels] Use FlashInfer backend for FP8 KV Cache when available." (#7982) | 2024-08-28 21:27:06 -07:00 |
| Peter Salas | 74d5543ec5 | [VLM][Core] Fix exceptions on ragged NestedTensors (#7974) | 2024-08-29 03:24:31 +00:00 |
| youkaichao | a7f65c2be9 | [torch.compile] remove reset (#7975) | 2024-08-28 17:32:26 -07:00 |
| Nick Hill | 4289cad37f | [Frontend] Minor optimizations to zmq decoupled front-end (#7957); Co-authored-by: Robert Shaw <rshaw@neuralmagic> | 2024-08-28 17:22:43 -07:00 |
| Michael Goin | af59df0a10 | Remove faulty Meta-Llama-3-8B-Instruct-FP8.yaml lm-eval test (#7961) | 2024-08-28 19:19:17 -04:00 |
| youkaichao | ce6bf3a2cf | [torch.compile] avoid Dynamo guard evaluation overhead (#7898); Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> | 2024-08-28 16:10:12 -07:00 |
| bnellnm | 3cdfe1f38b | [Bugfix] Make torch registration of punica ops optional (#7970) | 2024-08-28 16:11:49 -06:00 |
| Mor Zusman | fdd9daafa3 | [Kernel/Model] Migrate mamba_ssm and causal_conv1d kernels to vLLM (#7651) | 2024-08-28 15:06:52 -07:00 |
| Stas Bekman | 8c56e57def | [Doc] fix 404 link (#7966) | 2024-08-28 13:54:23 -07:00 |