Author | Commit | Subject | Date
Joe Runde | cfe712bf1a | [CI/Build] Use python 3.12 in cuda image (#8133) | 2024-09-07 13:03:16 -07:00
    Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
Isotr0py | e807125936 | [Model][VLM] Support multi-images inputs for InternVL2 models (#8201) | 2024-09-07 16:38:23 +08:00
Cyrus Leung | 9f68e00d27 | [Bugfix] Fix broken OpenAI tensorizer test (#8258) | 2024-09-07 08:02:39 +00:00
youkaichao | ce2702a923 | [tpu][misc] fix typo (#8260) | 2024-09-06 22:40:46 -07:00
Cyrus Leung | 2f707fcb35 | [Model] Multi-input support for LLaVA (#8238) | 2024-09-07 02:57:24 +00:00
Patrick von Platen | 29f49cd6e3 | [Model] Allow loading from original Mistral format (#8168) | 2024-09-06 17:02:05 -06:00
    Co-authored-by: Michael Goin <michael@neuralmagic.com>
Alexey Kondratiev (AMD) | 1447c97e75 | [CI/Build] Increasing timeout for multiproc worker tests (#8203) | 2024-09-06 11:51:03 -07:00
afeldman-nm | e5cab71531 | [Frontend] Add --logprobs argument to benchmark_serving.py (#8191) | 2024-09-06 09:01:14 -07:00
Jiaxin Shan | db3bf7c991 | [Core] Support load and unload LoRA in api server (#6566) | 2024-09-05 18:10:33 -07:00
    Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Alex Brooks | 9da25a88aa | [MODEL] Qwen Multimodal Support (Qwen-VL / Qwen-VL-Chat) (#8029) | 2024-09-05 12:48:10 +00:00
    Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
    Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
manikandan.tm@zucisystems.com | 8685ba1a1e | Inclusion of InternVLChatModel In PP_SUPPORTED_MODELS (Pipeline Parallelism) (#7860) | 2024-09-05 11:33:37 +00:00
Elfie Guo | e39ebf5cf5 | [Core/Bugfix] Add query dtype as per FlashInfer API requirements. (#8173) | 2024-09-05 05:12:26 +00:00
Kyle Mistele | e02ce498be | [Feature] OpenAI-Compatible Tools API + Streaming for Hermes & Mistral models (#5649) | 2024-09-04 13:18:13 -07:00
    Co-authored-by: constellate <constellate@1-ai-appserver-staging.codereach.com>
    Co-authored-by: Kyle Mistele <kyle@constellate.ai>
Woosuk Kwon | 561d6f8077 | [CI] Change test input in Gemma LoRA test (#8163) | 2024-09-04 13:05:50 -07:00
alexeykondrat | d1dec64243 | [CI/Build][ROCm] Enabling LoRA tests on ROCm (#7369) | 2024-09-04 11:57:54 -07:00
    Co-authored-by: Simon Mo <simon.mo@hey.com>
Cody Yu | 2ad2e5608e | [MISC] Consolidate FP8 kv-cache tests (#8131) | 2024-09-04 18:53:25 +00:00
Cyrus Leung | 855c262a6b | [Frontend] Multimodal support in offline chat (#8098) | 2024-09-04 05:22:17 +00:00
Peter Salas | 2be8ec6e71 | [Model] Add Ultravox support for multiple audio chunks (#7963) | 2024-09-04 04:38:21 +00:00
Dipika Sikka | 2188a60c7e | [Misc] Update GPTQ to use vLLMParameters (#7976) | 2024-09-03 17:21:44 -04:00
Alexander Matveev | 6d646d08a2 | [Core] Optimize Async + Multi-step (#8050) | 2024-09-03 18:50:29 +00:00
wang.yuqi | 6e36f4fa6c | [Bugfix] Fix #7592: vllm 0.5.4 enable_chunked_prefill throughput is slightly lower than 0.5.3~0.5.0 (#7874) | 2024-09-02 14:20:12 -07:00
    Improves chunked prefill performance.
Lily Liu | e6a26ed037 | [SpecDecode][Kernel] Flashinfer Rejection Sampling (#7244) | 2024-09-01 21:23:29 -07:00
Shawn Tan | f8d60145b4 | [Model] Add Granite model (#7436) | 2024-09-01 18:37:18 -07:00
    Co-authored-by: Nick Hill <nickhill@us.ibm.com>
Roger Wang | 5b86b19954 | [Misc] Optional installation of audio related packages (#8063) | 2024-09-01 14:46:57 -07:00
Roger Wang | 5231f0898e | [Frontend][VLM] Add support for multiple multi-modal items (#8049) | 2024-08-31 16:35:53 -07:00
Pavani Majety | 622f8abff8 | [Bugfix] Bugfix and add model test for flashinfer fp8 kv cache (#8013) | 2024-08-30 22:18:50 -07:00
Wenxiang | 1248e8506a | [Model] Adding support for MSFT Phi-3.5-MoE (#7729) | 2024-08-30 13:42:57 -06:00
    Co-authored-by: Your Name <you@example.com>
    Co-authored-by: Zeqi Lin <zelin@microsoft.com>
    Co-authored-by: Zeqi Lin <Zeqi.Lin@microsoft.com>
Kaunil Dhruv | 058344f89a | [Frontend]-config-cli-args (#7737) | 2024-08-30 08:21:02 -07:00
    Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
    Co-authored-by: Kaunil Dhruv <kaunil_dhruv@intuit.com>
Jungho Christopher Cho | f97be32d1d | [VLM][Model] TP support for ViTs (#7186) | 2024-08-30 08:19:27 -07:00
    Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
    Co-authored-by: Roger Wang <ywang@roblox.com>
afeldman-nm | 428dd1445e | [Core] Logprobs support in Multi-step (#7652) | 2024-08-29 19:19:08 -07:00
Cyrus Leung | 4abed65c58 | [VLM] Disallow overflowing max_model_len for multimodal models (#7998) | 2024-08-29 17:49:04 -07:00
chenqianfzh | 4664ceaad6 | Support bitsandbytes 8-bit and FP4 quantized models (#7445) | 2024-08-29 19:09:08 -04:00
Pavani Majety | 6b3421567d | [Core][Kernels] Enable FP8 KV Cache with Flashinfer backend + bugfix for kv_cache_dtype=auto (#7985) | 2024-08-29 14:53:11 -04:00
    Co-authored-by: Simon Mo <simon.mo@hey.com>
    Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
Alexander Matveev | 3f60f2244e | [Core] Combine async postprocessor and multi-step (#7921) | 2024-08-29 11:18:26 -07:00
Jonas M. Kübler | f205c09854 | [Bugfix] Unify rank computation across regular decoding and speculative decoding (#7899) | 2024-08-28 22:18:13 -07:00
youkaichao | ef99a78760 | Revert "[Core][Kernels] Use FlashInfer backend for FP8 KV Cache when available." (#7982) | 2024-08-28 21:27:06 -07:00
Peter Salas | 74d5543ec5 | [VLM][Core] Fix exceptions on ragged NestedTensors (#7974) | 2024-08-29 03:24:31 +00:00
youkaichao | a7f65c2be9 | [torch.compile] remove reset (#7975) | 2024-08-28 17:32:26 -07:00
youkaichao | ce6bf3a2cf | [torch.compile] avoid Dynamo guard evaluation overhead (#7898) | 2024-08-28 16:10:12 -07:00
    Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Mor Zusman | fdd9daafa3 | [Kernel/Model] Migrate mamba_ssm and causal_conv1d kernels to vLLM (#7651) | 2024-08-28 15:06:52 -07:00
rasmith | e5697d161c | [Kernel][Triton][AMD] Adding Triton implementations awq_dequantize and awq_gemm to support AWQ (#7386) | 2024-08-28 15:37:47 -04:00
Pavani Majety | b98cc28f91 | [Core][Kernels] Use FlashInfer backend for FP8 KV Cache when available (#7798) | 2024-08-28 10:01:22 -07:00
    Co-authored-by: Simon Mo <simon.mo@hey.com>
Cody Yu | e3580537a4 | [Performance] Enable chunked prefill and prefix caching together (#7753) | 2024-08-28 00:36:31 -07:00
Cyrus Leung | 51f86bf487 | [mypy][CI/Build] Fix mypy errors (#7929) | 2024-08-27 23:47:44 -07:00
Peter Salas | fab5f53e2d | [Core][VLM] Stack multimodal tensors to represent multiple images within each prompt (#7902) | 2024-08-28 01:53:56 +00:00
zifeitong | 5340a2dccf | [Model] Add multi-image input support for LLaVA-Next offline inference (#7230) | 2024-08-28 07:09:02 +08:00
Dipika Sikka | fc911880cc | [Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel (#7766) | 2024-08-27 15:07:09 -07:00
    Co-authored-by: ElizaWszola <eliza@neuralmagic.com>
Isotr0py | 9db642138b | [CI/Build][VLM] Cleanup multiple images inputs model test (#7897) | 2024-08-27 15:28:30 +00:00
Patrick von Platen | 6fc4e6e07a | [Model] Add Mistral Tokenization to improve robustness and chat encoding (#7739) | 2024-08-27 12:40:02 +00:00
youkaichao | 64cc644425 | [core][torch.compile] discard the compile for profiling (#7796) | 2024-08-26 21:33:58 -07:00