Kyle Mistele
|
08287ef675
|
[Bugfix] Streamed tool calls now more strictly follow OpenAI's format; ensures Vercel AI SDK compatibility (#8272)
|
2024-09-09 10:45:11 -04:00 |
|
Joe Runde
|
cfe712bf1a
|
[CI/Build] Use python 3.12 in cuda image (#8133)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
|
2024-09-07 13:03:16 -07:00 |
|
Isotr0py
|
e807125936
|
[Model][VLM] Support multi-images inputs for InternVL2 models (#8201)
|
2024-09-07 16:38:23 +08:00 |
|
Cyrus Leung
|
9f68e00d27
|
[Bugfix] Fix broken OpenAI tensorizer test (#8258)
|
2024-09-07 08:02:39 +00:00 |
|
youkaichao
|
ce2702a923
|
[tpu][misc] fix typo (#8260)
|
2024-09-06 22:40:46 -07:00 |
|
Cyrus Leung
|
2f707fcb35
|
[Model] Multi-input support for LLaVA (#8238)
|
2024-09-07 02:57:24 +00:00 |
|
Patrick von Platen
|
29f49cd6e3
|
[Model] Allow loading from original Mistral format (#8168)
Co-authored-by: Michael Goin <michael@neuralmagic.com>
|
2024-09-06 17:02:05 -06:00 |
|
Alexey Kondratiev(AMD)
|
1447c97e75
|
[CI/Build] Increasing timeout for multiproc worker tests (#8203)
|
2024-09-06 11:51:03 -07:00 |
|
afeldman-nm
|
e5cab71531
|
[Frontend] Add --logprobs argument to benchmark_serving.py (#8191)
|
2024-09-06 09:01:14 -07:00 |
|
Jiaxin Shan
|
db3bf7c991
|
[Core] Support load and unload LoRA in api server (#6566)
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
|
2024-09-05 18:10:33 -07:00 |
|
Alex Brooks
|
9da25a88aa
|
[MODEL] Qwen Multimodal Support (Qwen-VL / Qwen-VL-Chat) (#8029)
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
|
2024-09-05 12:48:10 +00:00 |
|
manikandan.tm@zucisystems.com
|
8685ba1a1e
|
Inclusion of InternVLChatModel In PP_SUPPORTED_MODELS(Pipeline Parallelism) (#7860)
|
2024-09-05 11:33:37 +00:00 |
|
Elfie Guo
|
e39ebf5cf5
|
[Core/Bugfix] Add query dtype as per FlashInfer API requirements. (#8173)
|
2024-09-05 05:12:26 +00:00 |
|
Kyle Mistele
|
e02ce498be
|
[Feature] OpenAI-Compatible Tools API + Streaming for Hermes & Mistral models (#5649)
Co-authored-by: constellate <constellate@1-ai-appserver-staging.codereach.com>
Co-authored-by: Kyle Mistele <kyle@constellate.ai>
|
2024-09-04 13:18:13 -07:00 |
|
Woosuk Kwon
|
561d6f8077
|
[CI] Change test input in Gemma LoRA test (#8163)
|
2024-09-04 13:05:50 -07:00 |
|
alexeykondrat
|
d1dec64243
|
[CI/Build][ROCm] Enabling LoRA tests on ROCm (#7369)
Co-authored-by: Simon Mo <simon.mo@hey.com>
|
2024-09-04 11:57:54 -07:00 |
|
Cody Yu
|
2ad2e5608e
|
[MISC] Consolidate FP8 kv-cache tests (#8131)
|
2024-09-04 18:53:25 +00:00 |
|
Cyrus Leung
|
855c262a6b
|
[Frontend] Multimodal support in offline chat (#8098)
|
2024-09-04 05:22:17 +00:00 |
|
Peter Salas
|
2be8ec6e71
|
[Model] Add Ultravox support for multiple audio chunks (#7963)
|
2024-09-04 04:38:21 +00:00 |
|
Dipika Sikka
|
2188a60c7e
|
[Misc] Update GPTQ to use vLLMParameters (#7976)
|
2024-09-03 17:21:44 -04:00 |
|
Alexander Matveev
|
6d646d08a2
|
[Core] Optimize Async + Multi-step (#8050)
|
2024-09-03 18:50:29 +00:00 |
|
wang.yuqi
|
6e36f4fa6c
|
improve chunked prefill performance
[Bugfix] Fix #7592 vllm 0.5.4 enable_chunked_prefill throughput is slightly lower than 0.5.3~0.5.0. (#7874)
|
2024-09-02 14:20:12 -07:00 |
|
Lily Liu
|
e6a26ed037
|
[SpecDecode][Kernel] Flashinfer Rejection Sampling (#7244)
|
2024-09-01 21:23:29 -07:00 |
|
Shawn Tan
|
f8d60145b4
|
[Model] Add Granite model (#7436)
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
|
2024-09-01 18:37:18 -07:00 |
|
Roger Wang
|
5b86b19954
|
[Misc] Optional installation of audio related packages (#8063)
|
2024-09-01 14:46:57 -07:00 |
|
Roger Wang
|
5231f0898e
|
[Frontend][VLM] Add support for multiple multi-modal items (#8049)
|
2024-08-31 16:35:53 -07:00 |
|
Pavani Majety
|
622f8abff8
|
[Bugfix] bugfix and add model test for flashinfer fp8 kv cache. (#8013)
|
2024-08-30 22:18:50 -07:00 |
|
Wenxiang
|
1248e8506a
|
[Model] Adding support for MSFT Phi-3.5-MoE (#7729)
Co-authored-by: Your Name <you@example.com>
Co-authored-by: Zeqi Lin <zelin@microsoft.com>
Co-authored-by: Zeqi Lin <Zeqi.Lin@microsoft.com>
|
2024-08-30 13:42:57 -06:00 |
|
Kaunil Dhruv
|
058344f89a
|
[Frontend]-config-cli-args (#7737)
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Kaunil Dhruv <kaunil_dhruv@intuit.com>
|
2024-08-30 08:21:02 -07:00 |
|
Jungho Christopher Cho
|
f97be32d1d
|
[VLM][Model] TP support for ViTs (#7186)
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
|
2024-08-30 08:19:27 -07:00 |
|
afeldman-nm
|
428dd1445e
|
[Core] Logprobs support in Multi-step (#7652)
|
2024-08-29 19:19:08 -07:00 |
|
Cyrus Leung
|
4abed65c58
|
[VLM] Disallow overflowing max_model_len for multimodal models (#7998)
|
2024-08-29 17:49:04 -07:00 |
|
chenqianfzh
|
4664ceaad6
|
support bitsandbytes 8-bit and FP4 quantized models (#7445)
|
2024-08-29 19:09:08 -04:00 |
|
Pavani Majety
|
6b3421567d
|
[Core][Kernels] Enable FP8 KV Cache with Flashinfer backend. + BugFix for kv_cache_dtype=auto (#7985)
Co-authored-by: Simon Mo <simon.mo@hey.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
|
2024-08-29 14:53:11 -04:00 |
|
Alexander Matveev
|
3f60f2244e
|
[Core] Combine async postprocessor and multi-step (#7921)
|
2024-08-29 11:18:26 -07:00 |
|
Jonas M. Kübler
|
f205c09854
|
[Bugfix] Unify rank computation across regular decoding and speculative decoding (#7899)
|
2024-08-28 22:18:13 -07:00 |
|
youkaichao
|
ef99a78760
|
Revert "[Core][Kernels] Use FlashInfer backend for FP8 KV Cache when available." (#7982)
|
2024-08-28 21:27:06 -07:00 |
|
Peter Salas
|
74d5543ec5
|
[VLM][Core] Fix exceptions on ragged NestedTensors (#7974)
|
2024-08-29 03:24:31 +00:00 |
|
youkaichao
|
a7f65c2be9
|
[torch.compile] remove reset (#7975)
|
2024-08-28 17:32:26 -07:00 |
|
youkaichao
|
ce6bf3a2cf
|
[torch.compile] avoid Dynamo guard evaluation overhead (#7898)
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
|
2024-08-28 16:10:12 -07:00 |
|
Mor Zusman
|
fdd9daafa3
|
[Kernel/Model] Migrate mamba_ssm and causal_conv1d kernels to vLLM (#7651)
|
2024-08-28 15:06:52 -07:00 |
|
rasmith
|
e5697d161c
|
[Kernel] [Triton] [AMD] Adding Triton implementations awq_dequantize and awq_gemm to support AWQ (#7386)
|
2024-08-28 15:37:47 -04:00 |
|
Pavani Majety
|
b98cc28f91
|
[Core][Kernels] Use FlashInfer backend for FP8 KV Cache when available. (#7798)
Co-authored-by: Simon Mo <simon.mo@hey.com>
|
2024-08-28 10:01:22 -07:00 |
|
Cody Yu
|
e3580537a4
|
[Performance] Enable chunked prefill and prefix caching together (#7753)
|
2024-08-28 00:36:31 -07:00 |
|
Cyrus Leung
|
51f86bf487
|
[mypy][CI/Build] Fix mypy errors (#7929)
|
2024-08-27 23:47:44 -07:00 |
|
Peter Salas
|
fab5f53e2d
|
[Core][VLM] Stack multimodal tensors to represent multiple images within each prompt (#7902)
|
2024-08-28 01:53:56 +00:00 |
|
zifeitong
|
5340a2dccf
|
[Model] Add multi-image input support for LLaVA-Next offline inference (#7230)
|
2024-08-28 07:09:02 +08:00 |
|
Dipika Sikka
|
fc911880cc
|
[Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel (#7766)
Co-authored-by: ElizaWszola <eliza@neuralmagic.com>
|
2024-08-27 15:07:09 -07:00 |
|
Isotr0py
|
9db642138b
|
[CI/Build][VLM] Cleanup multiple images inputs model test (#7897)
|
2024-08-27 15:28:30 +00:00 |
|
Patrick von Platen
|
6fc4e6e07a
|
[Model] Add Mistral Tokenization to improve robustness and chat encoding (#7739)
|
2024-08-27 12:40:02 +00:00 |
|
youkaichao
|
64cc644425
|
[core][torch.compile] discard the compile for profiling (#7796)
|
2024-08-26 21:33:58 -07:00 |
|
Nick Hill
|
39178c7fbc
|
[Tests] Disable retries and use context manager for openai client (#7565)
|
2024-08-26 21:33:17 -07:00 |
|
Megha Agarwal
|
2eedede875
|
[Core] Asynchronous Output Processor (#7049)
Co-authored-by: Alexander Matveev <alexm@neuralmagic.com>
|
2024-08-26 20:53:20 -07:00 |
|
Dipika Sikka
|
665304092d
|
[Misc] Update qqq to use vLLMParameters (#7805)
|
2024-08-26 13:16:15 -06:00 |
|
Cody Yu
|
2deb029d11
|
[Performance][BlockManagerV2] Mark prefix cache block as computed after schedule (#7822)
|
2024-08-26 11:24:53 -07:00 |
|
Cyrus Leung
|
029c71de11
|
[CI/Build] Avoid downloading all HF files in RemoteOpenAIServer (#7836)
|
2024-08-26 05:31:10 +00:00 |
|
ℍ𝕠𝕝𝕝𝕠𝕨 𝕄𝕒𝕟
|
0b769992ec
|
[Bugfix]: Use float32 for base64 embedding (#7855)
Signed-off-by: Hollow Man <hollowman@opensuse.org>
|
2024-08-26 03:16:38 +00:00 |
|
Nick Hill
|
1856aff4d6
|
[Spec Decoding] Streamline batch expansion tensor manipulation (#7851)
|
2024-08-25 15:45:14 -07:00 |
|
Isotr0py
|
2059b8d9ca
|
[Misc] Remove snapshot_download usage in InternVL2 test (#7835)
|
2024-08-25 15:53:09 +00:00 |
|
Isotr0py
|
8aaf3d5347
|
[Model][VLM] Support multi-images inputs for Phi-3-vision models (#7783)
|
2024-08-25 11:51:20 +00:00 |
|
zifeitong
|
80162c44b1
|
[Bugfix] Fix Phi-3v crash when input images are of certain sizes (#7840)
|
2024-08-24 18:16:24 -07:00 |
|
youkaichao
|
aab0fcdb63
|
[ci][test] fix RemoteOpenAIServer (#7838)
|
2024-08-24 17:31:28 +00:00 |
|
youkaichao
|
ea9fa160e3
|
[ci][test] exclude model download time in server start time (#7834)
|
2024-08-24 01:03:27 -07:00 |
|
youkaichao
|
7d9ffa2ae1
|
[misc][core] lazy import outlines (#7831)
|
2024-08-24 00:51:38 -07:00 |
|
Tyler Rockwood
|
d81abefd2e
|
[Frontend] add json_schema support from OpenAI protocol (#7654)
|
2024-08-23 23:07:24 -07:00 |
|
Pooya Davoodi
|
8da48e4d95
|
[Frontend] Publish Prometheus metrics in run_batch API (#7641)
|
2024-08-23 23:04:22 -07:00 |
|
Alexander Matveev
|
9db93de20c
|
[Core] Add multi-step support to LLMEngine (#7789)
|
2024-08-23 12:45:53 -07:00 |
|
Dipika Sikka
|
f1df5dbfd6
|
[Misc] Update marlin to use vLLMParameters (#7803)
|
2024-08-23 14:30:52 -04:00 |
|
Maximilien de Bayser
|
e25fee57c2
|
[BugFix] Fix server crash on empty prompt (#7746)
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
|
2024-08-23 13:12:44 +00:00 |
|
SangBin Cho
|
c01a6cb231
|
[Ray backend] Better error when pg topology is bad. (#7584)
Co-authored-by: youkaichao <youkaichao@126.com>
|
2024-08-22 17:44:25 -07:00 |
|
Joe Runde
|
b903e1ba7f
|
[Frontend] error suppression cleanup (#7786)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
|
2024-08-22 21:50:21 +00:00 |
|
Travis Johnson
|
cc0eaf12b1
|
[Bugfix] spec decode handle None entries in topk args in create_sequence_group_output (#7232)
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
|
2024-08-22 09:33:48 -04:00 |
|
Dipika Sikka
|
955b5191c9
|
[Misc] update fp8 to use vLLMParameter (#7437)
|
2024-08-22 08:36:18 -04:00 |
|
Abhinav Goyal
|
a3fce56b88
|
[Speculative Decoding] EAGLE Implementation with Top-1 proposer (#6830)
|
2024-08-22 02:42:24 -07:00 |
|
Michael Goin
|
aae74ef95c
|
Revert "[Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel (#7527)" (#7764)
|
2024-08-22 03:42:14 +00:00 |
|
Joe Runde
|
cde9183b40
|
[Bug][Frontend] Improve ZMQ client robustness (#7443)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
|
2024-08-22 02:18:11 +00:00 |
|
zifeitong
|
df1a21131d
|
[Model] Fix Phi-3.5-vision-instruct 'num_crops' issue (#7710)
|
2024-08-22 09:36:24 +08:00 |
|
Luka Govedič
|
7937009a7e
|
[Kernel] Replaced blockReduce[...] functions with cub::BlockReduce (#7233)
Co-authored-by: Michael Goin <michael@neuralmagic.com>
|
2024-08-21 20:18:00 -04:00 |
|
Dipika Sikka
|
8678a69ab5
|
[Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel (#7527)
Co-authored-by: ElizaWszola <eliza@neuralmagic.com>
|
2024-08-21 16:17:10 -07:00 |
|
Peter Salas
|
1ca0d4f86b
|
[Model] Add UltravoxModel and UltravoxConfig (#7615)
|
2024-08-21 22:49:39 +00:00 |
|
Robert Shaw
|
970dfdc01d
|
[Frontend] Improve Startup Failure UX (#7716)
|
2024-08-21 19:53:01 +00:00 |
|
Robert Shaw
|
f7e3b0c5aa
|
[Bugfix][Frontend] Fix Issues Under High Load With zeromq Frontend (#7394)
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
|
2024-08-21 13:34:14 -04:00 |
|
LI MOU
|
53328d7536
|
[BUG] fix crash on flashinfer backend with cudagraph disabled, when attention group_size not in [1,2,4,8] (#7509)
|
2024-08-21 08:54:31 -07:00 |
|
Nick Hill
|
c75363fbc0
|
[BugFix] Avoid premature async generator exit and raise all exception variations (#7698)
|
2024-08-21 11:45:55 -04:00 |
|
Cyrus Leung
|
baaedfdb2d
|
[mypy] Enable following imports for entrypoints (#7248)
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: Fei <dfdfcai4@gmail.com>
|
2024-08-20 23:28:21 -07:00 |
|
Isotr0py
|
12e1c65bc9
|
[Model] Add AWQ quantization support for InternVL2 model (#7187)
|
2024-08-20 23:18:57 -07:00 |
|
youkaichao
|
9e51b6a626
|
[ci][test] adjust max wait time for cpu offloading test (#7709)
|
2024-08-20 17:12:44 -07:00 |
|
Antoni Baum
|
3b682179dd
|
[Core] Add AttentionState abstraction (#7663)
|
2024-08-20 18:50:45 +00:00 |
|
Isotr0py
|
aae6927be0
|
[VLM][Model] Add test for InternViT vision encoder (#7409)
|
2024-08-20 23:10:20 +08:00 |
|
Lucas Wilkinson
|
5288c06aa0
|
[Kernel] (1/N) Machete - Hopper Optimized Mixed Precision Linear Kernel (#7174)
|
2024-08-20 07:09:33 -06:00 |
|
Abhinav Goyal
|
312f761232
|
[Speculative Decoding] Fixing hidden states handling in batch expansion (#7508)
|
2024-08-19 17:58:14 -07:00 |
|
Isotr0py
|
7601cb044d
|
[Core] Support tensor parallelism for GGUF quantization (#7520)
|
2024-08-19 17:30:14 -04:00 |
|
William Lin
|
47b65a5508
|
[core] Multi Step Scheduling (#7000)
Co-authored-by: afeldman-nm <156691304+afeldman-nm@users.noreply.github.com>
|
2024-08-19 13:52:13 -07:00 |
|
Cody Yu
|
3ac50b47d0
|
[MISC] Add prefix cache hit rate to metrics (#7606)
|
2024-08-19 11:52:07 -07:00 |
|
Peng Guanwen
|
f710fb5265
|
[Core] Use flashinfer sampling kernel when available (#7137)
Co-authored-by: Michael Goin <michael@neuralmagic.com>
|
2024-08-19 03:24:03 +00:00 |
|
SangBin Cho
|
ff7ec82c4d
|
[Core] Optimize SPMD architecture with delta + serialization optimization (#7109)
|
2024-08-18 17:57:20 -07:00 |
|
Alex Brooks
|
40e1360bb6
|
[CI/Build] Add text-only test for Qwen models (#7475)
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
|
2024-08-19 07:43:46 +08:00 |
|
Robert Shaw
|
e3b318216d
|
[ Bugfix ] Fix Prometheus Metrics With zeromq Frontend (#7279)
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
|
2024-08-18 20:19:48 +00:00 |
|
Roger Wang
|
bbf55c4805
|
[VLM] Refactor MultiModalConfig initialization and profiling (#7530)
|
2024-08-17 13:30:55 -07:00 |
|
youkaichao
|
832163b875
|
[ci][test] allow longer wait time for api server (#7629)
|
2024-08-17 11:26:38 -07:00 |
|