Yangshen⚡Deng
6a512a00df
[model] Support for Llava-Next-Video model ( #7559 )
...
Co-authored-by: Roger Wang <ywang@roblox.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-09-10 22:21:36 -07:00
Pavani Majety
efcf946a15
[Hardware][NV] Add support for ModelOpt static scaling checkpoints. ( #6112 )
2024-09-11 00:38:40 -04:00
Isotr0py
1230263e16
[Bugfix] Fix InternVL2 vision embeddings process with pipeline parallel ( #8299 )
2024-09-11 10:11:01 +08:00
Cyrus Leung
8c054b7a62
[Frontend] Clean up type annotations for mistral tokenizer ( #8314 )
2024-09-10 16:49:11 +00:00
Dipika Sikka
6cd5e5b07e
[Misc] Fused MoE Marlin support for GPTQ ( #8217 )
2024-09-09 23:02:52 -04:00
Kyle Sayers
c7cb5c3335
[Misc] GPTQ Activation Ordering ( #8135 )
2024-09-09 16:27:26 -04:00
Kyle Mistele
08287ef675
[Bugfix] Streamed tool calls now more strictly follow OpenAI's format; ensures Vercel AI SDK compatibility ( #8272 )
2024-09-09 10:45:11 -04:00
Joe Runde
cfe712bf1a
[CI/Build] Use python 3.12 in cuda image ( #8133 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-09-07 13:03:16 -07:00
Isotr0py
e807125936
[Model][VLM] Support multi-images inputs for InternVL2 models ( #8201 )
2024-09-07 16:38:23 +08:00
Cyrus Leung
9f68e00d27
[Bugfix] Fix broken OpenAI tensorizer test ( #8258 )
2024-09-07 08:02:39 +00:00
youkaichao
ce2702a923
[tpu][misc] fix typo ( #8260 )
2024-09-06 22:40:46 -07:00
Cyrus Leung
2f707fcb35
[Model] Multi-input support for LLaVA ( #8238 )
2024-09-07 02:57:24 +00:00
Patrick von Platen
29f49cd6e3
[Model] Allow loading from original Mistral format ( #8168 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-09-06 17:02:05 -06:00
Alexey Kondratiev(AMD)
1447c97e75
[CI/Build] Increasing timeout for multiproc worker tests ( #8203 )
2024-09-06 11:51:03 -07:00
afeldman-nm
e5cab71531
[Frontend] Add --logprobs argument to benchmark_serving.py ( #8191 )
2024-09-06 09:01:14 -07:00
Jiaxin Shan
db3bf7c991
[Core] Support load and unload LoRA in api server ( #6566 )
...
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
2024-09-05 18:10:33 -07:00
Alex Brooks
9da25a88aa
[MODEL] Qwen Multimodal Support (Qwen-VL / Qwen-VL-Chat) ( #8029 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-09-05 12:48:10 +00:00
manikandan.tm@zucisystems.com
8685ba1a1e
Inclusion of InternVLChatModel In PP_SUPPORTED_MODELS(Pipeline Parallelism) ( #7860 )
2024-09-05 11:33:37 +00:00
Elfie Guo
e39ebf5cf5
[Core/Bugfix] Add query dtype as per FlashInfer API requirements. ( #8173 )
2024-09-05 05:12:26 +00:00
Kyle Mistele
e02ce498be
[Feature] OpenAI-Compatible Tools API + Streaming for Hermes & Mistral models ( #5649 )
...
Co-authored-by: constellate <constellate@1-ai-appserver-staging.codereach.com>
Co-authored-by: Kyle Mistele <kyle@constellate.ai>
2024-09-04 13:18:13 -07:00
Woosuk Kwon
561d6f8077
[CI] Change test input in Gemma LoRA test ( #8163 )
2024-09-04 13:05:50 -07:00
alexeykondrat
d1dec64243
[CI/Build][ROCm] Enabling LoRA tests on ROCm ( #7369 )
...
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-09-04 11:57:54 -07:00
Cody Yu
2ad2e5608e
[MISC] Consolidate FP8 kv-cache tests ( #8131 )
2024-09-04 18:53:25 +00:00
Cyrus Leung
855c262a6b
[Frontend] Multimodal support in offline chat ( #8098 )
2024-09-04 05:22:17 +00:00
Peter Salas
2be8ec6e71
[Model] Add Ultravox support for multiple audio chunks ( #7963 )
2024-09-04 04:38:21 +00:00
Dipika Sikka
2188a60c7e
[Misc] Update GPTQ to use vLLMParameters ( #7976 )
2024-09-03 17:21:44 -04:00
Alexander Matveev
6d646d08a2
[Core] Optimize Async + Multi-step ( #8050 )
2024-09-03 18:50:29 +00:00
wang.yuqi
6e36f4fa6c
improve chunked prefill performance
...
[Bugfix] Fix #7592 vllm 0.5.4 enable_chunked_prefill throughput is slightly lower than 0.5.3~0.5.0. (#7874 )
2024-09-02 14:20:12 -07:00
Lily Liu
e6a26ed037
[SpecDecode][Kernel] Flashinfer Rejection Sampling ( #7244 )
2024-09-01 21:23:29 -07:00
Shawn Tan
f8d60145b4
[Model] Add Granite model ( #7436 )
...
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2024-09-01 18:37:18 -07:00
Roger Wang
5b86b19954
[Misc] Optional installation of audio related packages ( #8063 )
2024-09-01 14:46:57 -07:00
Roger Wang
5231f0898e
[Frontend][VLM] Add support for multiple multi-modal items ( #8049 )
2024-08-31 16:35:53 -07:00
Pavani Majety
622f8abff8
[Bugfix] bugfix and add model test for flashinfer fp8 kv cache. ( #8013 )
2024-08-30 22:18:50 -07:00
Wenxiang
1248e8506a
[Model] Adding support for MSFT Phi-3.5-MoE ( #7729 )
...
Co-authored-by: Your Name <you@example.com>
Co-authored-by: Zeqi Lin <zelin@microsoft.com>
Co-authored-by: Zeqi Lin <Zeqi.Lin@microsoft.com>
2024-08-30 13:42:57 -06:00
Kaunil Dhruv
058344f89a
[Frontend]-config-cli-args ( #7737 )
...
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Kaunil Dhruv <kaunil_dhruv@intuit.com>
2024-08-30 08:21:02 -07:00
Jungho Christopher Cho
f97be32d1d
[VLM][Model] TP support for ViTs ( #7186 )
...
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-08-30 08:19:27 -07:00
afeldman-nm
428dd1445e
[Core] Logprobs support in Multi-step ( #7652 )
2024-08-29 19:19:08 -07:00
Cyrus Leung
4abed65c58
[VLM] Disallow overflowing max_model_len for multimodal models ( #7998 )
2024-08-29 17:49:04 -07:00
chenqianfzh
4664ceaad6
support bitsandbytes 8-bit and FP4 quantized models ( #7445 )
2024-08-29 19:09:08 -04:00
Pavani Majety
6b3421567d
[Core][Kernels] Enable FP8 KV Cache with Flashinfer backend. + BugFix for kv_cache_dtype=auto ( #7985 )
...
Co-authored-by: Simon Mo <simon.mo@hey.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2024-08-29 14:53:11 -04:00
Alexander Matveev
3f60f2244e
[Core] Combine async postprocessor and multi-step ( #7921 )
2024-08-29 11:18:26 -07:00
Jonas M. Kübler
f205c09854
[Bugfix] Unify rank computation across regular decoding and speculative decoding ( #7899 )
2024-08-28 22:18:13 -07:00
youkaichao
ef99a78760
Revert "[Core][Kernels] Use FlashInfer backend for FP8 KV Cache when available." ( #7982 )
2024-08-28 21:27:06 -07:00
Peter Salas
74d5543ec5
[VLM][Core] Fix exceptions on ragged NestedTensors ( #7974 )
2024-08-29 03:24:31 +00:00
youkaichao
a7f65c2be9
[torch.compile] remove reset ( #7975 )
2024-08-28 17:32:26 -07:00
youkaichao
ce6bf3a2cf
[torch.compile] avoid Dynamo guard evaluation overhead ( #7898 )
...
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-08-28 16:10:12 -07:00
Mor Zusman
fdd9daafa3
[Kernel/Model] Migrate mamba_ssm and causal_conv1d kernels to vLLM ( #7651 )
2024-08-28 15:06:52 -07:00
rasmith
e5697d161c
[Kernel] [Triton] [AMD] Adding Triton implementations awq_dequantize and awq_gemm to support AWQ ( #7386 )
2024-08-28 15:37:47 -04:00
Pavani Majety
b98cc28f91
[Core][Kernels] Use FlashInfer backend for FP8 KV Cache when available. ( #7798 )
...
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-08-28 10:01:22 -07:00
Cody Yu
e3580537a4
[Performance] Enable chunked prefill and prefix caching together ( #7753 )
2024-08-28 00:36:31 -07:00
Cyrus Leung
51f86bf487
[mypy][CI/Build] Fix mypy errors ( #7929 )
2024-08-27 23:47:44 -07:00
Peter Salas
fab5f53e2d
[Core][VLM] Stack multimodal tensors to represent multiple images within each prompt ( #7902 )
2024-08-28 01:53:56 +00:00
zifeitong
5340a2dccf
[Model] Add multi-image input support for LLaVA-Next offline inference ( #7230 )
2024-08-28 07:09:02 +08:00
Dipika Sikka
fc911880cc
[Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel ( #7766 )
...
Co-authored-by: ElizaWszola <eliza@neuralmagic.com>
2024-08-27 15:07:09 -07:00
Isotr0py
9db642138b
[CI/Build][VLM] Cleanup multiple images inputs model test ( #7897 )
2024-08-27 15:28:30 +00:00
Patrick von Platen
6fc4e6e07a
[Model] Add Mistral Tokenization to improve robustness and chat encoding ( #7739 )
2024-08-27 12:40:02 +00:00
youkaichao
64cc644425
[core][torch.compile] discard the compile for profiling ( #7796 )
2024-08-26 21:33:58 -07:00
Nick Hill
39178c7fbc
[Tests] Disable retries and use context manager for openai client ( #7565 )
2024-08-26 21:33:17 -07:00
Megha Agarwal
2eedede875
[Core] Asynchronous Output Processor ( #7049 )
...
Co-authored-by: Alexander Matveev <alexm@neuralmagic.com>
2024-08-26 20:53:20 -07:00
Dipika Sikka
665304092d
[Misc] Update qqq to use vLLMParameters ( #7805 )
2024-08-26 13:16:15 -06:00
Cody Yu
2deb029d11
[Performance][BlockManagerV2] Mark prefix cache block as computed after schedule ( #7822 )
2024-08-26 11:24:53 -07:00
Cyrus Leung
029c71de11
[CI/Build] Avoid downloading all HF files in RemoteOpenAIServer ( #7836 )
2024-08-26 05:31:10 +00:00
ℍ𝕠𝕝𝕝𝕠𝕨 𝕄𝕒𝕟
0b769992ec
[Bugfix]: Use float32 for base64 embedding ( #7855 )
...
Signed-off-by: Hollow Man <hollowman@opensuse.org>
2024-08-26 03:16:38 +00:00
Nick Hill
1856aff4d6
[Spec Decoding] Streamline batch expansion tensor manipulation ( #7851 )
2024-08-25 15:45:14 -07:00
Isotr0py
2059b8d9ca
[Misc] Remove snapshot_download usage in InternVL2 test ( #7835 )
2024-08-25 15:53:09 +00:00
Isotr0py
8aaf3d5347
[Model][VLM] Support multi-images inputs for Phi-3-vision models ( #7783 )
2024-08-25 11:51:20 +00:00
zifeitong
80162c44b1
[Bugfix] Fix Phi-3v crash when input images are of certain sizes ( #7840 )
2024-08-24 18:16:24 -07:00
youkaichao
aab0fcdb63
[ci][test] fix RemoteOpenAIServer ( #7838 )
2024-08-24 17:31:28 +00:00
youkaichao
ea9fa160e3
[ci][test] exclude model download time in server start time ( #7834 )
2024-08-24 01:03:27 -07:00
youkaichao
7d9ffa2ae1
[misc][core] lazy import outlines ( #7831 )
2024-08-24 00:51:38 -07:00
Tyler Rockwood
d81abefd2e
[Frontend] add json_schema support from OpenAI protocol ( #7654 )
2024-08-23 23:07:24 -07:00
Pooya Davoodi
8da48e4d95
[Frontend] Publish Prometheus metrics in run_batch API ( #7641 )
2024-08-23 23:04:22 -07:00
Alexander Matveev
9db93de20c
[Core] Add multi-step support to LLMEngine ( #7789 )
2024-08-23 12:45:53 -07:00
Dipika Sikka
f1df5dbfd6
[Misc] Update marlin to use vLLMParameters ( #7803 )
2024-08-23 14:30:52 -04:00
Maximilien de Bayser
e25fee57c2
[BugFix] Fix server crash on empty prompt ( #7746 )
...
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
2024-08-23 13:12:44 +00:00
SangBin Cho
c01a6cb231
[Ray backend] Better error when pg topology is bad. ( #7584 )
...
Co-authored-by: youkaichao <youkaichao@126.com>
2024-08-22 17:44:25 -07:00
Joe Runde
b903e1ba7f
[Frontend] error suppression cleanup ( #7786 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-08-22 21:50:21 +00:00
Travis Johnson
cc0eaf12b1
[Bugfix] spec decode handle None entries in topk args in create_sequence_group_output ( #7232 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
2024-08-22 09:33:48 -04:00
Dipika Sikka
955b5191c9
[Misc] update fp8 to use vLLMParameter ( #7437 )
2024-08-22 08:36:18 -04:00
Abhinav Goyal
a3fce56b88
[Speculative Decoding] EAGLE Implementation with Top-1 proposer ( #6830 )
2024-08-22 02:42:24 -07:00
Michael Goin
aae74ef95c
Revert "[Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel ( #7527 )" ( #7764 )
2024-08-22 03:42:14 +00:00
Joe Runde
cde9183b40
[Bug][Frontend] Improve ZMQ client robustness ( #7443 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-08-22 02:18:11 +00:00
zifeitong
df1a21131d
[Model] Fix Phi-3.5-vision-instruct 'num_crops' issue ( #7710 )
2024-08-22 09:36:24 +08:00
Luka Govedič
7937009a7e
[Kernel] Replaced blockReduce[...] functions with cub::BlockReduce ( #7233 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-08-21 20:18:00 -04:00
Dipika Sikka
8678a69ab5
[Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel ( #7527 )
...
Co-authored-by: ElizaWszola <eliza@neuralmagic.com>
2024-08-21 16:17:10 -07:00
Peter Salas
1ca0d4f86b
[Model] Add UltravoxModel and UltravoxConfig ( #7615 )
2024-08-21 22:49:39 +00:00
Robert Shaw
970dfdc01d
[Frontend] Improve Startup Failure UX ( #7716 )
2024-08-21 19:53:01 +00:00
Robert Shaw
f7e3b0c5aa
[Bugfix][Frontend] Fix Issues Under High Load With zeromq Frontend ( #7394 )
...
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2024-08-21 13:34:14 -04:00
LI MOU
53328d7536
[BUG] fix crash on flashinfer backend with cudagraph disabled, when attention group_size not in [1,2,4,8] ( #7509 )
2024-08-21 08:54:31 -07:00
Nick Hill
c75363fbc0
[BugFix] Avoid premature async generator exit and raise all exception variations ( #7698 )
2024-08-21 11:45:55 -04:00
Cyrus Leung
baaedfdb2d
[mypy] Enable following imports for entrypoints ( #7248 )
...
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: Fei <dfdfcai4@gmail.com>
2024-08-20 23:28:21 -07:00
Isotr0py
12e1c65bc9
[Model] Add AWQ quantization support for InternVL2 model ( #7187 )
2024-08-20 23:18:57 -07:00
youkaichao
9e51b6a626
[ci][test] adjust max wait time for cpu offloading test ( #7709 )
2024-08-20 17:12:44 -07:00
Antoni Baum
3b682179dd
[Core] Add AttentionState abstraction ( #7663 )
2024-08-20 18:50:45 +00:00
Isotr0py
aae6927be0
[VLM][Model] Add test for InternViT vision encoder ( #7409 )
2024-08-20 23:10:20 +08:00
Lucas Wilkinson
5288c06aa0
[Kernel] (1/N) Machete - Hopper Optimized Mixed Precision Linear Kernel ( #7174 )
2024-08-20 07:09:33 -06:00
Abhinav Goyal
312f761232
[Speculative Decoding] Fixing hidden states handling in batch expansion ( #7508 )
2024-08-19 17:58:14 -07:00
Isotr0py
7601cb044d
[Core] Support tensor parallelism for GGUF quantization ( #7520 )
2024-08-19 17:30:14 -04:00
William Lin
47b65a5508
[core] Multi Step Scheduling ( #7000 )
...
Co-authored-by: afeldman-nm <156691304+afeldman-nm@users.noreply.github.com>
2024-08-19 13:52:13 -07:00
Cody Yu
3ac50b47d0
[MISC] Add prefix cache hit rate to metrics ( #7606 )
2024-08-19 11:52:07 -07:00