Commit Graph

988 Commits

Author SHA1 Message Date
Patrick von Platen
b4e4eda92e
[Bugfix][Core] Fix tekken edge case for mistral tokenizer (#8640) 2024-09-20 14:33:03 -07:00
Jiaxin Shan
260d40b5ea
[Core] Support Lora lineage and base model metadata management (#6315) 2024-09-20 06:20:56 +00:00
Charlie Fu
9cc373f390
[Kernel][Amd] Add fp8 kv cache support for rocm custom paged attention (#8577) 2024-09-19 17:37:57 +00:00
sroy745
3118f63385
[Bugfix] [Encoder-Decoder] Bugfix for encoder specific metadata construction during decode of encoder-decoder models. (#8545) 2024-09-19 02:24:15 +00:00
Tyler Michael Smith
db9120cded
[Kernel] Change interface to Mamba selective_state_update for continuous batching (#8039) 2024-09-18 20:05:06 +00:00
afeldman-nm
a8c1d161a7
[Core] *Prompt* logprobs support in Multi-step (#8199) 2024-09-18 08:38:43 -07:00
Alexander Matveev
7c7714d856
[Core][Bugfix][Perf] Introduce MQLLMEngine to avoid asyncio OH (#8157)
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-09-18 13:56:58 +00:00
Aaron Pham
9d104b5beb
[CI/Build] Update Ruff version (#8469)
Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-09-18 11:00:56 +00:00
Cyrus Leung
6ffa3f314c
[CI/Build] Avoid CUDA initialization (#8534) 2024-09-18 10:38:11 +00:00
Tyler Michael Smith
8110e44529
[Kernel] Change interface to Mamba causal_conv1d_update for continuous batching (#8012) 2024-09-17 23:44:27 +00:00
youkaichao
fa0c114fad
[doc] improve installation doc (#8550)
Co-authored-by: Andy Dai <76841985+Imss27@users.noreply.github.com>
2024-09-17 16:24:06 -07:00
Patrick von Platen
a54ed80249
[Model] Add mistral function calling format to all models loaded with "mistral" format (#8515)
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-09-17 17:50:37 +00:00
chenqianfzh
9855b99502
[Feature][kernel] tensor parallelism with bitsandbytes quantization (#8434) 2024-09-17 08:09:12 -07:00
sroy745
1009e93c5d
[Encoder decoder] Add cuda graph support during decoding for encoder-decoder models (#7631) 2024-09-17 07:35:01 -07:00
youkaichao
99aa4eddaf
[torch.compile] register allreduce operations as custom ops (#8526) 2024-09-16 22:57:57 -07:00
Alex Brooks
1c1bb388e0
[Frontend] Improve Nullable kv Arg Parsing (#8525)
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
2024-09-17 04:17:32 +00:00
Simon Mo
546034b466
[refactor] remove triton based sampler (#8524) 2024-09-16 20:04:48 -07:00
Luka Govedič
5d73ae49d6
[Kernel] AQ AZP 3/4: Asymmetric quantization kernels (#7270) 2024-09-16 11:52:40 -07:00
Nick Hill
acd5511b6d
[BugFix] Fix clean shutdown issues (#8492) 2024-09-16 09:33:46 -07:00
ElizaWszola
a091e2da3e
[Kernel] Enable 8-bit weights in Fused Marlin MoE (#8032)
Co-authored-by: Dipika <dipikasikka1@gmail.com>
2024-09-16 09:47:19 -06:00
Isotr0py
fc990f9795
[Bugfix][Kernel] Add IQ1_M quantization implementation to GGUF kernel (#8357) 2024-09-15 16:51:44 -06:00
youkaichao
47790f3e32
[torch.compile] add a flag to disable custom op (#8488) 2024-09-14 13:07:16 -07:00
youkaichao
a36e070dad
[torch.compile] fix functionalization (#8480) 2024-09-14 09:46:04 -07:00
ywfang
8a0cf1ddc3
[Model] support minicpm3 (#8297)
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-09-14 14:50:26 +00:00
Charlie Fu
1ef0d2efd0
[Kernel][Hardware][Amd]Custom paged attention kernel for rocm (#8310) 2024-09-13 17:01:11 -07:00
Nick Hill
18e9e1f7b3
[HotFix] Fix final output truncation with stop string + streaming (#8468) 2024-09-13 11:31:12 -07:00
Cyrus Leung
a84e598e21
[CI/Build] Reorganize models tests (#7820) 2024-09-13 10:20:06 -07:00
youkaichao
a2469127db
[misc][ci] fix quant test (#8449) 2024-09-13 17:20:14 +08:00
Isotr0py
9b4a3b235e
[CI/Build] Enable InternVL2 PP test only on single node (#8437) 2024-09-13 06:35:20 +00:00
Alexander Matveev
6821020109
[Bugfix] Fix async log stats (#8417) 2024-09-12 20:48:59 -07:00
Cyrus Leung
8427550488
[CI/Build] Update pixtral tests to use JSON (#8436) 2024-09-13 03:47:52 +00:00
shangmingc
40c396533d
[Bugfix] Mapping physical device indices for e2e test utils (#8290) 2024-09-13 11:06:28 +08:00
Cyrus Leung
5ec9c0fb3c
[Core] Factor out input preprocessing to a separate class (#7329) 2024-09-13 02:56:13 +00:00
Patrick von Platen
d31174a4e1
[Hotfix][Pixtral] Fix multiple images bugs (#8415) 2024-09-12 15:21:51 -07:00
Roger Wang
b61bd98f90
[CI/Build] Disable multi-node test for InternVL2 (#8428) 2024-09-12 15:05:35 -07:00
Nick Hill
551ce01078
[Core] Add engine option to return only deltas or final output (#7381) 2024-09-12 12:02:00 -07:00
William Lin
a6c0f3658d
[multi-step] add flashinfer backend (#7928) 2024-09-12 11:16:22 -07:00
Joe Runde
f2e263b801
[Bugfix] Offline mode fix (#8376)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-09-12 11:11:57 -07:00
Alex Brooks
c6202daeed
[Model] Support multiple images for qwen-vl (#8247)
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-09-12 10:10:54 -07:00
Isotr0py
e56bf27741
[Bugfix] Fix InternVL2 inference with various num_patches (#8375)
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-09-12 10:10:35 -07:00
youkaichao
7de49aa86c
[torch.compile] hide slicing under custom op for inductor (#8384) 2024-09-12 00:11:55 -07:00
youkaichao
f842a7aff1
[misc] remove engine_use_ray (#8126) 2024-09-11 18:23:36 -07:00
Cody Yu
a65cb16067
[MISC] Dump model runner inputs when crashing (#8305) 2024-09-12 01:12:25 +00:00
Patrick von Platen
d394787e52
Pixtral (#8377)
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-09-11 14:41:55 -07:00
Lily Liu
775f00f81e
[Speculative Decoding] Test refactor (#8317)
Co-authored-by: youkaichao <youkaichao@126.com>
2024-09-11 14:07:34 -07:00
bnellnm
73202dbe77
[Kernel][Misc] register ops to prevent graph breaks (#6917)
Co-authored-by: Sage Moore <sage@neuralmagic.com>
2024-09-11 12:52:19 -07:00
Li, Jiang
0b952af458
[Hardware][Intel] Support compressed-tensor W8A8 for CPU backend (#7257) 2024-09-11 09:46:46 -07:00
Yang Fan
3b7fea770f
[Model][VLM] Add Qwen2-VL model support (#7905)
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-09-11 09:31:19 -07:00
Pooya Davoodi
cea95dfb94
[Frontend] Create ErrorResponse instead of raising exceptions in run_batch (#8347) 2024-09-11 05:30:11 +00:00
Yangshen⚡Deng
6a512a00df
[model] Support for Llava-Next-Video model (#7559)
Co-authored-by: Roger Wang <ywang@roblox.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-09-10 22:21:36 -07:00
Pavani Majety
efcf946a15
[Hardware][NV] Add support for ModelOpt static scaling checkpoints. (#6112) 2024-09-11 00:38:40 -04:00
Isotr0py
1230263e16
[Bugfix] Fix InternVL2 vision embeddings process with pipeline parallel (#8299) 2024-09-11 10:11:01 +08:00
Cyrus Leung
8c054b7a62
[Frontend] Clean up type annotations for mistral tokenizer (#8314) 2024-09-10 16:49:11 +00:00
Dipika Sikka
6cd5e5b07e
[Misc] Fused MoE Marlin support for GPTQ (#8217) 2024-09-09 23:02:52 -04:00
Kyle Sayers
c7cb5c3335
[Misc] GPTQ Activation Ordering (#8135) 2024-09-09 16:27:26 -04:00
Kyle Mistele
08287ef675
[Bugfix] Streamed tool calls now more strictly follow OpenAI's format; ensures Vercel AI SDK compatibility (#8272) 2024-09-09 10:45:11 -04:00
Joe Runde
cfe712bf1a
[CI/Build] Use python 3.12 in cuda image (#8133)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-09-07 13:03:16 -07:00
Isotr0py
e807125936
[Model][VLM] Support multi-images inputs for InternVL2 models (#8201) 2024-09-07 16:38:23 +08:00
Cyrus Leung
9f68e00d27
[Bugfix] Fix broken OpenAI tensorizer test (#8258) 2024-09-07 08:02:39 +00:00
youkaichao
ce2702a923
[tpu][misc] fix typo (#8260) 2024-09-06 22:40:46 -07:00
Cyrus Leung
2f707fcb35
[Model] Multi-input support for LLaVA (#8238) 2024-09-07 02:57:24 +00:00
Patrick von Platen
29f49cd6e3
[Model] Allow loading from original Mistral format (#8168)
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-09-06 17:02:05 -06:00
Alexey Kondratiev(AMD)
1447c97e75
[CI/Build] Increasing timeout for multiproc worker tests (#8203) 2024-09-06 11:51:03 -07:00
afeldman-nm
e5cab71531
[Frontend] Add --logprobs argument to benchmark_serving.py (#8191) 2024-09-06 09:01:14 -07:00
Jiaxin Shan
db3bf7c991
[Core] Support load and unload LoRA in api server (#6566)
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
2024-09-05 18:10:33 -07:00
Alex Brooks
9da25a88aa
[MODEL] Qwen Multimodal Support (Qwen-VL / Qwen-VL-Chat) (#8029)
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-09-05 12:48:10 +00:00
manikandan.tm@zucisystems.com
8685ba1a1e
Inclusion of InternVLChatModel In PP_SUPPORTED_MODELS(Pipeline Parallelism) (#7860) 2024-09-05 11:33:37 +00:00
Elfie Guo
e39ebf5cf5
[Core/Bugfix] Add query dtype as per FlashInfer API requirements. (#8173) 2024-09-05 05:12:26 +00:00
Kyle Mistele
e02ce498be
[Feature] OpenAI-Compatible Tools API + Streaming for Hermes & Mistral models (#5649)
Co-authored-by: constellate <constellate@1-ai-appserver-staging.codereach.com>
Co-authored-by: Kyle Mistele <kyle@constellate.ai>
2024-09-04 13:18:13 -07:00
Woosuk Kwon
561d6f8077
[CI] Change test input in Gemma LoRA test (#8163) 2024-09-04 13:05:50 -07:00
alexeykondrat
d1dec64243
[CI/Build][ROCm] Enabling LoRA tests on ROCm (#7369)
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-09-04 11:57:54 -07:00
Cody Yu
2ad2e5608e
[MISC] Consolidate FP8 kv-cache tests (#8131) 2024-09-04 18:53:25 +00:00
Cyrus Leung
855c262a6b
[Frontend] Multimodal support in offline chat (#8098) 2024-09-04 05:22:17 +00:00
Peter Salas
2be8ec6e71
[Model] Add Ultravox support for multiple audio chunks (#7963) 2024-09-04 04:38:21 +00:00
Dipika Sikka
2188a60c7e
[Misc] Update GPTQ to use vLLMParameters (#7976) 2024-09-03 17:21:44 -04:00
Alexander Matveev
6d646d08a2
[Core] Optimize Async + Multi-step (#8050) 2024-09-03 18:50:29 +00:00
wang.yuqi
6e36f4fa6c
improve chunked prefill performance
[Bugfix] Fix #7592 vllm 0.5.4 enable_chunked_prefill throughput is slightly lower than 0.5.3~0.5.0. (#7874)
2024-09-02 14:20:12 -07:00
Lily Liu
e6a26ed037
[SpecDecode][Kernel] Flashinfer Rejection Sampling (#7244) 2024-09-01 21:23:29 -07:00
Shawn Tan
f8d60145b4
[Model] Add Granite model (#7436)
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2024-09-01 18:37:18 -07:00
Roger Wang
5b86b19954
[Misc] Optional installation of audio related packages (#8063) 2024-09-01 14:46:57 -07:00
Roger Wang
5231f0898e
[Frontend][VLM] Add support for multiple multi-modal items (#8049) 2024-08-31 16:35:53 -07:00
Pavani Majety
622f8abff8
[Bugfix] bugfix and add model test for flashinfer fp8 kv cache. (#8013) 2024-08-30 22:18:50 -07:00
Wenxiang
1248e8506a
[Model] Adding support for MSFT Phi-3.5-MoE (#7729)
Co-authored-by: Your Name <you@example.com>
Co-authored-by: Zeqi Lin <zelin@microsoft.com>
Co-authored-by: Zeqi Lin <Zeqi.Lin@microsoft.com>
2024-08-30 13:42:57 -06:00
Kaunil Dhruv
058344f89a
[Frontend]-config-cli-args (#7737)
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Kaunil Dhruv <kaunil_dhruv@intuit.com>
2024-08-30 08:21:02 -07:00
Jungho Christopher Cho
f97be32d1d
[VLM][Model] TP support for ViTs (#7186)
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-08-30 08:19:27 -07:00
afeldman-nm
428dd1445e
[Core] Logprobs support in Multi-step (#7652) 2024-08-29 19:19:08 -07:00
Cyrus Leung
4abed65c58
[VLM] Disallow overflowing max_model_len for multimodal models (#7998) 2024-08-29 17:49:04 -07:00
chenqianfzh
4664ceaad6
support bitsandbytes 8-bit and FP4 quantized models (#7445) 2024-08-29 19:09:08 -04:00
Pavani Majety
6b3421567d
[Core][Kernels] Enable FP8 KV Cache with Flashinfer backend. + BugFix for kv_cache_dtype=auto (#7985)
Co-authored-by: Simon Mo <simon.mo@hey.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2024-08-29 14:53:11 -04:00
Alexander Matveev
3f60f2244e
[Core] Combine async postprocessor and multi-step (#7921) 2024-08-29 11:18:26 -07:00
Jonas M. Kübler
f205c09854
[Bugfix] Unify rank computation across regular decoding and speculative decoding (#7899) 2024-08-28 22:18:13 -07:00
youkaichao
ef99a78760
Revert "[Core][Kernels] Use FlashInfer backend for FP8 KV Cache when available." (#7982) 2024-08-28 21:27:06 -07:00
Peter Salas
74d5543ec5
[VLM][Core] Fix exceptions on ragged NestedTensors (#7974) 2024-08-29 03:24:31 +00:00
youkaichao
a7f65c2be9
[torch.compile] remove reset (#7975) 2024-08-28 17:32:26 -07:00
youkaichao
ce6bf3a2cf
[torch.compile] avoid Dynamo guard evaluation overhead (#7898)
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-08-28 16:10:12 -07:00
Mor Zusman
fdd9daafa3
[Kernel/Model] Migrate mamba_ssm and causal_conv1d kernels to vLLM (#7651) 2024-08-28 15:06:52 -07:00
rasmith
e5697d161c
[Kernel] [Triton] [AMD] Adding Triton implementations awq_dequantize and awq_gemm to support AWQ (#7386) 2024-08-28 15:37:47 -04:00
Pavani Majety
b98cc28f91
[Core][Kernels] Use FlashInfer backend for FP8 KV Cache when available. (#7798)
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-08-28 10:01:22 -07:00
Cody Yu
e3580537a4
[Performance] Enable chunked prefill and prefix caching together (#7753) 2024-08-28 00:36:31 -07:00
Cyrus Leung
51f86bf487
[mypy][CI/Build] Fix mypy errors (#7929) 2024-08-27 23:47:44 -07:00
Peter Salas
fab5f53e2d
[Core][VLM] Stack multimodal tensors to represent multiple images within each prompt (#7902) 2024-08-28 01:53:56 +00:00
zifeitong
5340a2dccf
[Model] Add multi-image input support for LLaVA-Next offline inference (#7230) 2024-08-28 07:09:02 +08:00
Dipika Sikka
fc911880cc
[Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel (#7766)
Co-authored-by: ElizaWszola <eliza@neuralmagic.com>
2024-08-27 15:07:09 -07:00
Isotr0py
9db642138b
[CI/Build][VLM] Cleanup multiple images inputs model test (#7897) 2024-08-27 15:28:30 +00:00
Patrick von Platen
6fc4e6e07a
[Model] Add Mistral Tokenization to improve robustness and chat encoding (#7739) 2024-08-27 12:40:02 +00:00
youkaichao
64cc644425
[core][torch.compile] discard the compile for profiling (#7796) 2024-08-26 21:33:58 -07:00
Nick Hill
39178c7fbc
[Tests] Disable retries and use context manager for openai client (#7565) 2024-08-26 21:33:17 -07:00
Megha Agarwal
2eedede875
[Core] Asynchronous Output Processor (#7049)
Co-authored-by: Alexander Matveev <alexm@neuralmagic.com>
2024-08-26 20:53:20 -07:00
Dipika Sikka
665304092d
[Misc] Update qqq to use vLLMParameters (#7805) 2024-08-26 13:16:15 -06:00
Cody Yu
2deb029d11
[Performance][BlockManagerV2] Mark prefix cache block as computed after schedule (#7822) 2024-08-26 11:24:53 -07:00
Cyrus Leung
029c71de11
[CI/Build] Avoid downloading all HF files in RemoteOpenAIServer (#7836) 2024-08-26 05:31:10 +00:00
ℍ𝕠𝕝𝕝𝕠𝕨 𝕄𝕒𝕟
0b769992ec
[Bugfix]: Use float32 for base64 embedding (#7855)
Signed-off-by: Hollow Man <hollowman@opensuse.org>
2024-08-26 03:16:38 +00:00
Nick Hill
1856aff4d6
[Spec Decoding] Streamline batch expansion tensor manipulation (#7851) 2024-08-25 15:45:14 -07:00
Isotr0py
2059b8d9ca
[Misc] Remove snapshot_download usage in InternVL2 test (#7835) 2024-08-25 15:53:09 +00:00
Isotr0py
8aaf3d5347
[Model][VLM] Support multi-images inputs for Phi-3-vision models (#7783) 2024-08-25 11:51:20 +00:00
zifeitong
80162c44b1
[Bugfix] Fix Phi-3v crash when input images are of certain sizes (#7840) 2024-08-24 18:16:24 -07:00
youkaichao
aab0fcdb63
[ci][test] fix RemoteOpenAIServer (#7838) 2024-08-24 17:31:28 +00:00
youkaichao
ea9fa160e3
[ci][test] exclude model download time in server start time (#7834) 2024-08-24 01:03:27 -07:00
youkaichao
7d9ffa2ae1
[misc][core] lazy import outlines (#7831) 2024-08-24 00:51:38 -07:00
Tyler Rockwood
d81abefd2e
[Frontend] add json_schema support from OpenAI protocol (#7654) 2024-08-23 23:07:24 -07:00
Pooya Davoodi
8da48e4d95
[Frontend] Publish Prometheus metrics in run_batch API (#7641) 2024-08-23 23:04:22 -07:00
Alexander Matveev
9db93de20c
[Core] Add multi-step support to LLMEngine (#7789) 2024-08-23 12:45:53 -07:00
Dipika Sikka
f1df5dbfd6
[Misc] Update marlin to use vLLMParameters (#7803) 2024-08-23 14:30:52 -04:00
Maximilien de Bayser
e25fee57c2
[BugFix] Fix server crash on empty prompt (#7746)
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
2024-08-23 13:12:44 +00:00
SangBin Cho
c01a6cb231
[Ray backend] Better error when pg topology is bad. (#7584)
Co-authored-by: youkaichao <youkaichao@126.com>
2024-08-22 17:44:25 -07:00
Joe Runde
b903e1ba7f
[Frontend] error suppression cleanup (#7786)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-08-22 21:50:21 +00:00
Travis Johnson
cc0eaf12b1
[Bugfix] spec decode handle None entries in topk args in create_sequence_group_output (#7232)
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
2024-08-22 09:33:48 -04:00
Dipika Sikka
955b5191c9
[Misc] update fp8 to use vLLMParameter (#7437) 2024-08-22 08:36:18 -04:00
Abhinav Goyal
a3fce56b88
[Speculative Decoding] EAGLE Implementation with Top-1 proposer (#6830) 2024-08-22 02:42:24 -07:00
Michael Goin
aae74ef95c
Revert "[Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel (#7527)" (#7764) 2024-08-22 03:42:14 +00:00
Joe Runde
cde9183b40
[Bug][Frontend] Improve ZMQ client robustness (#7443)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-08-22 02:18:11 +00:00
zifeitong
df1a21131d
[Model] Fix Phi-3.5-vision-instruct 'num_crops' issue (#7710) 2024-08-22 09:36:24 +08:00
Luka Govedič
7937009a7e
[Kernel] Replaced blockReduce[...] functions with cub::BlockReduce (#7233)
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-08-21 20:18:00 -04:00
Dipika Sikka
8678a69ab5
[Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel (#7527)
Co-authored-by: ElizaWszola <eliza@neuralmagic.com>
2024-08-21 16:17:10 -07:00
Peter Salas
1ca0d4f86b
[Model] Add UltravoxModel and UltravoxConfig (#7615) 2024-08-21 22:49:39 +00:00
Robert Shaw
970dfdc01d
[Frontend] Improve Startup Failure UX (#7716) 2024-08-21 19:53:01 +00:00
Robert Shaw
f7e3b0c5aa
[Bugfix][Frontend] Fix Issues Under High Load With zeromq Frontend (#7394)
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2024-08-21 13:34:14 -04:00
LI MOU
53328d7536
[BUG] fix crash on flashinfer backend with cudagraph disabled, when attention group_size not in [1,2,4,8] (#7509) 2024-08-21 08:54:31 -07:00
Nick Hill
c75363fbc0
[BugFix] Avoid premature async generator exit and raise all exception variations (#7698) 2024-08-21 11:45:55 -04:00
Cyrus Leung
baaedfdb2d
[mypy] Enable following imports for entrypoints (#7248)
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: Fei <dfdfcai4@gmail.com>
2024-08-20 23:28:21 -07:00
Isotr0py
12e1c65bc9
[Model] Add AWQ quantization support for InternVL2 model (#7187) 2024-08-20 23:18:57 -07:00
youkaichao
9e51b6a626
[ci][test] adjust max wait time for cpu offloading test (#7709) 2024-08-20 17:12:44 -07:00
Antoni Baum
3b682179dd
[Core] Add AttentionState abstraction (#7663) 2024-08-20 18:50:45 +00:00
Isotr0py
aae6927be0
[VLM][Model] Add test for InternViT vision encoder (#7409) 2024-08-20 23:10:20 +08:00
Lucas Wilkinson
5288c06aa0
[Kernel] (1/N) Machete - Hopper Optimized Mixed Precision Linear Kernel (#7174) 2024-08-20 07:09:33 -06:00
Abhinav Goyal
312f761232
[Speculative Decoding] Fixing hidden states handling in batch expansion (#7508) 2024-08-19 17:58:14 -07:00
Isotr0py
7601cb044d
[Core] Support tensor parallelism for GGUF quantization (#7520) 2024-08-19 17:30:14 -04:00
William Lin
47b65a5508
[core] Multi Step Scheduling (#7000)
Co-authored-by: afeldman-nm <156691304+afeldman-nm@users.noreply.github.com>
2024-08-19 13:52:13 -07:00
Cody Yu
3ac50b47d0
[MISC] Add prefix cache hit rate to metrics (#7606) 2024-08-19 11:52:07 -07:00
Peng Guanwen
f710fb5265
[Core] Use flashinfer sampling kernel when available (#7137)
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-08-19 03:24:03 +00:00
SangBin Cho
ff7ec82c4d
[Core] Optimize SPMD architecture with delta + serialization optimization (#7109) 2024-08-18 17:57:20 -07:00
Alex Brooks
40e1360bb6
[CI/Build] Add text-only test for Qwen models (#7475)
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
2024-08-19 07:43:46 +08:00
Robert Shaw
e3b318216d
[ Bugfix ] Fix Prometheus Metrics With zeromq Frontend (#7279)
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2024-08-18 20:19:48 +00:00
Roger Wang
bbf55c4805
[VLM] Refactor MultiModalConfig initialization and profiling (#7530) 2024-08-17 13:30:55 -07:00
youkaichao
832163b875
[ci][test] allow longer wait time for api server (#7629) 2024-08-17 11:26:38 -07:00
youkaichao
5bf45db7df
[ci][test] fix engine/logger test (#7621) 2024-08-16 23:00:59 -07:00
SangBin Cho
4706eb628e
[aDAG] Unflake aDAG + PP tests (#7600) 2024-08-16 20:49:30 -07:00
Mahesh Keralapura
93478b63d2
[Core] Fix tracking of model forward time in case of PP>1 (#7440)
[Core] Fix tracking of model forward time to the span traces in case of PP>1 (#7440)
2024-08-16 13:46:01 -07:00
Mor Zusman
7fc23be81c
[Kernel] W8A16 Int8 inside FusedMoE (#7415) 2024-08-16 10:06:51 -07:00
Charlie Fu
e837b624f2
[Feature][Hardware][Amd] Add fp8 Linear Layer for Rocm (#7210) 2024-08-16 10:06:30 -07:00
youkaichao
54bd9a03c4
register custom op for flash attn and use from torch.ops (#7536) 2024-08-15 22:38:56 -07:00
jon-chuang
50b8d08dbd
[Misc/Testing] Use torch.testing.assert_close (#7324) 2024-08-16 04:24:04 +00:00
Michael Goin
e165528778
[CI] Move quantization cpu offload tests out of fastcheck (#7574) 2024-08-15 21:16:20 -07:00
nunjunj
3b19e39dc5
Chat method for offline llm (#5049)
Co-authored-by: nunjunj <ray@g-3ff9f30f2ed650001.c.vllm-405802.internal>
Co-authored-by: nunjunj <ray@g-1df6075697c3f0001.c.vllm-405802.internal>
Co-authored-by: nunjunj <ray@g-c5a2c23abc49e0001.c.vllm-405802.internal>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-08-15 19:41:34 -07:00
youkaichao
4cd7d47fed
[ci/test] rearrange tests and make adag test soft fail (#7572) 2024-08-15 19:39:04 -07:00
Grant Pinkert
f878c8feb0
[Feature]: Add OpenAI server prompt_logprobs support #6508 (#7453) 2024-08-16 02:38:08 +00:00
shangmingc
b67ae00cdb
[Misc] Add quantization config support for speculative model. (#7343) 2024-08-15 19:34:28 -07:00
Kyle Sayers
f55a9aea45
[Misc] Revert compressed-tensors code reuse (#7521) 2024-08-14 15:07:37 -07:00
Cyrus Leung
3f674a49b5
[VLM][Core] Support profiling with multiple multi-modal inputs per prompt (#7126) 2024-08-14 17:55:42 +00:00
Wallas Henrique
70b746efcf
[Misc] Deprecation Warning when setting --engine-use-ray (#7424)
Signed-off-by: Wallas Santos <wallashss@ibm.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
Co-authored-by: youkaichao <youkaichao@126.com>
2024-08-14 09:44:27 -07:00
youkaichao
ea49e6a3c8
[misc][ci] fix cpu test with plugins (#7489) 2024-08-13 19:27:46 -07:00
Jee Jee Li
97992802f3
[CI/Build]Reduce the time consumption for LoRA tests (#7396) 2024-08-13 17:27:29 -07:00
youkaichao
16422ea76f
[misc][plugin] add plugin system implementation (#7426) 2024-08-13 16:24:17 -07:00
Kyle Sayers
373538f973
[Misc] compressed-tensors code reuse (#7277) 2024-08-13 19:05:15 -04:00
youkaichao
33e5d7e6b6
[frontend] spawn engine process from api server process (#7484) 2024-08-13 15:40:17 -07:00
Dipika Sikka
b1e5afc3e7
[Misc] Update awq and awq_marlin to use vLLMParameters (#7422) 2024-08-13 17:08:20 -04:00
Dipika Sikka
fb377d7e74
[Misc] Update gptq_marlin to use new vLLMParameters (#7281) 2024-08-13 14:30:11 -04:00
Peter Salas
00c3d68e45
[Frontend][Core] Add plumbing to support audio language models (#7446) 2024-08-13 17:39:33 +00:00
Cyrus Leung
7025b11d94
[Bugfix] Fix weight loading for Chameleon when TP>1 (#7410) 2024-08-13 05:33:41 +00:00
Andrew Wang
97a6be95ba
[Misc] improve logits processors logging message (#7435) 2024-08-13 02:29:34 +00:00
Cyrus Leung
9ba85bc152
[mypy] Misc. typing improvements (#7417) 2024-08-13 09:20:20 +08:00
Rui Qiao
198d6a2898
[Core] Shut down aDAG workers with clean async llm engine exit (#7224)
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
2024-08-12 17:57:16 -07:00
jon-chuang
a046f86397
[Core/Bugfix] Add FP8 K/V Scale and dtype conversion for prefix/prefill Triton Kernel (#7208)
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2024-08-12 22:47:41 +00:00
Roger Wang
e6e42e4b17
[Core][VLM] Support image embeddings as input (#6613) 2024-08-12 16:16:06 +08:00
Isotr0py
4c5d8e8ea9
[Bugfix] Fix phi3v batch inference when images have different aspect ratio (#7392) 2024-08-10 16:19:33 +00:00
Cade Daniel
baa240252e
[Core] Fix edge case in chunked prefill + block manager v2 (#7380) 2024-08-09 23:48:49 +00:00
Mahesh Keralapura
933790c209
[Core] Add span metrics for model_forward, scheduler and sampler time (#7089) 2024-08-09 13:55:13 -07:00
Pooya Davoodi
249b88228d
[Frontend] Support embeddings in the run_batch API (#7132)
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-08-09 09:48:21 -07:00
Nick Hill
b4e9528f95
[Core] Streamline stream termination in AsyncLLMEngine (#7336) 2024-08-09 07:06:36 +00:00
William Lin
57b7be0e1c
[Speculative decoding] [Multi-Step] decouple should_modify_greedy_probs_inplace (#6971) 2024-08-09 05:42:45 +00:00
Travis Johnson
99b4cf5f23
[Bugfix] Fix speculative decoding with MLPSpeculator with padded vocabulary (#7218)
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
2024-08-08 22:08:46 -07:00
Cyrus Leung
7eb4a51c5f
[Core] Support serving encoder/decoder models (#7258) 2024-08-09 10:39:41 +08:00
Zach Zheng
782e53ab59
[Bugfix][fast] Fix the get_num_blocks_touched logic (#6849) 2024-08-08 10:43:30 -07:00
Joe Runde
21b9c49aa3
[Frontend] Kill the server on engine death (#6594)
Signed-off-by: Joe Runde <joe@joerun.de>
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-08-08 09:47:48 -07:00
Luka Govedič
5fb4a3f678
[Bugfix][Kernel] Increased atol to fix failing tests (#7305) 2024-08-08 12:16:13 -04:00
Michael Goin
5223199e03
[Bugfix][FP8] Fix dynamic FP8 Marlin quantization (#7219) 2024-08-07 11:23:12 -07:00
Maximilien de Bayser
fde47d3bc2
[BugFix] Fix frontend multiprocessing hang (#7217)
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
2024-08-07 18:09:36 +00:00
Isotr0py
b764547616
[Bugfix] Fix input processor for InternVL2 model (#7164)
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-08-07 09:32:07 -07:00
Dipika Sikka
0f7052bc7e
[Misc] Refactor linear layer weight loading; introduce BasevLLMParameter and weight_loader_v2 (#5874) 2024-08-07 09:17:58 -07:00
Cyrus Leung
66d617e343
[Frontend] Gracefully handle missing chat template and fix CI failure (#7238)
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-08-07 09:12:05 +00:00
Nick Hill
9a3f49ae07
[BugFix] Overhaul async request cancellation (#7111) 2024-08-07 13:21:41 +08:00
Michael Goin
f9a5600649
[Bugfix] Fix GPTQ and GPTQ Marlin CPU Offloading (#7225) 2024-08-06 18:34:26 -07:00
afeldman-nm
fd95e026e0
[Core] Subclass ModelRunner to support cross-attention & encoder sequences (towards eventual encoder/decoder model support) (#4942)
Co-authored-by: Andrew Feldman <afeld2012@gmail.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2024-08-06 16:51:47 -04:00
Luka Govedič
8d59dbb000
[Kernel] Add per-tensor and per-token AZP epilogues (#5941)
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
2024-08-06 18:17:08 +00:00
Lily Liu
5c60c8c423
[SpecDecode] [Minor] Fix spec decode sampler tests (#7183) 2024-08-06 10:40:32 -07:00
Cyrus Leung
1f26efbb3a
[Model] Support SigLIP encoder and alternative decoders for LLaVA models (#7153)
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
2024-08-06 16:55:31 +08:00
Jee Jee Li
9118217f58
[LoRA] Relax LoRA condition (#7146) 2024-08-06 01:57:25 +00:00
Isotr0py
360bd67cf0
[Core] Support loading GGUF model (#5191)
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-08-05 17:54:23 -06:00
youkaichao
dfb1a15dcb
[ci][frontend] deduplicate tests (#7101) 2024-08-05 15:59:22 -07:00
Cade Daniel
82a1b1a82b
[Speculative decoding] Add periodic log with time spent in proposal/scoring/verification (#6963) 2024-08-05 08:46:44 +00:00
Alphi
7b86e7c9cd
[Model] Add multi-image support for minicpmv (#7122)
Co-authored-by: hezhihui <hzh7269@modelbest.cn>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-08-05 09:23:17 +08:00
Yihuan Bu
654bc5ca49
Support for guided decoding for offline LLM (#6878)
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-08-04 03:12:09 +00:00
youkaichao
44dcb52e39
[ci][test] finalize fork_new_process_for_each_test (#7114) 2024-08-03 10:44:53 -07:00
Jee Jee Li
99d7cabd7b
[LoRA] ReplicatedLinear support LoRA (#7081) 2024-08-02 22:40:19 -07:00
Zach Zheng
fb2c1c86c1
[Bugfix] Fix block table for seqs that have prefix cache hits (#7018) 2024-08-02 22:38:15 -07:00
youkaichao
a0d164567c
[ci][distributed] disable ray dag tests (#7099) 2024-08-02 22:32:04 -07:00
youkaichao
04e5583425
[ci][distributed] merge distributed test commands (#7097)
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-08-02 21:33:53 -07:00
youkaichao
69ea15e5cc
[ci][distributed] shorten wait time if server hangs (#7098) 2024-08-02 21:05:16 -07:00
Robert Shaw
ed812a73fa
[ Frontend ] Multiprocessing for OpenAI Server with zeromq (#6883)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
Co-authored-by: Joe Runde <Joseph.Runde@ibm.com>
Co-authored-by: Joe Runde <joe@joerun.de>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-08-02 18:27:28 -07:00
Rui Qiao
05308891e2
[Core] Pipeline parallel with Ray ADAG (#6837)
Support pipeline-parallelism with Ray accelerated DAG.
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
2024-08-02 13:55:40 -07:00
Lucas Wilkinson
a8d604ca2a
[Misc] Disambiguate quantized types via a new ScalarType (#6396) 2024-08-02 13:51:58 -07:00
youkaichao
806949514a
[ci] set timeout for test_oot_registration.py (#7082) 2024-08-02 10:03:24 -07:00
youkaichao
252357793d
[ci][distributed] try to fix pp test (#7054) 2024-08-01 22:03:12 -07:00
Woosuk Kwon
805a8a75f2
[Misc] Support attention logits soft-capping with flash-attn (#7022) 2024-08-01 13:14:37 -07:00
Michael Goin
fb3db61688
[CI/Build] Remove sparseml requirement from testing (#7037) 2024-08-01 12:00:51 -07:00
youkaichao
c8a7e93273
[core][scheduler] simplify and improve scheduler (#6867) 2024-07-31 23:51:09 -07:00
zifeitong
3c10591ef2
[Bugfix] Set SamplingParams.max_tokens for OpenAI requests if not provided by user (#6954) 2024-07-31 21:13:34 -07:00
Jee Jee Li
7ecee34321
[Kernel][RFC] Refactor the punica kernel based on Triton (#5036) 2024-07-31 17:12:24 -07:00
Michael Goin
460c1884e3
[Bugfix] Support cpu offloading with fp8 quantization (#6960) 2024-07-31 12:47:46 -07:00
Cody Yu
bd70013407
[MISC] Introduce pipeline parallelism partition strategies (#6920)
Co-authored-by: youkaichao <youkaichao@126.com>
2024-07-31 12:02:17 -07:00
Cyrus Leung
daed30c4a9
[Bugfix] Fix feature size calculation for LLaVA-NeXT (#6982) 2024-07-31 23:46:17 +08:00
HandH1998
6512937de1
Support W4A8 quantization for vllm (#5218) 2024-07-31 07:55:21 -06:00
Cyrus Leung
f230cc2ca6
[Bugfix] Fix broadcasting logic for multi_modal_kwargs (#6836) 2024-07-31 10:38:45 +08:00
Tyler Michael Smith
d7a299edaa
[Kernel] Remove scaled_fp8_quant kernel padding footgun (#6842) 2024-07-30 16:37:01 -04:00
Sanger Steel
052b6f8ca4
[Bugfix] Fix tensorizer memory profiling bug during testing (#6881) 2024-07-30 11:48:50 -07:00
Nick Hill
5cf9254a9c
[BugFix] Fix use of per-request seed with pipeline parallel (#6698) 2024-07-30 10:40:08 -07:00
Varun Sundar Rabindranath
af647fb8b3
[Kernel] Tuned int8 kernels for Ada Lovelace (#6848)
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-07-29 20:24:58 -06:00
Nick Hill
9f69d8245a
[Frontend] New allowed_token_ids decoding request parameter (#6753) 2024-07-29 23:37:27 +00:00
Thomas Parnell
9a7e2d0534
[Bugfix] Allow vllm to still work if triton is not installed. (#6786)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2024-07-29 14:51:27 -07:00
Peng Guanwen
db9e5708a9
[Core] Reduce unnecessary compute when logprobs=None (#6532) 2024-07-29 16:47:31 +00:00
Varun Sundar Rabindranath
766435e660
[Kernel] Tuned FP8 Kernels for Ada Lovelace (#6677)
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-07-29 09:42:35 -06:00
Isotr0py
7cbd9ec7a9
[Model] Initialize support for InternVL2 series models (#6514)
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-07-29 10:16:30 +00:00
Alexander Matveev
75acdaa4b6
[Kernel] Increase precision of GPTQ/AWQ Marlin kernel (#6795) 2024-07-27 17:52:33 -04:00
Cyrus Leung
1ad86acf17
[Model] Initial support for BLIP-2 (#5920)
Co-authored-by: ywang96 <ywang@roblox.com>
2024-07-27 11:53:07 +00:00
Joe
14dbd5a767
[Model] H2O Danube3-4b (#6451) 2024-07-26 20:47:50 -07:00
Sanger Steel
969d032265
[Bugfix]: Fix Tensorizer test failures (#6835) 2024-07-26 20:02:25 -07:00
youkaichao
443c7cf4cf
[ci][distributed] fix flaky tests (#6806) 2024-07-25 17:44:09 -07:00
Michael Goin
65b1f121c8
[Bugfix] Fix kv_cache_dtype=fp8 without scales for FP8 checkpoints (#6761) 2024-07-25 09:46:15 -07:00
Chang Su
316a41ac1d
[Bugfix] Fix encoding_format in examples/openai_embedding_client.py (#6755) 2024-07-24 22:48:07 -07:00
Cody Yu
309aaef825
[Bugfix] Fix decode tokens w. CUDA graph (#6757) 2024-07-24 22:33:56 -07:00