zifeitong
df1a21131d
[Model] Fix Phi-3.5-vision-instruct 'num_crops' issue ( #7710 )
2024-08-22 09:36:24 +08:00
Luka Govedič
7937009a7e
[Kernel] Replaced blockReduce[...] functions with cub::BlockReduce ( #7233 )
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-08-21 20:18:00 -04:00
Dipika Sikka
8678a69ab5
[Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel ( #7527 )
Co-authored-by: ElizaWszola <eliza@neuralmagic.com>
2024-08-21 16:17:10 -07:00
Peter Salas
1ca0d4f86b
[Model] Add UltravoxModel and UltravoxConfig ( #7615 )
2024-08-21 22:49:39 +00:00
Robert Shaw
970dfdc01d
[Frontend] Improve Startup Failure UX ( #7716 )
2024-08-21 19:53:01 +00:00
Robert Shaw
f7e3b0c5aa
[Bugfix][Frontend] Fix Issues Under High Load With zeromq Frontend ( #7394 )
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2024-08-21 13:34:14 -04:00
LI MOU
53328d7536
[BUG] fix crash on flashinfer backend with cudagraph disabled, when attention group_size not in [1,2,4,8] ( #7509 )
2024-08-21 08:54:31 -07:00
Nick Hill
c75363fbc0
[BugFix] Avoid premature async generator exit and raise all exception variations ( #7698 )
2024-08-21 11:45:55 -04:00
Cyrus Leung
baaedfdb2d
[mypy] Enable following imports for entrypoints ( #7248 )
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: Fei <dfdfcai4@gmail.com>
2024-08-20 23:28:21 -07:00
Isotr0py
12e1c65bc9
[Model] Add AWQ quantization support for InternVL2 model ( #7187 )
2024-08-20 23:18:57 -07:00
youkaichao
9e51b6a626
[ci][test] adjust max wait time for cpu offloading test ( #7709 )
2024-08-20 17:12:44 -07:00
Antoni Baum
3b682179dd
[Core] Add AttentionState abstraction ( #7663 )
2024-08-20 18:50:45 +00:00
Isotr0py
aae6927be0
[VLM][Model] Add test for InternViT vision encoder ( #7409 )
2024-08-20 23:10:20 +08:00
Lucas Wilkinson
5288c06aa0
[Kernel] (1/N) Machete - Hopper Optimized Mixed Precision Linear Kernel ( #7174 )
2024-08-20 07:09:33 -06:00
Abhinav Goyal
312f761232
[Speculative Decoding] Fixing hidden states handling in batch expansion ( #7508 )
2024-08-19 17:58:14 -07:00
Isotr0py
7601cb044d
[Core] Support tensor parallelism for GGUF quantization ( #7520 )
2024-08-19 17:30:14 -04:00
William Lin
47b65a5508
[core] Multi Step Scheduling ( #7000 )
Co-authored-by: afeldman-nm <156691304+afeldman-nm@users.noreply.github.com>
2024-08-19 13:52:13 -07:00
Cody Yu
3ac50b47d0
[MISC] Add prefix cache hit rate to metrics ( #7606 )
2024-08-19 11:52:07 -07:00
Peng Guanwen
f710fb5265
[Core] Use flashinfer sampling kernel when available ( #7137 )
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-08-19 03:24:03 +00:00
SangBin Cho
ff7ec82c4d
[Core] Optimize SPMD architecture with delta + serialization optimization ( #7109 )
2024-08-18 17:57:20 -07:00
Alex Brooks
40e1360bb6
[CI/Build] Add text-only test for Qwen models ( #7475 )
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
2024-08-19 07:43:46 +08:00
Robert Shaw
e3b318216d
[ Bugfix ] Fix Prometheus Metrics With zeromq Frontend ( #7279 )
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2024-08-18 20:19:48 +00:00
Roger Wang
bbf55c4805
[VLM] Refactor MultiModalConfig initialization and profiling ( #7530 )
2024-08-17 13:30:55 -07:00
youkaichao
832163b875
[ci][test] allow longer wait time for api server ( #7629 )
2024-08-17 11:26:38 -07:00
youkaichao
5bf45db7df
[ci][test] fix engine/logger test ( #7621 )
2024-08-16 23:00:59 -07:00
SangBin Cho
4706eb628e
[aDAG] Unflake aDAG + PP tests ( #7600 )
2024-08-16 20:49:30 -07:00
Mahesh Keralapura
93478b63d2
[Core] Fix tracking of model forward time in case of PP>1 ( #7440 )
[Core] Fix tracking of model forward time to the span traces in case of PP>1 (#7440 )
2024-08-16 13:46:01 -07:00
Mor Zusman
7fc23be81c
[Kernel] W8A16 Int8 inside FusedMoE ( #7415 )
2024-08-16 10:06:51 -07:00
Charlie Fu
e837b624f2
[Feature][Hardware][Amd] Add fp8 Linear Layer for Rocm ( #7210 )
2024-08-16 10:06:30 -07:00
youkaichao
54bd9a03c4
register custom op for flash attn and use from torch.ops ( #7536 )
2024-08-15 22:38:56 -07:00
jon-chuang
50b8d08dbd
[Misc/Testing] Use torch.testing.assert_close ( #7324 )
2024-08-16 04:24:04 +00:00
Michael Goin
e165528778
[CI] Move quantization cpu offload tests out of fastcheck ( #7574 )
2024-08-15 21:16:20 -07:00
nunjunj
3b19e39dc5
Chat method for offline llm ( #5049 )
Co-authored-by: nunjunj <ray@g-3ff9f30f2ed650001.c.vllm-405802.internal>
Co-authored-by: nunjunj <ray@g-1df6075697c3f0001.c.vllm-405802.internal>
Co-authored-by: nunjunj <ray@g-c5a2c23abc49e0001.c.vllm-405802.internal>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-08-15 19:41:34 -07:00
youkaichao
4cd7d47fed
[ci/test] rearrange tests and make adag test soft fail ( #7572 )
2024-08-15 19:39:04 -07:00
Grant Pinkert
f878c8feb0
[Feature]: Add OpenAI server prompt_logprobs support #6508 ( #7453 )
2024-08-16 02:38:08 +00:00
shangmingc
b67ae00cdb
[Misc] Add quantization config support for speculative model. ( #7343 )
2024-08-15 19:34:28 -07:00
Kyle Sayers
f55a9aea45
[Misc] Revert compressed-tensors code reuse ( #7521 )
2024-08-14 15:07:37 -07:00
Cyrus Leung
3f674a49b5
[VLM][Core] Support profiling with multiple multi-modal inputs per prompt ( #7126 )
2024-08-14 17:55:42 +00:00
Wallas Henrique
70b746efcf
[Misc] Deprecation Warning when setting --engine-use-ray ( #7424 )
Signed-off-by: Wallas Santos <wallashss@ibm.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
Co-authored-by: youkaichao <youkaichao@126.com>
2024-08-14 09:44:27 -07:00
youkaichao
ea49e6a3c8
[misc][ci] fix cpu test with plugins ( #7489 )
2024-08-13 19:27:46 -07:00
Jee Jee Li
97992802f3
[CI/Build]Reduce the time consumption for LoRA tests ( #7396 )
2024-08-13 17:27:29 -07:00
youkaichao
16422ea76f
[misc][plugin] add plugin system implementation ( #7426 )
2024-08-13 16:24:17 -07:00
Kyle Sayers
373538f973
[Misc] compressed-tensors code reuse ( #7277 )
2024-08-13 19:05:15 -04:00
youkaichao
33e5d7e6b6
[frontend] spawn engine process from api server process ( #7484 )
2024-08-13 15:40:17 -07:00
Dipika Sikka
b1e5afc3e7
[Misc] Update awq and awq_marlin to use vLLMParameters ( #7422 )
2024-08-13 17:08:20 -04:00
Dipika Sikka
fb377d7e74
[Misc] Update gptq_marlin to use new vLLMParameters ( #7281 )
2024-08-13 14:30:11 -04:00
Peter Salas
00c3d68e45
[Frontend][Core] Add plumbing to support audio language models ( #7446 )
2024-08-13 17:39:33 +00:00
Cyrus Leung
7025b11d94
[Bugfix] Fix weight loading for Chameleon when TP>1 ( #7410 )
2024-08-13 05:33:41 +00:00
Andrew Wang
97a6be95ba
[Misc] improve logits processors logging message ( #7435 )
2024-08-13 02:29:34 +00:00
Cyrus Leung
9ba85bc152
[mypy] Misc. typing improvements ( #7417 )
2024-08-13 09:20:20 +08:00
Rui Qiao
198d6a2898
[Core] Shut down aDAG workers with clean async llm engine exit ( #7224 )
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
2024-08-12 17:57:16 -07:00
jon-chuang
a046f86397
[Core/Bugfix] Add FP8 K/V Scale and dtype conversion for prefix/prefill Triton Kernel ( #7208 )
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2024-08-12 22:47:41 +00:00
Roger Wang
e6e42e4b17
[Core][VLM] Support image embeddings as input ( #6613 )
2024-08-12 16:16:06 +08:00
Isotr0py
4c5d8e8ea9
[Bugfix] Fix phi3v batch inference when images have different aspect ratio ( #7392 )
2024-08-10 16:19:33 +00:00
Cade Daniel
baa240252e
[Core] Fix edge case in chunked prefill + block manager v2 ( #7380 )
2024-08-09 23:48:49 +00:00
Mahesh Keralapura
933790c209
[Core] Add span metrics for model_forward, scheduler and sampler time ( #7089 )
2024-08-09 13:55:13 -07:00
Pooya Davoodi
249b88228d
[Frontend] Support embeddings in the run_batch API ( #7132 )
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-08-09 09:48:21 -07:00
Nick Hill
b4e9528f95
[Core] Streamline stream termination in AsyncLLMEngine ( #7336 )
2024-08-09 07:06:36 +00:00
William Lin
57b7be0e1c
[Speculative decoding] [Multi-Step] decouple should_modify_greedy_probs_inplace ( #6971 )
2024-08-09 05:42:45 +00:00
Travis Johnson
99b4cf5f23
[Bugfix] Fix speculative decoding with MLPSpeculator with padded vocabulary ( #7218 )
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
2024-08-08 22:08:46 -07:00
Cyrus Leung
7eb4a51c5f
[Core] Support serving encoder/decoder models ( #7258 )
2024-08-09 10:39:41 +08:00
Zach Zheng
782e53ab59
[Bugfix][fast] Fix the get_num_blocks_touched logic ( #6849 )
2024-08-08 10:43:30 -07:00
Joe Runde
21b9c49aa3
[Frontend] Kill the server on engine death ( #6594 )
Signed-off-by: Joe Runde <joe@joerun.de>
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-08-08 09:47:48 -07:00
Luka Govedič
5fb4a3f678
[Bugfix][Kernel] Increased atol to fix failing tests ( #7305 )
2024-08-08 12:16:13 -04:00
Michael Goin
5223199e03
[Bugfix][FP8] Fix dynamic FP8 Marlin quantization ( #7219 )
2024-08-07 11:23:12 -07:00
Maximilien de Bayser
fde47d3bc2
[BugFix] Fix frontend multiprocessing hang ( #7217 )
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
2024-08-07 18:09:36 +00:00
Isotr0py
b764547616
[Bugfix] Fix input processor for InternVL2 model ( #7164 )
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-08-07 09:32:07 -07:00
Dipika Sikka
0f7052bc7e
[Misc] Refactor linear layer weight loading; introduce BasevLLMParameter and weight_loader_v2 ( #5874 )
2024-08-07 09:17:58 -07:00
Cyrus Leung
66d617e343
[Frontend] Gracefully handle missing chat template and fix CI failure ( #7238 )
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-08-07 09:12:05 +00:00
Nick Hill
9a3f49ae07
[BugFix] Overhaul async request cancellation ( #7111 )
2024-08-07 13:21:41 +08:00
Michael Goin
f9a5600649
[Bugfix] Fix GPTQ and GPTQ Marlin CPU Offloading ( #7225 )
2024-08-06 18:34:26 -07:00
afeldman-nm
fd95e026e0
[Core] Subclass ModelRunner to support cross-attention & encoder sequences (towards eventual encoder/decoder model support) ( #4942 )
Co-authored-by: Andrew Feldman <afeld2012@gmail.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2024-08-06 16:51:47 -04:00
Luka Govedič
8d59dbb000
[Kernel] Add per-tensor and per-token AZP epilogues ( #5941 )
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
2024-08-06 18:17:08 +00:00
Lily Liu
5c60c8c423
[SpecDecode] [Minor] Fix spec decode sampler tests ( #7183 )
2024-08-06 10:40:32 -07:00
Cyrus Leung
1f26efbb3a
[Model] Support SigLIP encoder and alternative decoders for LLaVA models ( #7153 )
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
2024-08-06 16:55:31 +08:00
Jee Jee Li
9118217f58
[LoRA] Relax LoRA condition ( #7146 )
2024-08-06 01:57:25 +00:00
Isotr0py
360bd67cf0
[Core] Support loading GGUF model ( #5191 )
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-08-05 17:54:23 -06:00
youkaichao
dfb1a15dcb
[ci][frontend] deduplicate tests ( #7101 )
2024-08-05 15:59:22 -07:00
Cade Daniel
82a1b1a82b
[Speculative decoding] Add periodic log with time spent in proposal/scoring/verification ( #6963 )
2024-08-05 08:46:44 +00:00
Alphi
7b86e7c9cd
[Model] Add multi-image support for minicpmv ( #7122 )
Co-authored-by: hezhihui <hzh7269@modelbest.cn>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-08-05 09:23:17 +08:00
Yihuan Bu
654bc5ca49
Support for guided decoding for offline LLM ( #6878 )
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-08-04 03:12:09 +00:00
youkaichao
44dcb52e39
[ci][test] finalize fork_new_process_for_each_test ( #7114 )
2024-08-03 10:44:53 -07:00
Jee Jee Li
99d7cabd7b
[LoRA] ReplicatedLinear support LoRA ( #7081 )
2024-08-02 22:40:19 -07:00
Zach Zheng
fb2c1c86c1
[Bugfix] Fix block table for seqs that have prefix cache hits ( #7018 )
2024-08-02 22:38:15 -07:00
youkaichao
a0d164567c
[ci][distributed] disable ray dag tests ( #7099 )
2024-08-02 22:32:04 -07:00
youkaichao
04e5583425
[ci][distributed] merge distributed test commands ( #7097 )
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-08-02 21:33:53 -07:00
youkaichao
69ea15e5cc
[ci][distributed] shorten wait time if server hangs ( #7098 )
2024-08-02 21:05:16 -07:00
Robert Shaw
ed812a73fa
[ Frontend ] Multiprocessing for OpenAI Server with zeromq ( #6883 )
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
Co-authored-by: Joe Runde <Joseph.Runde@ibm.com>
Co-authored-by: Joe Runde <joe@joerun.de>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-08-02 18:27:28 -07:00
Rui Qiao
05308891e2
[Core] Pipeline parallel with Ray ADAG ( #6837 )
Support pipeline-parallelism with Ray accelerated DAG.
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
2024-08-02 13:55:40 -07:00
Lucas Wilkinson
a8d604ca2a
[Misc] Disambiguate quantized types via a new ScalarType ( #6396 )
2024-08-02 13:51:58 -07:00
youkaichao
806949514a
[ci] set timeout for test_oot_registration.py ( #7082 )
2024-08-02 10:03:24 -07:00
youkaichao
252357793d
[ci][distributed] try to fix pp test ( #7054 )
2024-08-01 22:03:12 -07:00
Woosuk Kwon
805a8a75f2
[Misc] Support attention logits soft-capping with flash-attn ( #7022 )
2024-08-01 13:14:37 -07:00
Michael Goin
fb3db61688
[CI/Build] Remove sparseml requirement from testing ( #7037 )
2024-08-01 12:00:51 -07:00
youkaichao
c8a7e93273
[core][scheduler] simplify and improve scheduler ( #6867 )
2024-07-31 23:51:09 -07:00
zifeitong
3c10591ef2
[Bugfix] Set SamplingParams.max_tokens for OpenAI requests if not provided by user ( #6954 )
2024-07-31 21:13:34 -07:00
Jee Jee Li
7ecee34321
[Kernel][RFC] Refactor the punica kernel based on Triton ( #5036 )
2024-07-31 17:12:24 -07:00
Michael Goin
460c1884e3
[Bugfix] Support cpu offloading with fp8 quantization ( #6960 )
2024-07-31 12:47:46 -07:00
Cody Yu
bd70013407
[MISC] Introduce pipeline parallelism partition strategies ( #6920 )
Co-authored-by: youkaichao <youkaichao@126.com>
2024-07-31 12:02:17 -07:00
Cyrus Leung
daed30c4a9
[Bugfix] Fix feature size calculation for LLaVA-NeXT ( #6982 )
2024-07-31 23:46:17 +08:00