Breno Faria
87d41c849d
[BUGFIX] [FRONTEND] Correct chat logprobs ( #5029 )
...
Co-authored-by: Breno Faria <breno.faria@intrafind.com>
2024-05-30 02:52:14 -07:00
Cyrus Leung
b1c255630d
[Core] Avoid the need to pass None values to Sequence.inputs ( #5099 )
2024-05-29 16:05:01 -07:00
Cyrus Leung
eecd864388
[Bugfix][CI/Build] Fix test and improve code for merge_async_iterators ( #5096 )
2024-05-29 16:02:25 -07:00
afeldman-nm
4238bc82f2
[Core] Cross-attention KV caching and memory-management (towards eventual encoder/decoder model support) ( #4837 )
2024-05-29 16:09:13 +00:00
Cyrus Leung
18c1f16d86
[Bugfix] Fix arguments passed to Sequence in stop checker test ( #5092 )
2024-05-29 07:16:41 +00:00
youkaichao
5bd3c65072
[Core][Optimization] remove vllm-nccl ( #5091 )
2024-05-29 05:13:52 +00:00
Junichi Sato
dfba529b40
[Bugfix] Remove the last EOS token unless explicitly specified ( #5077 )
2024-05-28 17:15:35 -07:00
Cyrus Leung
5ae5ed1e60
[Core] Consolidate prompt arguments to LLM engines ( #4328 )
...
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-05-28 13:29:31 -07:00
Michał Moskal
d4f3985907
[Core] Sliding window for block manager v2 ( #4545 )
...
Co-authored-by: Ruth Evans <ruthevans@Ruths-MacBook-Pro.local>
2024-05-28 11:07:07 +09:00
Zhuohan Li
1102bef219
[Bugfix / Core] Prefix Caching Guards (merged with main) ( #4846 )
...
Co-authored-by: rsnm2 <rshaw@neuralmagic.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
2024-05-27 15:18:17 -07:00
Lily Liu
d5a1697772
[Dynamic Spec Decoding] Minor fix for disabling speculative decoding ( #5000 )
2024-05-25 10:00:14 -07:00
Eric Xihui Lin
8e192ff967
[Kernel][Backend][Model] Blocksparse flash attention kernel and Phi-3-Small model ( #4799 )
...
Co-authored-by: beagleski <yunanzhang@microsoft.com>
Co-authored-by: bapatra <bapatra@microsoft.com>
Co-authored-by: Barun Patra <codedecde@users.noreply.github.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-05-24 22:00:52 -07:00
leiwen83
e64fde4b01
[Core][Bugfix]: fix prefix caching for blockv2 ( #4764 )
...
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
2024-05-24 10:07:09 -07:00
Robert Shaw
919770957f
[Bugfix] Fix Mistral v0.3 Weight Loading ( #5005 )
...
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2024-05-24 12:28:27 +00:00
Dipika Sikka
a1242324c9
[Kernel] Initial Activation Quantization Support ( #4525 )
...
Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-05-23 21:29:18 +00:00
Murali Andoorveedu
5eda2ea02a
[Core][1/N] Support send/recv in PyNCCL Groups ( #4988 )
...
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
2024-05-23 09:54:48 -07:00
Alexander Matveev
6066253296
Marlin 24 prefill performance improvement (about 25% better on average) ( #4983 )
2024-05-23 02:39:27 -04:00
Cody Yu
ee3eea0a1b
[Misc] Take user preference in attention selector ( #4960 )
2024-05-23 07:55:56 +09:00
raywanb
97b030005c
[Model] LoRA gptbigcode implementation ( #3949 )
2024-05-22 13:58:59 -07:00
Cody Yu
a3a73ab069
[Misc] Load FP8 kv-cache scaling factors from checkpoints ( #4893 )
...
The 2nd PR for #4532 .
This PR supports loading FP8 kv-cache scaling factors from a FP8 checkpoint (with .kv_scale parameter).
2024-05-22 13:28:20 -07:00
Tyler Michael Smith
8674f9880e
[Kernel] Fixup for CUTLASS kernels in CUDA graphs ( #4954 )
...
Pass the CUDA stream into the CUTLASS GEMMs, to avoid future issues with CUDA graphs
2024-05-22 14:10:43 +00:00
SangBin Cho
c74c913bfb
[misc] remove comments that were supposed to be removed ( #4977 )
2024-05-22 09:02:58 -04:00
sasha0552
9b9a10d6cb
[Frontend] Dynamic RoPE scaling ( #4638 )
2024-05-22 01:32:35 -04:00
Isotr0py
f12c3b5b3d
[Model] Add Phi-2 LoRA support ( #4886 )
2024-05-21 14:24:17 +09:00
Alexei-V-Ivanov-AMD
943e72ca56
[Build/CI] Enabling AMD Entrypoints Test ( #4834 )
...
Co-authored-by: Alexey Kondratiev <alexey.kondratiev@amd.com>
2024-05-20 11:29:28 -07:00
Woosuk Kwon
b57e6c5949
[Kernel] Add flash-attn back ( #4907 )
2024-05-19 18:11:30 -07:00
Alexander Matveev
27ce85476e
[Kernel] Add marlin_24 unit tests ( #4901 )
2024-05-19 11:37:34 -04:00
Cyrus Leung
f68470e803
[Bugfix][Model] Add base class for vision-language models ( #4809 )
2024-05-19 00:13:33 -07:00
SangBin Cho
2e9a2227ec
[Lora] Support long context lora ( #4787 )
...
Currently we need to call rotary embedding kernel for each LoRA, which makes it hard to serve multiple long context length LoRA. Add batched rotary embedding kernel and pipe it through.
It replaces the rotary embedding layer to the one that is aware of multiple cos-sin-cache per scaling factors.
Follow up of https://github.com/vllm-project/vllm/pull/3095/files
2024-05-18 16:05:23 +09:00
Jinzhen Lin
33e0823de5
[Bugfix] fix rope error when load models with different dtypes ( #4835 )
2024-05-17 18:43:34 +09:00
Alexei-V-Ivanov-AMD
26148120b3
[Build/CI] Extending the set of AMD tests with Regression, Basic Correctness, Distributed, Engine, Llava Tests ( #4797 )
2024-05-16 20:58:25 -07:00
Tyler Michael Smith
2060e93659
[Kernel] Add w8a8 CUTLASS kernels ( #4749 )
2024-05-16 18:32:50 -04:00
Silencio
8435b207af
[Kernel] Add punica dimension for Qwen1.5-32B LoRA ( #4850 )
...
Co-authored-by: Silencio <silencio@adsl-99-6-187-6.dsl.irvnca.sbcglobal.net>
2024-05-16 11:16:09 -07:00
youkaichao
e08188081b
[Core][Distributed] remove graph mode function ( #4818 )
2024-05-16 10:59:52 -07:00
Alexander Matveev
6979ade384
Add GPTQ Marlin 2:4 sparse structured support ( #4790 )
...
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
2024-05-16 12:56:15 -04:00
Jinzhen Lin
99caa49106
[Kernel] add bfloat16 support for gptq marlin kernel ( #4788 )
2024-05-16 09:55:29 -04:00
alexm-nm
5c342570d7
Add marlin unit tests and marlin benchmark script ( #4815 )
2024-05-16 09:36:49 -04:00
Cody Yu
973617ae02
[Speculative decoding][Re-take] Enable TP>1 speculative decoding ( #4840 )
...
Co-authored-by: Cade Daniel <edacih@gmail.com>
Co-authored-by: Cade Daniel <cade@anyscale.com>
2024-05-16 00:53:51 -07:00
Aurick Qiao
30e754390c
[Core] Implement sharded state loader ( #4690 )
...
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-05-15 22:11:54 -07:00
Alex Wu
52f8107cf2
[Frontend] Support OpenAI batch file format ( #4794 )
...
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
2024-05-15 19:13:36 -04:00
Cyrus Leung
fc0d9dfc3a
[Frontend] Re-enable custom roles in Chat Completions API ( #4758 )
2024-05-15 14:58:46 -07:00
Cyrus Leung
e9cdd2b1e2
[CI/Build] Further decouple HuggingFace implementation from ours during tests ( #4166 )
2024-05-14 23:38:40 -07:00
SangBin Cho
65bf2ac165
[Core][2/N] Model runner refactoring part 2. Combine prepare prefill / decode to a single API ( #4681 )
...
This PR combines prepare_prompt and prepare_decode into a single API. This PR also coelsce the attn metadata for prefill/decode to a single class and allow to slice them when running attn backend.
It also refactors subquery_start_loc which was not refactored in the previous PR
2024-05-15 14:00:10 +09:00
SangBin Cho
8a7cc254a0
Revert "[Kernel] Use flash-attn for decoding ( #3648 )" ( #4820 )
...
Lora 3 & 4 test seems to have illegal memory access failure after this commit;
[2024-05-14 23:51:18,182 E 22 22] logging.cc:101: Unhandled exception: N3c105ErrorE. what(): CUDA error: an illegal memory access was encountered
<br class="Apple-interchange-newline">
Exmaple: https://buildkite.com/vllm/ci/builds/7382#018f793d-1527-4e1c-ab59-c3a34ec55241
This reverts commit 1356df5 .
FILL IN THE PR DESCRIPTION HERE
FIX #xxxx (link existing issues this PR will resolve)
2024-05-15 11:52:45 +09:00
Nick Hill
676a99982f
[Core] Add MultiprocessingGPUExecutor ( #4539 )
...
Co-authored-by: SAHIL SUNEJA <suneja@us.ibm.com>
2024-05-14 10:38:59 -07:00
Stephen Krider
1356df53bd
[Kernel] Use flash-attn for decoding ( #3648 )
...
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: LiuXiaoxuanPKU <lilyliupku@gmail.com>
2024-05-13 15:50:33 -07:00
Cody Yu
ce532ff45c
[Speculative decoding] Improve n-gram efficiency ( #4724 )
2024-05-13 15:00:13 -07:00
Sanger Steel
8bc68e198c
[Frontend] [Core] perf: Automatically detect vLLM-tensorized model, update tensorizer to version 2.9.0 ( #4208 )
2024-05-13 14:57:07 -07:00
Woosuk Kwon
0fca3cdcf2
[Misc] Enhance attention selector ( #4751 )
2024-05-13 10:47:25 -07:00
SangBin Cho
e7c46b9527
[Scheduler] Warning upon preemption and Swapping ( #4647 )
...
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
2024-05-13 23:50:44 +09:00
Cyrus Leung
350f9e107f
[CI/Build] Move test_utils.py to tests/utils.py ( #4425 )
...
Since #4335 was merged, I've noticed that the definition of ServerRunner in the tests is the same as in the test for OpenAI API. I have moved the class to the test utilities to avoid code duplication. (Although it only has been repeated twice so far, I will add another similar test suite in #4200 which would duplicate the code a third time)
Also, I have moved the test utilities file (test_utils.py) to under the test directory (tests/utils.py), since none of its code is actually used in the main package. Note that I have added __init__.py to each test subpackage and updated the ray.init() call in the test utilities file in order to relative import tests/utils.py.
2024-05-13 23:50:09 +09:00
youkaichao
702bee461f
[Core][Distributed] refactor custom allreduce to support multiple tp groups ( #4754 )
2024-05-12 17:47:59 -07:00
Robert Shaw
a709e87a4f
[CI/Build] Tweak Marlin Nondeterminism Issues ( #4713 )
2024-05-12 17:46:31 -07:00
Chang Su
e254497b66
[Model][Misc] Add e5-mistral-7b-instruct and Embedding API ( #3734 )
2024-05-11 11:30:37 -07:00
youkaichao
4e12131089
[Core][Test] fix function name typo in custom allreduce ( #4750 )
2024-05-10 15:14:40 -07:00
Robert Shaw
fcc2994be6
[CI] Nits for bad initialization of SeqGroup in testing ( #4748 )
2024-05-10 18:01:01 -04:00
heeju-kim2
2e7796f2cf
[Speculative decoding] CUDA graph support ( #4295 )
...
Co-authored-by: Cade Daniel <edacih@gmail.com>
2024-05-10 17:36:25 +00:00
SangBin Cho
6a0f617210
[Core] Fix circular reference which leaked llm instance in local dev env ( #4737 )
...
Storing exception frame is extremely prone to circular refernece because it contains the reference to objects.
When tensorizer is not installed, it leaks llm instance because error frame has references to various modules which cause circular reference problem.
I also found spec decoding has a circular reference issue, and I solved it using weakref.proxy.
2024-05-10 23:54:32 +09:00
Allen.Dou
e965d46184
[Misc] Keep only one implementation of the create_dummy_prompt function. ( #4716 )
2024-05-09 21:42:38 -07:00
youkaichao
208b71bcc1
[Core][Distributed] refactor pynccl ( #4591 )
...
[Core][Distributed] refactor pynccl to hold multiple communicators (#4591 )
2024-05-09 19:48:43 -07:00
Cody Yu
c833101740
[Kernel] Refactor FP8 kv-cache with NVIDIA float8_e4m3 support ( #4535 )
2024-05-09 18:04:17 -06:00
Woosuk Kwon
0ee535b294
[Misc] Set block size at initialization & Fix test_model_runner ( #4705 )
2024-05-09 09:04:59 -07:00
Woosuk Kwon
190bc838e1
[Misc] Remove unnecessary ModelRunner imports ( #4703 )
2024-05-09 00:17:17 -07:00
Cyrus Leung
f12b20decc
[Frontend] Move async logic outside of constructor ( #4674 )
2024-05-08 22:48:33 -07:00
Cody Yu
f942efb5a3
[Dynamic Spec Decoding] Auto-disable by the running queue size ( #4592 )
...
Co-authored-by: Cade Daniel <edacih@gmail.com>
2024-05-08 21:44:00 +00:00
youkaichao
230c4b38c1
[CI/Test] fix swap test for multi gpu ( #4689 )
2024-05-08 13:14:02 -07:00
youkaichao
20cfcdec99
[Core][Optimization] change python dict to pytorch tensor for blocks to swap ( #4659 )
2024-05-08 12:07:05 -07:00
DefTruth
0f9a6e3d22
[Bugfix][Kernel] allow non-power-of-2 for prefix prefill with alibi ( #4573 )
2024-05-08 09:19:58 -07:00
SangBin Cho
f6a593093a
[CI] Make mistral tests pass ( #4596 )
2024-05-08 08:44:35 -07:00
youkaichao
cc466a3290
[Core][Distributed] support cpu&device in broadcast tensor dict ( #4660 )
...
[Core][Distributed] support both cpu and device tensor in broadcast tensor dict (#4660 )
2024-05-07 19:34:47 -07:00
leiwen83
8344f7742b
[Bug fix][Core] fixup ngram not setup correctly ( #4551 )
...
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
Co-authored-by: Cade Daniel <edacih@gmail.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2024-05-07 11:40:18 -07:00
youkaichao
469f85c782
[Core][Optimization] change copy-on-write from dict[int, list] to list ( #4648 )
2024-05-07 11:06:32 -07:00
youkaichao
63575bc2e1
[Core][Optimization] change python dict to pytorch tensor ( #4607 )
2024-05-06 21:30:27 -07:00
DearPlanet
4302987069
[Bugfix] Fix inappropriate content of model_name tag in Prometheus metrics ( #3937 )
2024-05-04 15:39:34 -07:00
Michael Goin
2a052011ca
[Kernel] Support MoE Fp8 Checkpoints for Mixtral (Static Weights with Dynamic/Static Activations) ( #4527 )
...
Follow on to #4332 to enable FP8 checkpoint loading for Mixtral and supersedes #4436 .
This PR enables the following checkpoint loading features for Mixtral:
Supports loading fp8 checkpoints for Mixtral, such as this "nm-testing/Mixtral-8x7B-Instruct-v0.1-FP8" test model
Supports static or dynamic activation quantization with static weight quantization (all per tensor)
Supports different scales for each expert weight
Supports Fp8 in QKV layer
Notes:
The Expert Gate/Router always runs at half / full precision for now.
If there are different weight scales between QKV layer (for separate QKV weights), they are re-quantized using layer.weight_scale.max() so we can have a single gemm for performance.
2024-05-04 11:45:16 -07:00
Cody Yu
bc8ad68455
[Misc][Refactor] Introduce ExecuteModelData ( #4540 )
2024-05-03 17:47:07 -07:00
Cade Daniel
ab50275111
[Speculative decoding] Support target-model logprobs ( #4378 )
2024-05-03 15:52:01 -07:00
Lily Liu
43c413ec57
[Kernel] Use flashinfer for decoding ( #4353 )
...
Co-authored-by: LiuXiaoxuanPKU <llilyliupku@gmail.com>
2024-05-03 15:51:27 -07:00
Sebastian Schoennenbeck
f8e7adda21
Fix/async chat serving ( #2727 )
2024-05-03 11:04:14 -07:00
SangBin Cho
3521ba4f25
[Core][Model runner refactoring 1/N] Refactor attn metadata term ( #4518 )
2024-05-03 10:20:12 -07:00
youkaichao
344a5d0c33
[Core][Distributed] enable allreduce for multiple tp groups ( #4566 )
2024-05-02 17:32:33 -07:00
SangBin Cho
0f8a91401c
[Core] Ignore infeasible swap requests. ( #4557 )
2024-05-02 14:31:20 -07:00
Michał Moskal
32881f3f31
[kernel] fix sliding window in prefix prefill Triton kernel ( #4405 )
...
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
2024-05-02 11:23:37 -07:00
alexm-nm
7038e8b803
[Kernel] Support running GPTQ 8-bit models in Marlin ( #4533 )
2024-05-02 12:56:22 -04:00
youkaichao
2a85f93007
[Core][Distributed] enable multiple tp group ( #4512 )
...
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2024-05-02 04:28:21 +00:00
Ronen Schaffer
5e401bce17
[CI]Add regression tests to ensure the async engine generates metrics ( #4524 )
2024-05-01 19:57:12 -07:00
SangBin Cho
0d62fe58db
[Bug fix][Core] assert num_new_tokens == 1 fails when SamplingParams.n is not 1 and max_tokens is large & Add tests for preemption ( #4451 )
2024-05-01 19:24:13 -07:00
Danny Guinther
b8afa8b95a
[MISC] Rework logger to enable pythonic custom logging configuration to be provided ( #4273 )
2024-05-01 17:34:40 -07:00
sasha0552
c47ba4aaa9
[Bugfix] Add validation for seed ( #4529 )
2024-05-01 19:31:22 +00:00
Nick Hill
a657bfc48a
[Core] Add multiproc_worker_utils for multiprocessing-based workers ( #4357 )
2024-05-01 18:41:59 +00:00
leiwen83
24750f4cad
[Core] Enable prefix caching with block manager v2 enabled ( #4142 )
...
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
Co-authored-by: Sage Moore <sagemoore@utexas.edu>
2024-05-01 11:20:32 -07:00
leiwen83
b38e42fbca
[Speculative decoding] Add ngram prompt lookup decoding ( #4237 )
...
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
2024-05-01 11:13:03 -07:00
SangBin Cho
6f1df80436
[Test] Add ignore_eos test ( #4519 )
2024-05-01 08:45:42 -04:00
Jee Li
d6f4bd7cdd
[Misc]Add customized information for models ( #4132 )
2024-04-30 21:18:14 -07:00
Robert Caulk
c3845d82dc
Allow user to define whitespace pattern for outlines ( #4305 )
2024-04-30 20:48:39 -07:00
Florian Greinacher
a494140433
[Frontend] Support complex message content for chat completions endpoint ( #3467 )
...
Co-authored-by: Lily Liu <lilyliupku@gmail.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2024-04-30 16:28:46 -07:00
Robert Shaw
111815d482
[Kernel] Support Fp8 Checkpoints (Dynamic + Static) ( #4332 )
...
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: mgoin <michael@neuralmagic.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2024-04-30 21:46:12 +00:00
leiwen83
4bb53e2dde
[BugFix] fix num_lookahead_slots missing in async executor ( #4165 )
...
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
2024-04-30 10:12:59 -07:00
youkaichao
f4f921b7f1
[Core][Distributed] use cpu group to broadcast metadata in cpu ( #4444 )
2024-04-29 13:52:22 -07:00
Robert Shaw
73c8d677e5
[Kernel] Marlin Expansion: Support AutoGPTQ Models with Marlin ( #3922 )
...
Co-authored-by: alexm <alexm@neuralmagic.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
2024-04-29 09:35:34 -07:00