squall/vllm - vllm

Author	SHA1	Message	Date
leiwen83	1b8a0d71cf	[Core][Bugfix]: fix prefix caching for blockv2 (#5364 ) Signed-off-by: Lei Wen <wenlei03@qiyi.com> Co-authored-by: Lei Wen <wenlei03@qiyi.com>	2024-06-14 17:23:56 -07:00
youkaichao	48f589e18b	[mis] fix flaky test of test_cuda_device_count_stateless (#5546 )	2024-06-14 10:02:23 -07:00
Antoni Baum	50eed24d25	Add `cuda_device_count_stateless` (#5473 )	2024-06-13 16:06:49 -07:00
Tyler Michael Smith	33e3b37242	[CI/Build] Disable test_fp8.py (#5508 )	2024-06-13 13:37:48 -07:00
Tyler Michael Smith	85657b5607	[Kernel] Factor out epilogues from cutlass kernels (#5391 ) Co-authored-by: Michael Goin <michael@neuralmagic.com> Co-authored-by: youkaichao <youkaichao@gmail.com> Co-authored-by: zifeitong <zifei.tong@parasail.io> Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>	2024-06-13 11:22:19 -07:00
Cyrus Leung	39873476f8	[CI/Build] Simplify OpenAI server setup in tests (#5100 )	2024-06-13 11:21:53 -07:00
Michael Goin	23ec72fa03	[CI/Build][REDO] Add is_quant_method_supported to control quantization test configurations (#5466 )	2024-06-13 15:18:08 +00:00
Dipika Sikka	c2637a613b	[Kernel] `w4a16` support for `compressed-tensors` (#5385 ) Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>	2024-06-13 10:19:56 -04:00
youkaichao	ea3890a5f0	[Core][Distributed] code deduplication in tp&pp with coordinator(#5293 ) [Core][Distributed] add coordinator to reduce code duplication in tp and pp (#5293)	2024-06-12 17:27:08 -07:00
Travis Johnson	51602eefd3	[Frontend] [Core] Support for sharded tensorized models (#4990 ) Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com> Co-authored-by: Sanger Steel <sangersteel@gmail.com> Co-authored-by: Roger Wang <ywang@roblox.com>	2024-06-12 14:13:52 -07:00
Cody Yu	5985e3427d	[Kernel] Vectorized FP8 quantize kernel (#5396 ) Inspired by #5146, this PR improves the FP8 quantize kernel by vectorizing data transfer to better utilize memory bandwidth. A microbenchmark shows that the improved kernel achieves a 1.0x-1.5x speedup (especially when the hidden size is large). In detail, we applied 3 optimizations: - Use an inverted scale so that most divisions become multiplications. - Unroll the loop 4 times to improve ILP. - Use 4-wide vectorized accesses to transfer data between HBM and SRAM.	2024-06-12 14:07:26 -07:00
SangBin Cho	847cdcca1c	[CI] Upgrade codespell version. (#5381 )	2024-06-12 10:06:14 -07:00
Simon Mo	e3c12bf6d2	Revert "[CI/Build] Add `is_quant_method_supported` to control quantization test configurations" (#5463 )	2024-06-12 10:03:24 -07:00
Michael Goin	3dd6853bc8	[CI/Build] Add `is_quant_method_supported` to control quantization test configurations (#5253 )	2024-06-12 09:58:02 -07:00
Nick Hill	99dac099ab	[Core][Doc] Default to multiprocessing for single-node distributed case (#5230 ) Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>	2024-06-11 11:10:41 -07:00
youkaichao	c4bd03c7c5	[Core][Distributed] add same-node detection (#5369 )	2024-06-11 10:53:59 -07:00
sasha0552	dcbf4286af	[Frontend] Customizable RoPE theta (#5197 )	2024-06-11 10:42:26 -07:00
Cyrus Leung	640052b069	[Bugfix][Frontend] Cleanup "fix chat logprobs" (#5026 )	2024-06-10 22:36:46 -07:00
maor-ps	351d5e7b82	[Bugfix] OpenAI entrypoint limits logprobs while ignoring server defined --max-logprobs (#5312 ) Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>	2024-06-11 10:30:31 +08:00
Itay Etelis	774d1035e4	[Feature][Frontend]: Continued `stream_options` implementation also in CompletionRequest (#5319 )	2024-06-10 14:22:09 +00:00
Cyrus Leung	6b29d6fe70	[Model] Initial support for LLaVA-NeXT (#4199 ) Co-authored-by: Roger Wang <ywang@roblox.com>	2024-06-10 12:47:15 +00:00
Cyrus Leung	0bfa1c4f13	[Misc] Improve error message when LoRA parsing fails (#5194 )	2024-06-10 19:38:49 +08:00
Dipika Sikka	5884c2b454	[Misc] Update to comply with the new `compressed-tensors` config (#5350 ) Co-authored-by: Michael Goin <michael@neuralmagic.com>	2024-06-10 03:49:46 +00:00
bnellnm	5467ac3196	[Kernel][Misc] Use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops (#5047 )	2024-06-09 16:23:30 -04:00
youkaichao	5d7e3d0176	[mis][ci/test] fix flaky test in test_sharded_state_loader.py (#5361 ) [mis][ci/test] fix flaky test in tests/test_sharded_state_loader.py (#5361)	2024-06-09 03:50:14 +00:00
youkaichao	8ea5e44a43	[CI/Test] improve robustness of test (vllm_runner) (#5357 ) [CI/Test] improve robustness of test by replacing del with context manager (vllm_runner) (#5357)	2024-06-08 08:59:20 +00:00
youkaichao	9fb900f90c	[CI/Test] improve robustness of test (hf_runner) (#5347 ) [CI/Test] improve robustness of test by replacing del with context manager (hf_runner) (#5347)	2024-06-07 22:31:32 -07:00
Roger Wang	7a9cb294ae	[Frontend] Add OpenAI Vision API Support (#5237 ) Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>	2024-06-07 11:23:32 -07:00
Dipika Sikka	ca3ea51bde	[Kernel] Dynamic Per-Token Activation Quantization (#5037 ) Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com> Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>	2024-06-07 09:36:26 -07:00
youkaichao	388596c914	[Misc][Utils] allow get_open_port to be called for multiple times (#5333 )	2024-06-06 22:15:11 -07:00
Itay Etelis	baa15a9ec3	[Feature][Frontend]: Add support for `stream_options` in `ChatCompletionRequest` (#5135 )	2024-06-07 03:29:24 +00:00
Antoni Baum	ccdc490dda	[Core] Change LoRA embedding sharding to support loading methods (#5038 )	2024-06-06 19:07:57 -07:00
Matthew Goldey	828da0d44e	[Frontend] enable passing multiple LoRA adapters at once to generate() (#5300 )	2024-06-06 15:48:13 -05:00
liuyhwangyh	4efff036f0	Bugfix: fix broken of download models from modelscope (#5233 ) Co-authored-by: mulin.lyh <mulin.lyh@taobao.com>	2024-06-06 09:28:10 -07:00
Cyrus Leung	89c920785f	[CI/Build] Update vision tests (#5307 )	2024-06-06 05:17:18 -05:00
Breno Faria	7b0a0dfb22	[Frontend][Core] Update Outlines Integration from `FSM` to `Guide` (#4109 ) Co-authored-by: Simon Mo <simon.mo@hey.com> Co-authored-by: Breno Faria <breno.faria@intrafind.com>	2024-06-05 16:49:12 -07:00
Nick Hill	faf71bcd4b	[Speculative Decoding] Add `ProposerWorkerBase` abstract class (#5252 )	2024-06-05 14:53:05 -07:00
Woosuk Kwon	41ca62cf03	[Misc] Add CustomOp interface for device portability (#5255 )	2024-06-05 09:18:19 -07:00
zifeitong	974fc9b845	[Bugfix] Fix prompt_logprobs when SamplingParams.detokenize is set to True (#5226 )	2024-06-04 19:37:28 -07:00
Cyrus Leung	9ba093b4f4	[CI/Build] Simplify model loading for `HfRunner` (#5251 )	2024-06-04 10:09:19 -07:00
Cyrus Leung	ec784b2526	[CI/Build] Add inputs tests (#5215 )	2024-06-03 21:01:46 -07:00
afeldman-nm	f42a006b15	[Bugfix]: During testing, use pytest monkeypatch for safely overriding the env var that indicates the vLLM backend (#5210 )	2024-06-03 20:32:57 -07:00
Toshiki Kataoka	06b2550cbb	[Bugfix] Support `prompt_logprobs==0` (#5217 )	2024-06-03 17:59:30 -07:00
Breno Faria	f775a07e30	[FRONTEND] OpenAI `tools` support named functions (#5032 )	2024-06-03 18:25:29 -05:00
Kaiyang Chen	10c38e3e46	[Misc]: Implement CPU/GPU swapping in BlockManagerV2 (#3834 )	2024-06-03 13:37:11 -07:00
Yuan	cafb8e06c5	[CI/BUILD] enable intel queue for longer CPU tests (#4113 )	2024-06-03 10:39:50 -07:00
Tyler Michael Smith	cbb2f59cc8	[Kernel] Pass a device pointer into the quantize kernel for the scales (#5159 )	2024-06-03 09:52:30 -07:00
Cyrus Leung	7a64d24aad	[Core] Support image processor (#4197 )	2024-06-02 22:56:41 -07:00
Cyrus Leung	dfbe60dc62	[Misc] Simplify code and fix type annotations in `conftest.py` (#5118 )	2024-06-02 16:05:50 -07:00
Simon Mo	ed59a7ed23	Update test_ignore_eos (#4898 )	2024-06-02 02:21:53 +00:00
chenqianfzh	b9c0605a8e	[Feature][Kernel] Support bitsandbytes quantization and QLoRA (#4776 )	2024-06-01 14:51:10 -06:00
Varun Sundar Rabindranath	f081c3ce4b	[Kernel] Update Cutlass fp8 configs (#5144 ) Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com> Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>	2024-06-01 08:46:07 +00:00
Tyler Michael Smith	260d119e86	[Kernel] Refactor CUTLASS kernels to always take scales that reside on the GPU (#5137 )	2024-06-01 06:45:32 +00:00
SnowDist	a22dea54d3	[Model] Support MAP-NEO model (#5081 ) Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>	2024-05-30 19:24:41 -07:00
Breno Faria	87d41c849d	[BUGFIX] [FRONTEND] Correct chat logprobs (#5029 ) Co-authored-by: Breno Faria <breno.faria@intrafind.com>	2024-05-30 02:52:14 -07:00
Cyrus Leung	b1c255630d	[Core] Avoid the need to pass `None` values to `Sequence.inputs` (#5099 )	2024-05-29 16:05:01 -07:00
Cyrus Leung	eecd864388	[Bugfix][CI/Build] Fix test and improve code for `merge_async_iterators` (#5096 )	2024-05-29 16:02:25 -07:00
afeldman-nm	4238bc82f2	[Core] Cross-attention KV caching and memory-management (towards eventual encoder/decoder model support) (#4837 )	2024-05-29 16:09:13 +00:00
Cyrus Leung	18c1f16d86	[Bugfix] Fix arguments passed to `Sequence` in stop checker test (#5092 )	2024-05-29 07:16:41 +00:00
youkaichao	5bd3c65072	[Core][Optimization] remove vllm-nccl (#5091 )	2024-05-29 05:13:52 +00:00
Junichi Sato	dfba529b40	[Bugfix] Remove the last EOS token unless explicitly specified (#5077 )	2024-05-28 17:15:35 -07:00
Cyrus Leung	5ae5ed1e60	[Core] Consolidate prompt arguments to LLM engines (#4328 ) Co-authored-by: Roger Wang <ywang@roblox.com>	2024-05-28 13:29:31 -07:00
Michał Moskal	d4f3985907	[Core] Sliding window for block manager v2 (#4545 ) Co-authored-by: Ruth Evans <ruthevans@Ruths-MacBook-Pro.local>	2024-05-28 11:07:07 +09:00
Zhuohan Li	1102bef219	[Bugfix / Core] Prefix Caching Guards (merged with main) (#4846 ) Co-authored-by: rsnm2 <rshaw@neuralmagic.com> Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>	2024-05-27 15:18:17 -07:00
Lily Liu	d5a1697772	[Dynamic Spec Decoding] Minor fix for disabling speculative decoding (#5000 )	2024-05-25 10:00:14 -07:00
Eric Xihui Lin	8e192ff967	[Kernel][Backend][Model] Blocksparse flash attention kernel and Phi-3-Small model (#4799 ) Co-authored-by: beagleski <yunanzhang@microsoft.com> Co-authored-by: bapatra <bapatra@microsoft.com> Co-authored-by: Barun Patra <codedecde@users.noreply.github.com> Co-authored-by: Michael Goin <michael@neuralmagic.com>	2024-05-24 22:00:52 -07:00
leiwen83	e64fde4b01	[Core][Bugfix]: fix prefix caching for blockv2 (#4764 ) Co-authored-by: Lei Wen <wenlei03@qiyi.com>	2024-05-24 10:07:09 -07:00
Robert Shaw	919770957f	[Bugfix] Fix Mistral v0.3 Weight Loading (#5005 ) Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>	2024-05-24 12:28:27 +00:00
Dipika Sikka	a1242324c9	[Kernel] Initial Activation Quantization Support (#4525 ) Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com> Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>	2024-05-23 21:29:18 +00:00
Murali Andoorveedu	5eda2ea02a	[Core][1/N] Support send/recv in PyNCCL Groups (#4988 ) Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>	2024-05-23 09:54:48 -07:00
Alexander Matveev	6066253296	Marlin 24 prefill performance improvement (about 25% better on average) (#4983 )	2024-05-23 02:39:27 -04:00
Cody Yu	ee3eea0a1b	[Misc] Take user preference in attention selector (#4960 )	2024-05-23 07:55:56 +09:00
raywanb	97b030005c	[Model] LoRA gptbigcode implementation (#3949 )	2024-05-22 13:58:59 -07:00
Cody Yu	a3a73ab069	[Misc] Load FP8 kv-cache scaling factors from checkpoints (#4893 ) The 2nd PR for #4532. This PR supports loading FP8 kv-cache scaling factors from a FP8 checkpoint (with .kv_scale parameter).	2024-05-22 13:28:20 -07:00
Tyler Michael Smith	8674f9880e	[Kernel] Fixup for CUTLASS kernels in CUDA graphs (#4954 ) Pass the CUDA stream into the CUTLASS GEMMs, to avoid future issues with CUDA graphs	2024-05-22 14:10:43 +00:00
SangBin Cho	c74c913bfb	[misc] remove comments that were supposed to be removed (#4977 )	2024-05-22 09:02:58 -04:00
sasha0552	9b9a10d6cb	[Frontend] Dynamic RoPE scaling (#4638 )	2024-05-22 01:32:35 -04:00
Isotr0py	f12c3b5b3d	[Model] Add Phi-2 LoRA support (#4886 )	2024-05-21 14:24:17 +09:00
Alexei-V-Ivanov-AMD	943e72ca56	[Build/CI] Enabling AMD Entrypoints Test (#4834 ) Co-authored-by: Alexey Kondratiev <alexey.kondratiev@amd.com>	2024-05-20 11:29:28 -07:00
Woosuk Kwon	b57e6c5949	[Kernel] Add flash-attn back (#4907 )	2024-05-19 18:11:30 -07:00
Alexander Matveev	27ce85476e	[Kernel] Add marlin_24 unit tests (#4901 )	2024-05-19 11:37:34 -04:00
Cyrus Leung	f68470e803	[Bugfix][Model] Add base class for vision-language models (#4809 )	2024-05-19 00:13:33 -07:00
SangBin Cho	2e9a2227ec	[Lora] Support long context lora (#4787 ) Currently we need to call the rotary embedding kernel for each LoRA, which makes it hard to serve multiple long-context LoRAs. This PR adds a batched rotary embedding kernel and pipes it through. It replaces the rotary embedding layer with one that is aware of multiple cos-sin caches for different scaling factors. Follow-up to https://github.com/vllm-project/vllm/pull/3095/files	2024-05-18 16:05:23 +09:00
Jinzhen Lin	33e0823de5	[Bugfix] fix rope error when load models with different dtypes (#4835 )	2024-05-17 18:43:34 +09:00
Alexei-V-Ivanov-AMD	26148120b3	[Build/CI] Extending the set of AMD tests with Regression, Basic Correctness, Distributed, Engine, Llava Tests (#4797 )	2024-05-16 20:58:25 -07:00
Tyler Michael Smith	2060e93659	[Kernel] Add w8a8 CUTLASS kernels (#4749 )	2024-05-16 18:32:50 -04:00
Silencio	8435b207af	[Kernel] Add punica dimension for Qwen1.5-32B LoRA (#4850 ) Co-authored-by: Silencio <silencio@adsl-99-6-187-6.dsl.irvnca.sbcglobal.net>	2024-05-16 11:16:09 -07:00
youkaichao	e08188081b	[Core][Distributed] remove graph mode function (#4818 )	2024-05-16 10:59:52 -07:00
Alexander Matveev	6979ade384	Add GPTQ Marlin 2:4 sparse structured support (#4790 ) Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>	2024-05-16 12:56:15 -04:00
Jinzhen Lin	99caa49106	[Kernel] add bfloat16 support for gptq marlin kernel (#4788 )	2024-05-16 09:55:29 -04:00
alexm-nm	5c342570d7	Add marlin unit tests and marlin benchmark script (#4815 )	2024-05-16 09:36:49 -04:00
Cody Yu	973617ae02	[Speculative decoding][Re-take] Enable TP>1 speculative decoding (#4840 ) Co-authored-by: Cade Daniel <edacih@gmail.com> Co-authored-by: Cade Daniel <cade@anyscale.com>	2024-05-16 00:53:51 -07:00
Aurick Qiao	30e754390c	[Core] Implement sharded state loader (#4690 ) Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>	2024-05-15 22:11:54 -07:00
Alex Wu	52f8107cf2	[Frontend] Support OpenAI batch file format (#4794 ) Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>	2024-05-15 19:13:36 -04:00
Cyrus Leung	fc0d9dfc3a	[Frontend] Re-enable custom roles in Chat Completions API (#4758 )	2024-05-15 14:58:46 -07:00
Cyrus Leung	e9cdd2b1e2	[CI/Build] Further decouple HuggingFace implementation from ours during tests (#4166 )	2024-05-14 23:38:40 -07:00
SangBin Cho	65bf2ac165	[Core][2/N] Model runner refactoring part 2. Combine prepare prefill / decode to a single API (#4681 ) This PR combines prepare_prompt and prepare_decode into a single API. It also coalesces the attn metadata for prefill/decode into a single class and allows slicing them when running the attn backend. It also refactors subquery_start_loc, which was not refactored in the previous PR.	2024-05-15 14:00:10 +09:00
SangBin Cho	8a7cc254a0	Revert "[Kernel] Use flash-attn for decoding (#3648 )" (#4820 ) LoRA tests 3 & 4 seem to hit an illegal memory access failure after this commit: [2024-05-14 23:51:18,182 E 22 22] logging.cc:101: Unhandled exception: N3c105ErrorE. what(): CUDA error: an illegal memory access was encountered. Example: https://buildkite.com/vllm/ci/builds/7382#018f793d-1527-4e1c-ab59-c3a34ec55241 This reverts commit `1356df5`.	2024-05-15 11:52:45 +09:00
Nick Hill	676a99982f	[Core] Add MultiprocessingGPUExecutor (#4539 ) Co-authored-by: SAHIL SUNEJA <suneja@us.ibm.com>	2024-05-14 10:38:59 -07:00
Stephen Krider	1356df53bd	[Kernel] Use flash-attn for decoding (#3648 ) Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> Co-authored-by: LiuXiaoxuanPKU <lilyliupku@gmail.com>	2024-05-13 15:50:33 -07:00
Cody Yu	ce532ff45c	[Speculative decoding] Improve n-gram efficiency (#4724 )	2024-05-13 15:00:13 -07:00
Sanger Steel	8bc68e198c	[Frontend] [Core] perf: Automatically detect vLLM-tensorized model, update `tensorizer` to version 2.9.0 (#4208 )	2024-05-13 14:57:07 -07:00
Woosuk Kwon	0fca3cdcf2	[Misc] Enhance attention selector (#4751 )	2024-05-13 10:47:25 -07:00
SangBin Cho	e7c46b9527	[Scheduler] Warning upon preemption and Swapping (#4647 ) Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>	2024-05-13 23:50:44 +09:00
Cyrus Leung	350f9e107f	[CI/Build] Move `test_utils.py` to `tests/utils.py` (#4425 ) Since #4335 was merged, I've noticed that the definition of ServerRunner in the tests is the same as in the test for OpenAI API. I have moved the class to the test utilities to avoid code duplication. (Although it only has been repeated twice so far, I will add another similar test suite in #4200 which would duplicate the code a third time) Also, I have moved the test utilities file (test_utils.py) to under the test directory (tests/utils.py), since none of its code is actually used in the main package. Note that I have added __init__.py to each test subpackage and updated the ray.init() call in the test utilities file in order to relative import tests/utils.py.	2024-05-13 23:50:09 +09:00
youkaichao	702bee461f	[Core][Distributed] refactor custom allreduce to support multiple tp groups (#4754 )	2024-05-12 17:47:59 -07:00
Robert Shaw	a709e87a4f	[CI/Build] Tweak Marlin Nondeterminism Issues (#4713 )	2024-05-12 17:46:31 -07:00
Chang Su	e254497b66	[Model][Misc] Add e5-mistral-7b-instruct and Embedding API (#3734 )	2024-05-11 11:30:37 -07:00
youkaichao	4e12131089	[Core][Test] fix function name typo in custom allreduce (#4750 )	2024-05-10 15:14:40 -07:00
Robert Shaw	fcc2994be6	[CI] Nits for bad initialization of SeqGroup in testing (#4748 )	2024-05-10 18:01:01 -04:00
heeju-kim2	2e7796f2cf	[Speculative decoding] CUDA graph support (#4295 ) Co-authored-by: Cade Daniel <edacih@gmail.com>	2024-05-10 17:36:25 +00:00
SangBin Cho	6a0f617210	[Core] Fix circular reference which leaked llm instance in local dev env (#4737 ) Storing an exception frame is extremely prone to circular references because it contains references to objects. When tensorizer is not installed, it leaks the llm instance because the error frame has references to various modules, which causes a circular reference problem. I also found spec decoding has a circular reference issue, and I solved it using weakref.proxy.	2024-05-10 23:54:32 +09:00
Allen.Dou	e965d46184	[Misc] Keep only one implementation of the create_dummy_prompt function. (#4716 )	2024-05-09 21:42:38 -07:00
youkaichao	208b71bcc1	[Core][Distributed] refactor pynccl (#4591 ) [Core][Distributed] refactor pynccl to hold multiple communicators (#4591)	2024-05-09 19:48:43 -07:00
Cody Yu	c833101740	[Kernel] Refactor FP8 kv-cache with NVIDIA float8_e4m3 support (#4535 )	2024-05-09 18:04:17 -06:00
Woosuk Kwon	0ee535b294	[Misc] Set block size at initialization & Fix test_model_runner (#4705 )	2024-05-09 09:04:59 -07:00
Woosuk Kwon	190bc838e1	[Misc] Remove unnecessary ModelRunner imports (#4703 )	2024-05-09 00:17:17 -07:00
Cyrus Leung	f12b20decc	[Frontend] Move async logic outside of constructor (#4674 )	2024-05-08 22:48:33 -07:00
Cody Yu	f942efb5a3	[Dynamic Spec Decoding] Auto-disable by the running queue size (#4592 ) Co-authored-by: Cade Daniel <edacih@gmail.com>	2024-05-08 21:44:00 +00:00
youkaichao	230c4b38c1	[CI/Test] fix swap test for multi gpu (#4689 )	2024-05-08 13:14:02 -07:00
youkaichao	20cfcdec99	[Core][Optimization] change python dict to pytorch tensor for blocks to swap (#4659 )	2024-05-08 12:07:05 -07:00
DefTruth	0f9a6e3d22	[Bugfix][Kernel] allow non-power-of-2 for prefix prefill with alibi (#4573 )	2024-05-08 09:19:58 -07:00
SangBin Cho	f6a593093a	[CI] Make mistral tests pass (#4596 )	2024-05-08 08:44:35 -07:00
youkaichao	cc466a3290	[Core][Distributed] support cpu&device in broadcast tensor dict (#4660 ) [Core][Distributed] support both cpu and device tensor in broadcast tensor dict (#4660)	2024-05-07 19:34:47 -07:00
leiwen83	8344f7742b	[Bug fix][Core] fixup ngram not setup correctly (#4551 ) Co-authored-by: Lei Wen <wenlei03@qiyi.com> Co-authored-by: Cade Daniel <edacih@gmail.com> Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>	2024-05-07 11:40:18 -07:00
youkaichao	469f85c782	[Core][Optimization] change copy-on-write from dict[int, list] to list (#4648 )	2024-05-07 11:06:32 -07:00
youkaichao	63575bc2e1	[Core][Optimization] change python dict to pytorch tensor (#4607 )	2024-05-06 21:30:27 -07:00
DearPlanet	4302987069	[Bugfix] Fix inappropriate content of model_name tag in Prometheus metrics (#3937 )	2024-05-04 15:39:34 -07:00
Michael Goin	2a052011ca	[Kernel] Support MoE Fp8 Checkpoints for Mixtral (Static Weights with Dynamic/Static Activations) (#4527 ) Follow on to #4332 to enable FP8 checkpoint loading for Mixtral and supersedes #4436. This PR enables the following checkpoint loading features for Mixtral: Supports loading fp8 checkpoints for Mixtral, such as this "nm-testing/Mixtral-8x7B-Instruct-v0.1-FP8" test model Supports static or dynamic activation quantization with static weight quantization (all per tensor) Supports different scales for each expert weight Supports Fp8 in QKV layer Notes: The Expert Gate/Router always runs at half / full precision for now. If there are different weight scales between QKV layer (for separate QKV weights), they are re-quantized using layer.weight_scale.max() so we can have a single gemm for performance.	2024-05-04 11:45:16 -07:00
Cody Yu	bc8ad68455	[Misc][Refactor] Introduce ExecuteModelData (#4540 )	2024-05-03 17:47:07 -07:00
Cade Daniel	ab50275111	[Speculative decoding] Support target-model logprobs (#4378 )	2024-05-03 15:52:01 -07:00
Lily Liu	43c413ec57	[Kernel] Use flashinfer for decoding (#4353 ) Co-authored-by: LiuXiaoxuanPKU <llilyliupku@gmail.com>	2024-05-03 15:51:27 -07:00
Sebastian Schoennenbeck	f8e7adda21	Fix/async chat serving (#2727 )	2024-05-03 11:04:14 -07:00
SangBin Cho	3521ba4f25	[Core][Model runner refactoring 1/N] Refactor attn metadata term (#4518 )	2024-05-03 10:20:12 -07:00
youkaichao	344a5d0c33	[Core][Distributed] enable allreduce for multiple tp groups (#4566 )	2024-05-02 17:32:33 -07:00
SangBin Cho	0f8a91401c	[Core] Ignore infeasible swap requests. (#4557 )	2024-05-02 14:31:20 -07:00
Michał Moskal	32881f3f31	[kernel] fix sliding window in prefix prefill Triton kernel (#4405 ) Co-authored-by: SangBin Cho <rkooo567@gmail.com>	2024-05-02 11:23:37 -07:00
alexm-nm	7038e8b803	[Kernel] Support running GPTQ 8-bit models in Marlin (#4533 )	2024-05-02 12:56:22 -04:00
youkaichao	2a85f93007	[Core][Distributed] enable multiple tp group (#4512 ) Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>	2024-05-02 04:28:21 +00:00
Ronen Schaffer	5e401bce17	[CI]Add regression tests to ensure the async engine generates metrics (#4524 )	2024-05-01 19:57:12 -07:00
SangBin Cho	0d62fe58db	[Bug fix][Core] assert num_new_tokens == 1 fails when SamplingParams.n is not 1 and max_tokens is large & Add tests for preemption (#4451 )	2024-05-01 19:24:13 -07:00
Danny Guinther	b8afa8b95a	[MISC] Rework logger to enable pythonic custom logging configuration to be provided (#4273 )	2024-05-01 17:34:40 -07:00
sasha0552	c47ba4aaa9	[Bugfix] Add validation for seed (#4529 )	2024-05-01 19:31:22 +00:00
Nick Hill	a657bfc48a	[Core] Add `multiproc_worker_utils` for multiprocessing-based workers (#4357 )	2024-05-01 18:41:59 +00:00
leiwen83	24750f4cad	[Core] Enable prefix caching with block manager v2 enabled (#4142 ) Co-authored-by: Lei Wen <wenlei03@qiyi.com> Co-authored-by: Sage Moore <sagemoore@utexas.edu>	2024-05-01 11:20:32 -07:00
leiwen83	b38e42fbca	[Speculative decoding] Add ngram prompt lookup decoding (#4237 ) Co-authored-by: Lei Wen <wenlei03@qiyi.com>	2024-05-01 11:13:03 -07:00
SangBin Cho	6f1df80436	[Test] Add ignore_eos test (#4519 )	2024-05-01 08:45:42 -04:00
Jee Li	d6f4bd7cdd	[Misc]Add customized information for models (#4132 )	2024-04-30 21:18:14 -07:00
Robert Caulk	c3845d82dc	Allow user to define whitespace pattern for outlines (#4305 )	2024-04-30 20:48:39 -07:00
Florian Greinacher	a494140433	[Frontend] Support complex message content for chat completions endpoint (#3467 ) Co-authored-by: Lily Liu <lilyliupku@gmail.com> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>	2024-04-30 16:28:46 -07:00
Robert Shaw	111815d482	[Kernel] Support Fp8 Checkpoints (Dynamic + Static) (#4332 ) Co-authored-by: Philipp Moritz <pcmoritz@gmail.com> Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> Co-authored-by: mgoin <michael@neuralmagic.com> Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com> Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>	2024-04-30 21:46:12 +00:00
leiwen83	4bb53e2dde	[BugFix] fix num_lookahead_slots missing in async executor (#4165 ) Co-authored-by: Lei Wen <wenlei03@qiyi.com>	2024-04-30 10:12:59 -07:00
youkaichao	f4f921b7f1	[Core][Distributed] use cpu group to broadcast metadata in cpu (#4444 )	2024-04-29 13:52:22 -07:00
Robert Shaw	73c8d677e5	[Kernel] Marlin Expansion: Support AutoGPTQ Models with Marlin (#3922 ) Co-authored-by: alexm <alexm@neuralmagic.com> Co-authored-by: mgoin <michael@neuralmagic.com>	2024-04-29 09:35:34 -07:00
Prashant Gupta	d6e520e170	[Core] Support offline use of local cache for models (#4374 ) Signed-off-by: Prashant Gupta <prashantgupta@us.ibm.com> Co-authored-by: Travis Johnson <tjohnson31415@gmail.com>	2024-04-27 09:59:55 -07:00
Nick Hill	81661da7b2	[BugFix] Fix `min_tokens` when `eos_token_id` is None (#4389 ) Co-authored-by: DefTruth <31974251+deftruth@users.noreply.github.com>	2024-04-27 09:52:46 -07:00
Ruoyu Qin	dfea173148	[Bugfix] Abort requests when the connection to /v1/completions is interrupted (#4363 )	2024-04-27 09:48:37 -07:00
Roy	7134303cbb	[Bugfix][Core] Fix get decoding config from ray (#4335 )	2024-04-27 11:30:08 +00:00
Austin Veselka	eefeb16464	[Kernel] Full Tensor Parallelism for LoRA Layers (#3524 ) Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>	2024-04-27 00:03:48 -07:00
Cyrus Leung	8947bc3c15	[Frontend][Bugfix] Disallow extra fields in OpenAI API (#4355 )	2024-04-27 05:08:24 +00:00
Cody Yu	a62aaf1df5	[Misc][Refactor] Generalize linear_method to be quant_method (#4373 )	2024-04-26 16:41:14 -04:00
SangBin Cho	603ad84815	[Core] Refactoring sampler and support prompt logprob for chunked prefill (#4309 )	2024-04-26 13:02:02 +00:00
Cyrus Leung	a74dee9b62	[Bugfix] Fix parameter name in `get_tokenizer` (#4107 )	2024-04-25 19:10:48 -07:00
Woosuk Kwon	468d761b32	[Misc] Reduce supported Punica dtypes (#4304 )	2024-04-23 18:54:33 -07:00
youkaichao	91f50a6fe2	[Core][Distributed] use cpu/gloo to initialize pynccl (#4248 )	2024-04-23 18:32:19 -07:00
Cyrus Leung	1e8f4252aa	[Bugfix][Frontend] Raise exception when file-like chat template fails to be opened (#4292 )	2024-04-23 18:19:03 +00:00
James Fleming	2b7949c1c2	AQLM CUDA support (#3287 ) Co-authored-by: mgoin <michael@neuralmagic.com>	2024-04-23 13:59:33 -04:00
Cade Daniel	62b8aebc6f	[Speculative decoding 7/9] Speculative decoding end-to-end correctness tests. (#3951 )	2024-04-23 08:02:36 +00:00
SangBin Cho	050f285ff6	[Core] Scheduling optimization 2 (#4280 )	2024-04-23 08:02:11 +00:00
SangBin Cho	ad8d696a99	[Core] Scheduler perf fix (#4270 )	2024-04-22 21:11:06 +00:00
GeauxEric	a37d815b83	Make initialization of tokenizer and detokenizer optional (#3748 ) Co-authored-by: Yun Ding <yunding@nvidia.com> Co-authored-by: Roger Wang <ywang@roblox.com>	2024-04-21 22:06:46 +00:00
nunjunj	91528575ec	[Frontend] multiple sampling params support (#3570 )	2024-04-20 00:11:57 -07:00
Cody Yu	a22cdea371	[Kernel][FP8] Initial support with dynamic per-tensor scaling (#4118 ) Provides initial support for FP8 computation. This PR is inspired by HuggingFace TGI: huggingface/text-generation-inference#1726 This feature can be enabled with --quantization fp8 or -q fp8 when launching an engine. Algorithm: We still load a model checkpoint in FP16/BF16. After the weights are loaded, Fp8LinearMethod calculates the per-tensor scaling factor of the weights and quantizes the weights accordingly. The scaling factor is then stored for future use. Meanwhile, the per-tensor scaling factor for activations is calculated in every forward pass. Initial Results: Currently tested Mistral-7B on 1xH100. With prompt length ~5 and decoding length 128: BF16: 1.47s FP8: 1.66s I'll try larger models and look for more performance bottlenecks. Meanwhile, you're welcome to try this code.	2024-04-20 04:28:57 +00:00
Ayush Rautwar	138485a82d	[Bugfix] Add fix for JSON whitespace (#4189 ) Co-authored-by: Ubuntu <ubuntu@ip-172-31-13-147.ec2.internal>	2024-04-19 20:49:22 -07:00
Jee Li	d17c8477f1	[Bugfix] Fix LoRA loading check (#4138 ) Co-authored-by: simon-mo <simon.mo@hey.com>	2024-04-19 00:59:54 -07:00
youkaichao	8a7a3e4436	[Core] add an option to log every function call to for debugging hang/crash in distributed inference (#4079 ) Co-authored-by: Simon Mo <simon.mo@hey.com>	2024-04-18 16:15:12 -07:00
James Whedbee	e1bb2fd52d	[Bugfix] Support logprobs when using guided_json and other constrained decoding fields (#4149 )	2024-04-18 21:12:55 +00:00
Michał Moskal	e8cc7967ff	[Bugfix][Kernel] allow non-power-of-two head sizes in prefix prefill (#4128 )	2024-04-18 00:51:28 -07:00
Michael Goin	53b018edcb	[Bugfix] Get available quantization methods from quantization registry (#4098 )	2024-04-18 00:21:55 -07:00
youkaichao	6dc1fc9cfe	[Core] nccl integrity check and test (#4155 ) [Core] Add integrity check during initialization; add test for it (#4155)	2024-04-17 22:28:52 -07:00
Shoichi Uchinami	a53222544c	[Kernel] Add punica dimension for Swallow-MS-7B LoRA (#4134 )	2024-04-17 10:02:45 -07:00
youkaichao	8438e0569e	[Core] RayWorkerVllm --> WorkerWrapper to reduce duplication (#4024 ) [Core] replace narrow-usage RayWorkerVllm to general WorkerWrapper to reduce code duplication (#4024)	2024-04-17 08:34:33 +00:00
Cade Daniel	e95cd87959	[Speculative decoding 6/9] Integrate speculative decoding with LLMEngine (#3894 )	2024-04-16 13:09:21 -07:00
Antoni Baum	69e1d2fb69	[Core] Refactor model loading code (#4097 )	2024-04-16 11:34:39 -07:00
Noam Gat	05434764cd	LM Format Enforcer Guided Decoding Support (#3868 ) Co-authored-by: Simon Mo <simon.mo@hey.com>	2024-04-16 05:54:57 +00:00
SangBin Cho	4e7ee664e2	[Core] Fix engine-use-ray broken (#4105 )	2024-04-16 05:24:53 +00:00
Sanger Steel	711a000255	[Frontend] [Core] feat: Add model loading using `tensorizer` (#3476 )	2024-04-13 17:13:01 -07:00
Jee Li	989ae2538d	[Kernel] Add punica dimension for Baichuan-13B (#4053 )	2024-04-13 07:55:05 -07:00
SangBin Cho	36729bac13	[Test] Test multiple attn backend for chunked prefill. (#4023 )	2024-04-12 09:56:57 -07:00
Jee Li	1096717ae9	[Core] Support LoRA on quantized models (#4012 )	2024-04-11 21:02:44 -07:00
Nick Hill	e46a60aa4c	[BugFix] Fix handling of stop strings and stop token ids (#3672 )	2024-04-11 15:34:12 -07:00
Antoni Baum	1e96c3341a	Add extra punica sizes to support bigger vocabs (#4015 )	2024-04-11 22:18:57 +00:00
Dylan Hawk	95e7d4a97c	Fix echo/logprob OpenAI completion bug (#3441 ) Co-authored-by: Dylan Hawk <dylanwawk@gmail.com>	2024-04-11 22:15:50 +00:00
Antoni Baum	a10d3056da	[Core] Set `linear_weights` directly on the layer (#3977 )	2024-04-11 16:35:51 -04:00
Kunshang Ji	e9da5a40c6	[Misc] Add indirection layer for custom ops (#3913 )	2024-04-10 20:26:07 -07:00
SangBin Cho	e42df7227d	[Test] Add xformer and flash attn tests (#3961 ) Co-authored-by: Simon Mo <simon.mo@hey.com>	2024-04-11 03:09:50 +00:00
SangBin Cho	67b4221a61	[Core][5/N] Fully working chunked prefill e2e (#3884 )	2024-04-10 17:56:48 -07:00
youkaichao	63e7176f26	[Core][Refactor] move parallel_utils into vllm/distributed (#3950 ) [WIP][Core][Refactor] move vllm/model_executor/parallel_utils into vllm/distributed and vllm/device_communicators (#3950)	2024-04-10 15:33:30 -07:00
Travis Johnson	0258b7a94b	[Bugfix] handle prompt_logprobs in _apply_min_tokens_penalty (#3876 ) Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>	2024-04-10 01:39:56 -07:00
胡译文	b3104b2a10	[Bugfix] Fix logits processor when prompt_logprobs is not None (#3899 )	2024-04-10 00:09:36 -07:00
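
Commit 6a0f617210 above mentions breaking a circular reference with `weakref.proxy`. The following is a minimal sketch of that general technique, not the code from the commit: the `Engine` and `Worker` classes and the `freed_without_gc` helper are hypothetical names made up for illustration, and the behaviour shown relies on CPython's reference counting.

```python
import gc
import weakref


class Worker:
    """Hypothetical child object that needs to call back into its owner."""

    def __init__(self, owner, use_proxy):
        # A strong back-reference creates an owner -> worker -> owner cycle.
        # weakref.proxy forwards attribute access but does not keep the owner alive.
        self.owner = weakref.proxy(owner) if use_proxy else owner


class Engine:
    """Hypothetical owner object (stands in for something like an LLM instance)."""

    def __init__(self, use_proxy):
        self.worker = Worker(self, use_proxy)


def freed_without_gc(use_proxy):
    """Return True if dropping the last name frees the Engine by refcounting alone."""
    gc.disable()  # rule out the cycle collector so only refcounting is at work
    try:
        engine = Engine(use_proxy)
        probe = weakref.ref(engine)  # observes whether the object is still alive
        del engine
        return probe() is None
    finally:
        gc.enable()
        gc.collect()  # clean up the cycle left by the strong-reference case


if __name__ == "__main__":
    print("freed with proxy:", freed_without_gc(True))
    print("freed with strong back-reference:", freed_without_gc(False))
```

With the proxy, the owner is released as soon as its last strong reference goes away; with a strong back-reference, it lingers until the cycle collector runs, which is the kind of leak the commit message describes.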


534 Commits