squall/vllm - vllm - Gitea: Git with a cup of tea

Author	SHA1	Message	Date
Tyler Michael Smith	2060e93659	[Kernel] Add w8a8 CUTLASS kernels (#4749 )	2024-05-16 18:32:50 -04:00
Silencio	8435b207af	[Kernel] Add punica dimension for Qwen1.5-32B LoRA (#4850 ) Co-authored-by: Silencio <silencio@adsl-99-6-187-6.dsl.irvnca.sbcglobal.net>	2024-05-16 11:16:09 -07:00
youkaichao	e08188081b	[Core][Distributed] remove graph mode function (#4818 )	2024-05-16 10:59:52 -07:00
Alexander Matveev	6979ade384	Add GPTQ Marlin 2:4 sparse structured support (#4790 ) Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>	2024-05-16 12:56:15 -04:00
Jinzhen Lin	99caa49106	[Kernel] add bfloat16 support for gptq marlin kernel (#4788 )	2024-05-16 09:55:29 -04:00
alexm-nm	5c342570d7	Add marlin unit tests and marlin benchmark script (#4815 )	2024-05-16 09:36:49 -04:00
Cody Yu	973617ae02	[Speculative decoding][Re-take] Enable TP>1 speculative decoding (#4840 ) Co-authored-by: Cade Daniel <edacih@gmail.com> Co-authored-by: Cade Daniel <cade@anyscale.com>	2024-05-16 00:53:51 -07:00
Aurick Qiao	30e754390c	[Core] Implement sharded state loader (#4690 ) Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>	2024-05-15 22:11:54 -07:00
Alex Wu	52f8107cf2	[Frontend] Support OpenAI batch file format (#4794 ) Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>	2024-05-15 19:13:36 -04:00
Cyrus Leung	fc0d9dfc3a	[Frontend] Re-enable custom roles in Chat Completions API (#4758 )	2024-05-15 14:58:46 -07:00
Cyrus Leung	e9cdd2b1e2	[CI/Build] Further decouple HuggingFace implementation from ours during tests (#4166 )	2024-05-14 23:38:40 -07:00
SangBin Cho	65bf2ac165	[Core][2/N] Model runner refactoring part 2. Combine prepare prefill / decode to a single API (#4681 ) This PR combines prepare_prompt and prepare_decode into a single API. This PR also coelsce the attn metadata for prefill/decode to a single class and allow to slice them when running attn backend. It also refactors subquery_start_loc which was not refactored in the previous PR	2024-05-15 14:00:10 +09:00
SangBin Cho	8a7cc254a0	Revert "[Kernel] Use flash-attn for decoding (#3648 )" (#4820 ) Lora 3 & 4 test seems to have illegal memory access failure after this commit; [2024-05-14 23:51:18,182 E 22 22] logging.cc:101: Unhandled exception: N3c105ErrorE. what(): CUDA error: an illegal memory access was encountered <br class="Apple-interchange-newline"> Exmaple: https://buildkite.com/vllm/ci/builds/7382#018f793d-1527-4e1c-ab59-c3a34ec55241 This reverts commit `1356df5`. FILL IN THE PR DESCRIPTION HERE FIX #xxxx (link existing issues this PR will resolve)	2024-05-15 11:52:45 +09:00
Nick Hill	676a99982f	[Core] Add MultiprocessingGPUExecutor (#4539 ) Co-authored-by: SAHIL SUNEJA <suneja@us.ibm.com>	2024-05-14 10:38:59 -07:00
Stephen Krider	1356df53bd	[Kernel] Use flash-attn for decoding (#3648 ) Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> Co-authored-by: LiuXiaoxuanPKU <lilyliupku@gmail.com>	2024-05-13 15:50:33 -07:00
Cody Yu	ce532ff45c	[Speculative decoding] Improve n-gram efficiency (#4724 )	2024-05-13 15:00:13 -07:00
Sanger Steel	8bc68e198c	[Frontend] [Core] perf: Automatically detect vLLM-tensorized model, update `tensorizer` to version 2.9.0 (#4208 )	2024-05-13 14:57:07 -07:00
Woosuk Kwon	0fca3cdcf2	[Misc] Enhance attention selector (#4751 )	2024-05-13 10:47:25 -07:00
SangBin Cho	e7c46b9527	[Scheduler] Warning upon preemption and Swapping (#4647 ) Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>	2024-05-13 23:50:44 +09:00
Cyrus Leung	350f9e107f	[CI/Build] Move `test_utils.py` to `tests/utils.py` (#4425 ) Since #4335 was merged, I've noticed that the definition of ServerRunner in the tests is the same as in the test for OpenAI API. I have moved the class to the test utilities to avoid code duplication. (Although it only has been repeated twice so far, I will add another similar test suite in #4200 which would duplicate the code a third time) Also, I have moved the test utilities file (test_utils.py) to under the test directory (tests/utils.py), since none of its code is actually used in the main package. Note that I have added __init__.py to each test subpackage and updated the ray.init() call in the test utilities file in order to relative import tests/utils.py.	2024-05-13 23:50:09 +09:00
youkaichao	702bee461f	[Core][Distributed] refactor custom allreduce to support multiple tp groups (#4754 )	2024-05-12 17:47:59 -07:00
Robert Shaw	a709e87a4f	[CI/Build] Tweak Marlin Nondeterminism Issues (#4713 )	2024-05-12 17:46:31 -07:00
Chang Su	e254497b66	[Model][Misc] Add e5-mistral-7b-instruct and Embedding API (#3734 )	2024-05-11 11:30:37 -07:00
youkaichao	4e12131089	[Core][Test] fix function name typo in custom allreduce (#4750 )	2024-05-10 15:14:40 -07:00
Robert Shaw	fcc2994be6	[CI] Nits for bad initialization of SeqGroup in testing (#4748 )	2024-05-10 18:01:01 -04:00
heeju-kim2	2e7796f2cf	[Speculative decoding] CUDA graph support (#4295 ) Co-authored-by: Cade Daniel <edacih@gmail.com>	2024-05-10 17:36:25 +00:00
SangBin Cho	6a0f617210	[Core] Fix circular reference which leaked llm instance in local dev env (#4737 ) Storing exception frame is extremely prone to circular refernece because it contains the reference to objects. When tensorizer is not installed, it leaks llm instance because error frame has references to various modules which cause circular reference problem. I also found spec decoding has a circular reference issue, and I solved it using weakref.proxy.	2024-05-10 23:54:32 +09:00
Allen.Dou	e965d46184	[Misc] Keep only one implementation of the create_dummy_prompt function. (#4716 )	2024-05-09 21:42:38 -07:00
youkaichao	208b71bcc1	[Core][Distributed] refactor pynccl (#4591 ) [Core][Distributed] refactor pynccl to hold multiple communicators (#4591)	2024-05-09 19:48:43 -07:00
Cody Yu	c833101740	[Kernel] Refactor FP8 kv-cache with NVIDIA float8_e4m3 support (#4535 )	2024-05-09 18:04:17 -06:00
Woosuk Kwon	0ee535b294	[Misc] Set block size at initialization & Fix test_model_runner (#4705 )	2024-05-09 09:04:59 -07:00
Woosuk Kwon	190bc838e1	[Misc] Remove unnecessary ModelRunner imports (#4703 )	2024-05-09 00:17:17 -07:00
Cyrus Leung	f12b20decc	[Frontend] Move async logic outside of constructor (#4674 )	2024-05-08 22:48:33 -07:00
Cody Yu	f942efb5a3	[Dynamic Spec Decoding] Auto-disable by the running queue size (#4592 ) Co-authored-by: Cade Daniel <edacih@gmail.com>	2024-05-08 21:44:00 +00:00
youkaichao	230c4b38c1	[CI/Test] fix swap test for multi gpu (#4689 )	2024-05-08 13:14:02 -07:00
youkaichao	20cfcdec99	[Core][Optimization] change python dict to pytorch tensor for blocks to swap (#4659 )	2024-05-08 12:07:05 -07:00
DefTruth	0f9a6e3d22	[Bugfix][Kernel] allow non-power-of-2 for prefix prefill with alibi (#4573 )	2024-05-08 09:19:58 -07:00
SangBin Cho	f6a593093a	[CI] Make mistral tests pass (#4596 )	2024-05-08 08:44:35 -07:00
youkaichao	cc466a3290	[Core][Distributed] support cpu&device in broadcast tensor dict (#4660 ) [Core][Distributed] support both cpu and device tensor in broadcast tensor dict (#4660)	2024-05-07 19:34:47 -07:00
leiwen83	8344f7742b	[Bug fix][Core] fixup ngram not setup correctly (#4551 ) Co-authored-by: Lei Wen <wenlei03@qiyi.com> Co-authored-by: Cade Daniel <edacih@gmail.com> Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>	2024-05-07 11:40:18 -07:00
youkaichao	469f85c782	[Core][Optimization] change copy-on-write from dict[int, list] to list (#4648 )	2024-05-07 11:06:32 -07:00
youkaichao	63575bc2e1	[Core][Optimization] change python dict to pytorch tensor (#4607 )	2024-05-06 21:30:27 -07:00
DearPlanet	4302987069	[Bugfix] Fix inappropriate content of model_name tag in Prometheus metrics (#3937 )	2024-05-04 15:39:34 -07:00
Michael Goin	2a052011ca	[Kernel] Support MoE Fp8 Checkpoints for Mixtral (Static Weights with Dynamic/Static Activations) (#4527 ) Follow on to #4332 to enable FP8 checkpoint loading for Mixtral and supersedes #4436. This PR enables the following checkpoint loading features for Mixtral: Supports loading fp8 checkpoints for Mixtral, such as this "nm-testing/Mixtral-8x7B-Instruct-v0.1-FP8" test model Supports static or dynamic activation quantization with static weight quantization (all per tensor) Supports different scales for each expert weight Supports Fp8 in QKV layer Notes: The Expert Gate/Router always runs at half / full precision for now. If there are different weight scales between QKV layer (for separate QKV weights), they are re-quantized using layer.weight_scale.max() so we can have a single gemm for performance.	2024-05-04 11:45:16 -07:00
Cody Yu	bc8ad68455	[Misc][Refactor] Introduce ExecuteModelData (#4540 )	2024-05-03 17:47:07 -07:00
Cade Daniel	ab50275111	[Speculative decoding] Support target-model logprobs (#4378 )	2024-05-03 15:52:01 -07:00
Lily Liu	43c413ec57	[Kernel] Use flashinfer for decoding (#4353 ) Co-authored-by: LiuXiaoxuanPKU <llilyliupku@gmail.com>	2024-05-03 15:51:27 -07:00
Sebastian Schoennenbeck	f8e7adda21	Fix/async chat serving (#2727 )	2024-05-03 11:04:14 -07:00
SangBin Cho	3521ba4f25	[Core][Model runner refactoring 1/N] Refactor attn metadata term (#4518 )	2024-05-03 10:20:12 -07:00
youkaichao	344a5d0c33	[Core][Distributed] enable allreduce for multiple tp groups (#4566 )	2024-05-02 17:32:33 -07:00

1 2 3 4 5 ...

299 Commits