squall/vllm - vllm - Gitea: Git with a cup of tea

Author	SHA1	Message	Date
Alexander Matveev	6979ade384	Add GPTQ Marlin 2:4 sparse structured support (#4790 ) Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>	2024-05-16 12:56:15 -04:00
Pierre Dulac	9216b9cc38	[Bugfix] Bypass authorization API token for preflight requests (#4862 )	2024-05-16 09:42:21 -07:00
Alex Wu	5e0391c040	[Frontend] Separate OpenAI Batch Runner usage from API Server (#4851 )	2024-05-17 00:42:41 +09:00
Alex Wu	dbc0754ddf	[docs] Fix typo in examples filename openi -> openai (#4864 )	2024-05-17 00:42:17 +09:00
Jinzhen Lin	99caa49106	[Kernel] add bfloat16 support for gptq marlin kernel (#4788 )	2024-05-16 09:55:29 -04:00
alexm-nm	5c342570d7	Add marlin unit tests and marlin benchmark script (#4815 )	2024-05-16 09:36:49 -04:00
Cody Yu	973617ae02	[Speculative decoding][Re-take] Enable TP>1 speculative decoding (#4840 ) Co-authored-by: Cade Daniel <edacih@gmail.com> Co-authored-by: Cade Daniel <cade@anyscale.com>	2024-05-16 00:53:51 -07:00
Aurick Qiao	30e754390c	[Core] Implement sharded state loader (#4690 ) Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>	2024-05-15 22:11:54 -07:00
Alex Wu	52f8107cf2	[Frontend] Support OpenAI batch file format (#4794 ) Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>	2024-05-15 19:13:36 -04:00
Cyrus Leung	fc0d9dfc3a	[Frontend] Re-enable custom roles in Chat Completions API (#4758 )	2024-05-15 14:58:46 -07:00
Zhuohan Li	361c461a12	[Doc] Highlight the fourth meetup in the README (#4842 )	2024-05-15 11:38:49 -07:00
zifeitong	a5675d348b	[Bugfix] Properly set distributed_executor_backend in ParallelConfig (#4816 )	2024-05-15 07:22:09 -07:00
Cyrus Leung	e9cdd2b1e2	[CI/Build] Further decouple HuggingFace implementation from ours during tests (#4166 )	2024-05-14 23:38:40 -07:00
SangBin Cho	65bf2ac165	[Core][2/N] Model runner refactoring part 2. Combine prepare prefill / decode to a single API (#4681 ) This PR combines prepare_prompt and prepare_decode into a single API. This PR also coelsce the attn metadata for prefill/decode to a single class and allow to slice them when running attn backend. It also refactors subquery_start_loc which was not refactored in the previous PR	2024-05-15 14:00:10 +09:00
SangBin Cho	8a7cc254a0	Revert "[Kernel] Use flash-attn for decoding (#3648 )" (#4820 ) Lora 3 & 4 test seems to have illegal memory access failure after this commit; [2024-05-14 23:51:18,182 E 22 22] logging.cc:101: Unhandled exception: N3c105ErrorE. what(): CUDA error: an illegal memory access was encountered <br class="Apple-interchange-newline"> Exmaple: https://buildkite.com/vllm/ci/builds/7382#018f793d-1527-4e1c-ab59-c3a34ec55241 This reverts commit `1356df5`. FILL IN THE PR DESCRIPTION HERE FIX #xxxx (link existing issues this PR will resolve)	2024-05-15 11:52:45 +09:00
Simon Mo	29bc01bf3b	Add 4th meetup announcement to readme (#4817 )	2024-05-14 18:33:06 -04:00
Nick Hill	676a99982f	[Core] Add MultiprocessingGPUExecutor (#4539 ) Co-authored-by: SAHIL SUNEJA <suneja@us.ibm.com>	2024-05-14 10:38:59 -07:00
Cyrus Leung	dc72402b57	[Bugfix][Doc] Fix CI failure in docs (#4804 ) This PR fixes the CI failure introduced by #4798. The failure originates from having duplicate target names in reST, and is fixed by changing the ref targets to anonymous ones. For more information, see this discussion. I have also changed the format of the links to be more distinct from each other.	2024-05-15 01:57:08 +09:00
Kuntai Du	ccb63a8245	[Core][Hash][Automatic Prefix caching] Accelerating the hashing function by avoiding deep copies (#4696 )	2024-05-14 21:34:33 +09:00
Zhuohan Li	c579b750a0	[Doc] Add meetups to the doc (#4798 )	2024-05-13 18:48:00 -07:00
Cyrus Leung	4bfa7e7f75	[Doc] Add API reference for offline inference (#4710 )	2024-05-13 17:47:42 -07:00
Zhuohan Li	ac1fbf7fd2	[Doc] Shorten README by removing supported model list (#4796 )	2024-05-13 16:23:54 -07:00
Philipp Moritz	33d3914b1e	[Bugfix] Fix dynamic FP8 quantization for Mixtral (#4793 )	2024-05-13 19:00:27 -04:00
Stephen Krider	1356df53bd	[Kernel] Use flash-attn for decoding (#3648 ) Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> Co-authored-by: LiuXiaoxuanPKU <lilyliupku@gmail.com>	2024-05-13 15:50:33 -07:00
Cody Yu	ce532ff45c	[Speculative decoding] Improve n-gram efficiency (#4724 )	2024-05-13 15:00:13 -07:00
Sanger Steel	8bc68e198c	[Frontend] [Core] perf: Automatically detect vLLM-tensorized model, update `tensorizer` to version 2.9.0 (#4208 )	2024-05-13 14:57:07 -07:00
Woosuk Kwon	0fca3cdcf2	[Misc] Enhance attention selector (#4751 )	2024-05-13 10:47:25 -07:00
SangBin Cho	e7c46b9527	[Scheduler] Warning upon preemption and Swapping (#4647 ) Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>	2024-05-13 23:50:44 +09:00
Cyrus Leung	350f9e107f	[CI/Build] Move `test_utils.py` to `tests/utils.py` (#4425 ) Since #4335 was merged, I've noticed that the definition of ServerRunner in the tests is the same as in the test for OpenAI API. I have moved the class to the test utilities to avoid code duplication. (Although it only has been repeated twice so far, I will add another similar test suite in #4200 which would duplicate the code a third time) Also, I have moved the test utilities file (test_utils.py) to under the test directory (tests/utils.py), since none of its code is actually used in the main package. Note that I have added __init__.py to each test subpackage and updated the ray.init() call in the test utilities file in order to relative import tests/utils.py.	2024-05-13 23:50:09 +09:00
youkaichao	702bee461f	[Core][Distributed] refactor custom allreduce to support multiple tp groups (#4754 )	2024-05-12 17:47:59 -07:00
Swapnil Parekh	a7be4d0072	[CORE] Improvement in ranks code (#4718 )	2024-05-12 17:47:47 -07:00
Robert Shaw	a709e87a4f	[CI/Build] Tweak Marlin Nondeterminism Issues (#4713 )	2024-05-12 17:46:31 -07:00
Yikang Shen	6eaccb7353	[Model] Add support for IBM Granite Code models (#4636 )	2024-05-11 21:27:24 -07:00
Chang Su	e254497b66	[Model][Misc] Add e5-mistral-7b-instruct and Embedding API (#3734 )	2024-05-11 11:30:37 -07:00
youkaichao	4e12131089	[Core][Test] fix function name typo in custom allreduce (#4750 )	2024-05-10 15:14:40 -07:00
Robert Shaw	fcc2994be6	[CI] Nits for bad initialization of SeqGroup in testing (#4748 )	2024-05-10 18:01:01 -04:00
heeju-kim2	2e7796f2cf	[Speculative decoding] CUDA graph support (#4295 ) Co-authored-by: Cade Daniel <edacih@gmail.com>	2024-05-10 17:36:25 +00:00
Allen.Dou	706588a77d	[Bugfix] Fix CLI arguments in OpenAI server docs (#4729 )	2024-05-11 00:00:56 +09:00
SangBin Cho	6a0f617210	[Core] Fix circular reference which leaked llm instance in local dev env (#4737 ) Storing exception frame is extremely prone to circular refernece because it contains the reference to objects. When tensorizer is not installed, it leaks llm instance because error frame has references to various modules which cause circular reference problem. I also found spec decoding has a circular reference issue, and I solved it using weakref.proxy.	2024-05-10 23:54:32 +09:00
Steve Grubb	dac6a3f6ed	[Misc] Apply a couple g++ cleanups (#4719 )	2024-05-10 13:37:05 +00:00
Kunshang Ji	64b77dfd7e	[Core]fix type annotation for `swap_blocks` (#4726 )	2024-05-10 21:52:48 +09:00
Simon Mo	51d4094fda	chunked-prefill-doc-syntax (#4603 ) Fix the docs: https://docs.vllm.ai/en/latest/models/performance.html Co-authored-by: sang <rkooo567@gmail.com>	2024-05-10 14:13:23 +09:00
Allen.Dou	e965d46184	[Misc] Keep only one implementation of the create_dummy_prompt function. (#4716 )	2024-05-09 21:42:38 -07:00
youkaichao	208b71bcc1	[Core][Distributed] refactor pynccl (#4591 ) [Core][Distributed] refactor pynccl to hold multiple communicators (#4591)	2024-05-09 19:48:43 -07:00
Cody Yu	c833101740	[Kernel] Refactor FP8 kv-cache with NVIDIA float8_e4m3 support (#4535 )	2024-05-09 18:04:17 -06:00
Philipp Moritz	379da6dcb5	[Kernel] [FP8] Improve FP8 linear layer performance (#4691 ) This PR improves the FP8 performance of linear layers, which had been lacking before (#4118 (comment) and #4118 (comment)). We noticed that CUBLASLt can find a better algorithm if the first dimension of the matrix is greater than 16. So this PR enlarges matrices appropriately during quantization. This improves FP8 performance and removes the performance regression vs. FP16, in many cases exceeding FP16 performance. Here are benchmarks on llama3 70b (ITL numbers for 1000 input and 50 output tokens at fixed qps and at TP 4), all FP8 measurements are for dynamic quantization: qps = 1: 24 ms (FP8, this PR), 32 ms (FP8, previous main), 26 ms (FP16) qps = 2: 26 ms (FP8, this PR), 34ms (FP8, previous main), 28 ms (FP16) qps = 4: 33 ms (FP8, this PR), 44 ms (FP8, previous main), 36 ms (FP16) qps = 6: 46 ms (FP8, this PR), 56 ms (FP8, previous main), 54 ms (FP16) qps = 8: 85 ms (FP8, this PR), 85 ms (FP8, previous main), 138 ms (FP16)	2024-05-09 16:38:07 -07:00
Hao Zhang	ebce310b74	[Model] Snowflake arctic model implementation (#4652 ) Co-authored-by: Dash Desai <1723932+iamontheinet@users.noreply.github.com> Co-authored-by: Aurick Qiao <qiao@aurick.net> Co-authored-by: Aurick Qiao <aurick.qiao@snowflake.com> Co-authored-by: Aurick Qiao <aurickq@users.noreply.github.com> Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>	2024-05-09 22:37:14 +00:00
Michael Goin	be0c5180ac	[Bugfix] Add logs for all model dtype casting (#4717 )	2024-05-09 18:36:25 +00:00
Robert Shaw	cea64430f6	[Bugfix] Update grafana.json (#4711 )	2024-05-09 10:10:13 -07:00
Cyrus Leung	a3c124570a	[Bugfix] Fix CLI arguments in OpenAI server docs (#4709 )	2024-05-09 09:53:14 -07:00

1 2 3 4 5 ...

1353 Commits