squall/vllm - vllm - Gitea

Author	SHA1	Message	Date
Avshalom Manevich	12a59959ed	[Bugfix] adding chunking mechanism to fused_moe to handle large inputs (#6029)	2024-07-01 21:08:29 +00:00
sroy745	80ca1e6a3a	[Speculative Decoding 2/2 ] Integrate typical acceptance sampler into Spec Decode Worker (#5348)	2024-07-01 00:33:05 -07:00
youkaichao	614aa51203	[misc][cuda] use nvml to avoid accidentally cuda initialization (#6007)	2024-06-30 20:07:34 -07:00
Robert Shaw	af9ad46fca	[ Misc ] Refactor w8a8 to use `process_weights_after_load` (Simplify Weight Loading) (#5940) Co-authored-by: Robert Shaw <rshaw@neuralmagic>	2024-06-30 23:06:27 +00:00
SangBin Cho	f5e73c9f1b	[Lora] Use safetensor keys instead of adapter_config.json to find unexpected modules. (#5909) Co-authored-by: sang <sangcho@anyscale.com>	2024-06-30 17:11:15 +00:00
llmpros	c6c240aa0a	[Frontend]: Support base64 embedding (#5935) Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>	2024-06-30 23:53:00 +08:00
youkaichao	2be6955a3f	[ci][distributed] fix device count call; fix some cuda init that makes it necessary to use spawn (#5991)	2024-06-30 08:06:13 +00:00
Cyrus Leung	9d47f64eb6	[CI/Build] [3/3] Reorganize entrypoints tests (#5966)	2024-06-30 12:58:49 +08:00
Cyrus Leung	cff6a1fec1	[CI/Build] Reuse code for checking output consistency (#5988)	2024-06-30 11:44:25 +08:00
Matt Wong	9def10664e	[Bugfix][CI/Build][Hardware][AMD] Install matching torchvision to fix AMD tests (#5949)	2024-06-29 12:47:58 -07:00
Cyrus Leung	99397da534	[CI/Build] Add TP test for vision models (#5892)	2024-06-29 15:45:54 +00:00
Robert Shaw	8dbfcd35bf	[ CI/Build ] Added E2E Test For Compressed Tensors (#5839) Co-authored-by: Michael Goin <michael@neuralmagic.com> Co-authored-by: Robert Shaw <rshaw@neuralmagic>	2024-06-29 21:12:58 +08:00
Cyrus Leung	51e971d39e	[Bugfix] Support `eos_token_id` from `config.json` (#5954)	2024-06-29 11:19:02 +00:00
Woosuk Kwon	580353da93	[Bugfix] Fix precisions in Gemma 1 (#5913)	2024-06-29 03:10:21 +00:00
Joe Runde	ba4994443a	[Kernel] Add punica dimensions for Granite 3b and 8b (#5930) Signed-off-by: Joe Runde <joe@joerun.de>	2024-06-29 10:48:25 +08:00
William Lin	906a19cdb0	[Misc] Extend vLLM Metrics logging API (#5925) Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>	2024-06-29 10:36:06 +08:00
Lily Liu	7041de4384	[Kernel] Flashinfer for prefill & decode, with Cudagraph support for decode (#4628) Co-authored-by: LiuXiaoxuanPKU <llilyliupku@gmail.com>, bong-furiosa <bongwon.jang@furiosa.ai>	2024-06-28 15:28:49 -07:00
Tyler Michael Smith	6a2d659d28	[Bugfix] Fix compute datatype for cutlass 3.x epilogues (#5931)	2024-06-28 17:10:34 +00:00
Cody Yu	b2c620230a	[Spec Decode] Introduce DraftModelRunner (#5799)	2024-06-28 09:17:51 -07:00
xwjiang2010	b90d8cd832	[Distributed] Make it clear that % should not be in tensor dict keys. (#5927) Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com>	2024-06-28 15:20:22 +00:00
Cyrus Leung	3b752a6555	[CI/Build] [2/3] Reorganize entrypoints tests (#5904)	2024-06-28 07:59:18 -07:00
Ilya Lavrenov	57f09a419c	[Hardware][Intel] OpenVINO vLLM backend (#5379)	2024-06-28 13:50:16 +00:00
Cyrus Leung	5cbe8d155c	[Core] Registry for processing model inputs (#5214) Co-authored-by: ywang96 <ywang@roblox.com>	2024-06-28 12:09:56 +00:00
Roger Wang	736ed38849	[CI/Build] Fix Args for `_get_logits_warper` in Sampler Test (#5922)	2024-06-27 11:43:04 -07:00
Cyrus Leung	e9d32d077d	[CI/Build] [1/3] Reorganize entrypoints tests (#5526)	2024-06-27 12:43:17 +00:00
xwjiang2010	d12af207d2	[VLM][Bugfix] Make sure that `multi_modal_kwargs` is broadcasted properly (#5880) Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com>	2024-06-27 15:15:24 +08:00
sasha0552	c54269d967	[Frontend] Add tokenize/detokenize endpoints (#5054)	2024-06-26 16:54:22 +00:00
Luka Govedič	5bfd1bbc98	[Kernel] Adding bias epilogue support for `cutlass_scaled_mm` (#5560) Co-authored-by: Chih-Chieh-Yang <7364402+cyang49@users.noreply.github.com> Co-authored-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>	2024-06-26 15:16:00 +00:00
Cyrus Leung	6984c02a27	[CI/Build] Refactor image test assets (#5821)	2024-06-26 01:02:34 -07:00
youkaichao	515080ad2f	[bugfix][distributed] fix shm broadcast when the queue size is full (#5801)	2024-06-25 21:56:02 -07:00
Stephanie Wang	dda4811591	[Core] Refactor Worker and ModelRunner to consolidate control plane communication (#5408) Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu> Signed-off-by: Stephanie <swang@anyscale.com> Co-authored-by: Stephanie <swang@anyscale.com>	2024-06-25 20:30:03 -07:00
Thomas Parnell	c2a8ac75e0	[CI/Build] Add E2E tests for MLPSpeculator (#5791) Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>	2024-06-26 00:04:08 +00:00
Matt Wong	dd793d1de5	[Hardware][AMD][CI/Build][Doc] Upgrade to ROCm 6.1, Dockerfile improvements, test fixes (#5422)	2024-06-25 15:56:15 -07:00
Dipika Sikka	dd248f7675	[Misc] Update `w4a16` `compressed-tensors` support to include `w8a16` (#5794)	2024-06-25 19:23:35 +00:00
Michael Goin	d9b34baedd	[CI/Build] Add unit testing for FlexibleArgumentParser (#5798)	2024-06-25 12:18:03 -07:00
Antoni Baum	67882dbb44	[Core] Add fault tolerance for `RayTokenizerGroupPool` (#5748)	2024-06-25 10:15:10 -07:00
Woo-Yeon Lee	2ce5d6688b	[Speculative Decoding] Support draft model on different tensor-parallel size than target model (#5414)	2024-06-25 09:56:06 +00:00
Isotr0py	edd5fe5fa2	[Bugfix] Add phi3v resize for dynamic shape and fix torchvision requirement (#5772)	2024-06-24 12:11:53 +08:00
Murali Andoorveedu	5d4d90536f	[Distributed] Add send and recv helpers (#5719)	2024-06-23 14:42:28 -07:00
rohithkrn	f5dda63eb5	[LoRA] Add support for pinning lora adapters in the LRU cache (#5603)	2024-06-21 15:42:46 -07:00
youkaichao	d9a252bc8e	[Core][Distributed] add shm broadcast (#5399) Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>	2024-06-21 05:12:35 +00:00
Jee Li	67005a07bc	[Bugfix] Add fully sharded layer for QKVParallelLinearWithLora (#5665) Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>	2024-06-21 04:46:28 +00:00
Chang Su	c35e4a3dd7	[BugFix] Fix test_phi3v.py (#5725)	2024-06-21 04:45:34 +00:00
Jinzhen Lin	1f5674218f	[Kernel] Add punica dimension for Qwen2 LoRA (#5441)	2024-06-20 17:55:41 -07:00
Joshua Rosenkranz	b12518d3cf	[Model] MLPSpeculator speculative decoding support (#4947) Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com> Co-authored-by: Thomas Parnell <tpa@zurich.ibm.com> Co-authored-by: Nick Hill <nickhill@us.ibm.com> Co-authored-by: Davis Wertheimer <Davis.Wertheimer@ibm.com>	2024-06-20 20:23:12 -04:00
Michael Goin	8065a7e220	[Frontend] Add FlexibleArgumentParser to support both underscore and dash in names (#5718)	2024-06-20 17:00:13 -06:00
Cyrus Leung	3730a1c832	[Misc] Improve conftest (#5681)	2024-06-19 19:09:21 -07:00
Dipika Sikka	4a30d7e3cc	[Misc] Add per channel support for static activation quantization; update w8a8 schemes to share base classes (#5650)	2024-06-19 18:06:44 -04:00
zifeitong	78687504f7	[Bugfix] AsyncLLMEngine hangs with asyncio.run (#5654)	2024-06-19 13:57:12 -07:00
youkaichao	d571ca0108	[ci][distributed] add tests for custom allreduce (#5689)	2024-06-19 20:16:04 +00:00
Thomas Parnell	e5150f2c28	[Bugfix] Added test for sampling repetition penalty bug. (#5659) Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>	2024-06-19 06:03:55 +00:00
sergey-tinkoff	07feecde1a	[Model] LoRA support added for command-r (#5178)	2024-06-18 11:01:21 -07:00
Dipika Sikka	95db455e7f	[Misc] Add channel-wise quantization support for w8a8 dynamic per token activation quantization (#5542)	2024-06-18 12:45:05 -04:00
Ronen Schaffer	7879f24dcc	[Misc] Add OpenTelemetry support (#4687) This PR adds basic support for OpenTelemetry distributed tracing. It includes changes to enable tracing functionality and improve monitoring capabilities. I've also added a Markdown guide with screenshots showing users how to use this feature.	2024-06-19 01:17:03 +09:00
Roger Wang	4ad7b53e59	[CI/Build][Misc] Update Pytest Marker for VLMs (#5623)	2024-06-18 13:10:04 +00:00
Joe Runde	5002175e80	[Kernel] Add punica dimensions for Granite 13b (#5559) Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>	2024-06-18 03:54:11 +00:00
Isotr0py	daef218b55	[Model] Initialize Phi-3-vision support (#4986)	2024-06-17 19:34:33 -07:00
sroy745	fa9e385229	[Speculative Decoding 1/2 ] Add typical acceptance sampling as one of the sampling techniques in the verifier (#5131)	2024-06-17 21:29:09 -05:00
Dipika Sikka	890d8d960b	[Kernel] `compressed-tensors` marlin 24 support (#5435)	2024-06-17 12:32:48 -04:00
Michael Goin	4a6769053a	[CI][BugFix] Flip is_quant_method_supported condition (#5577)	2024-06-16 14:07:34 +00:00
Alexander Matveev	d919ecc771	add gptq_marlin test for bug report https://github.com/vllm-project/vllm/issues/5088 (#5145)	2024-06-15 13:38:16 -04:00
Cyrus Leung	81fbb3655f	[CI/Build] Test both text and token IDs in batched OpenAI Completions API (#5568)	2024-06-15 07:29:42 -04:00
Cyrus Leung	0e9164b40a	[mypy] Enable type checking for test directory (#5017)	2024-06-15 04:45:31 +00:00
leiwen83	1b8a0d71cf	[Core][Bugfix]: fix prefix caching for blockv2 (#5364) Signed-off-by: Lei Wen <wenlei03@qiyi.com> Co-authored-by: Lei Wen <wenlei03@qiyi.com>	2024-06-14 17:23:56 -07:00
youkaichao	48f589e18b	[mis] fix flaky test of test_cuda_device_count_stateless (#5546)	2024-06-14 10:02:23 -07:00
Antoni Baum	50eed24d25	Add `cuda_device_count_stateless` (#5473)	2024-06-13 16:06:49 -07:00
Tyler Michael Smith	33e3b37242	[CI/Build] Disable test_fp8.py (#5508)	2024-06-13 13:37:48 -07:00
Tyler Michael Smith	85657b5607	[Kernel] Factor out epilogues from cutlass kernels (#5391) Co-authored-by: Michael Goin <michael@neuralmagic.com> Co-authored-by: youkaichao <youkaichao@gmail.com> Co-authored-by: zifeitong <zifei.tong@parasail.io> Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>	2024-06-13 11:22:19 -07:00
Cyrus Leung	39873476f8	[CI/Build] Simplify OpenAI server setup in tests (#5100)	2024-06-13 11:21:53 -07:00
Michael Goin	23ec72fa03	[CI/Build][REDO] Add is_quant_method_supported to control quantization test configurations (#5466)	2024-06-13 15:18:08 +00:00
Dipika Sikka	c2637a613b	[Kernel] `w4a16` support for `compressed-tensors` (#5385) Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>	2024-06-13 10:19:56 -04:00
youkaichao	ea3890a5f0	[Core][Distributed] add coordinator to reduce code duplication in tp and pp (#5293)	2024-06-12 17:27:08 -07:00
Travis Johnson	51602eefd3	[Frontend] [Core] Support for sharded tensorized models (#4990) Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com> Co-authored-by: Sanger Steel <sangersteel@gmail.com> Co-authored-by: Roger Wang <ywang@roblox.com>	2024-06-12 14:13:52 -07:00
Cody Yu	5985e3427d	[Kernel] Vectorized FP8 quantize kernel (#5396) Inspired by #5146, this PR improves the FP8 quantize kernel by vectorizing data transfer to better utilize memory bandwidth. Microbenchmarks show that the improved kernel achieves a 1.0x-1.5x speedup (especially when the hidden size is large). In detail, three optimizations are applied: (1) use an inverted scale so that most divisions become multiplications; (2) unroll the loop 4 times to improve ILP; (3) use 4-wide vectorized accesses to transfer data between HBM and SRAM.	2024-06-12 14:07:26 -07:00
SangBin Cho	847cdcca1c	[CI] Upgrade codespell version. (#5381)	2024-06-12 10:06:14 -07:00
Simon Mo	e3c12bf6d2	Revert "[CI/Build] Add `is_quant_method_supported` to control quantization test configurations" (#5463)	2024-06-12 10:03:24 -07:00
Michael Goin	3dd6853bc8	[CI/Build] Add `is_quant_method_supported` to control quantization test configurations (#5253)	2024-06-12 09:58:02 -07:00
Nick Hill	99dac099ab	[Core][Doc] Default to multiprocessing for single-node distributed case (#5230) Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>	2024-06-11 11:10:41 -07:00
youkaichao	c4bd03c7c5	[Core][Distributed] add same-node detection (#5369)	2024-06-11 10:53:59 -07:00
sasha0552	dcbf4286af	[Frontend] Customizable RoPE theta (#5197)	2024-06-11 10:42:26 -07:00
Cyrus Leung	640052b069	[Bugfix][Frontend] Cleanup "fix chat logprobs" (#5026)	2024-06-10 22:36:46 -07:00
maor-ps	351d5e7b82	[Bugfix] OpenAI entrypoint limits logprobs while ignoring server defined --max-logprobs (#5312) Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>	2024-06-11 10:30:31 +08:00
Itay Etelis	774d1035e4	[Feature][Frontend]: Continued `stream_options` implementation also in CompletionRequest (#5319)	2024-06-10 14:22:09 +00:00
Cyrus Leung	6b29d6fe70	[Model] Initial support for LLaVA-NeXT (#4199) Co-authored-by: Roger Wang <ywang@roblox.com>	2024-06-10 12:47:15 +00:00
Cyrus Leung	0bfa1c4f13	[Misc] Improve error message when LoRA parsing fails (#5194)	2024-06-10 19:38:49 +08:00
Dipika Sikka	5884c2b454	[Misc] Update to comply with the new `compressed-tensors` config (#5350) Co-authored-by: Michael Goin <michael@neuralmagic.com>	2024-06-10 03:49:46 +00:00
bnellnm	5467ac3196	[Kernel][Misc] Use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops (#5047)	2024-06-09 16:23:30 -04:00
youkaichao	5d7e3d0176	[mis][ci/test] fix flaky test in tests/test_sharded_state_loader.py (#5361)	2024-06-09 03:50:14 +00:00
youkaichao	8ea5e44a43	[CI/Test] improve robustness of test by replacing del with context manager (vllm_runner) (#5357)	2024-06-08 08:59:20 +00:00
youkaichao	9fb900f90c	[CI/Test] improve robustness of test by replacing del with context manager (hf_runner) (#5347)	2024-06-07 22:31:32 -07:00
Roger Wang	7a9cb294ae	[Frontend] Add OpenAI Vision API Support (#5237) Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>	2024-06-07 11:23:32 -07:00
Dipika Sikka	ca3ea51bde	[Kernel] Dynamic Per-Token Activation Quantization (#5037) Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com> Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>	2024-06-07 09:36:26 -07:00
youkaichao	388596c914	[Misc][Utils] allow get_open_port to be called for multiple times (#5333)	2024-06-06 22:15:11 -07:00
Itay Etelis	baa15a9ec3	[Feature][Frontend]: Add support for `stream_options` in `ChatCompletionRequest` (#5135)	2024-06-07 03:29:24 +00:00
Antoni Baum	ccdc490dda	[Core] Change LoRA embedding sharding to support loading methods (#5038)	2024-06-06 19:07:57 -07:00
Matthew Goldey	828da0d44e	[Frontend] enable passing multiple LoRA adapters at once to generate() (#5300)	2024-06-06 15:48:13 -05:00
liuyhwangyh	4efff036f0	Bugfix: fix broken of download models from modelscope (#5233) Co-authored-by: mulin.lyh <mulin.lyh@taobao.com>	2024-06-06 09:28:10 -07:00
Cyrus Leung	89c920785f	[CI/Build] Update vision tests (#5307)	2024-06-06 05:17:18 -05:00
Breno Faria	7b0a0dfb22	[Frontend][Core] Update Outlines Integration from `FSM` to `Guide` (#4109) Co-authored-by: Simon Mo <simon.mo@hey.com> Co-authored-by: Breno Faria <breno.faria@intrafind.com>	2024-06-05 16:49:12 -07:00
Nick Hill	faf71bcd4b	[Speculative Decoding] Add `ProposerWorkerBase` abstract class (#5252)	2024-06-05 14:53:05 -07:00

497 Commits