Isotr0py
2135cacb45
[Bugfix] Fix wrong multi_modal_input format for CPU runner ( #5451 )
2024-06-12 16:20:18 -07:00
Michael Goin
7d19de2e9c
[Frontend] Add "input speed" to tqdm postfix alongside output speed ( #5425 )
2024-06-12 18:42:12 -04:00
Michael Goin
94a07bbdd8
[Bugfix] Fix typo in scheduler.py (requeset -> request) ( #5470 )
2024-06-12 21:59:44 +00:00
Cyrus Leung
b8d4dfff9c
[Doc] Update debug docs ( #5438 )
2024-06-12 14:49:31 -07:00
youkaichao
622d45128c
[misc] add hint for AttributeError ( #5462 )
2024-06-12 21:46:35 +00:00
Travis Johnson
51602eefd3
[Frontend] [Core] Support for sharded tensorized models ( #4990 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
Co-authored-by: Sanger Steel <sangersteel@gmail.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-06-12 14:13:52 -07:00
Arthur Kim
5cc50a531f
[Bugfix] TYPE_CHECKING for MultiModalData ( #5444 )
2024-06-12 14:08:52 -07:00
Cody Yu
5985e3427d
[Kernel] Vectorized FP8 quantize kernel ( #5396 )
...
Inspired by #5146, this PR improves the FP8 quantize kernel by vectorizing data transfers to better utilize memory bandwidth. Microbenchmarks show that the improved kernel achieves a 1.0x-1.5x speedup (especially when the hidden size is large).
In detail, we applied three optimizations:
- Use inverted scale so that most divisions are changed to multiplications.
- Unroll the loop 4 times to improve ILP.
- Use vectorized 4-element loads and stores to transfer data between HBM and SRAM.
2024-06-12 14:07:26 -07:00
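To illustrate the inverted-scale optimization described in #5396 above, here is a minimal PyTorch sketch; the names are hypothetical, and the real kernel is written in CUDA, where it additionally unrolls the loop 4x and uses vectorized loads/stores, which this Python form cannot express:

```python
import torch

FP8_E4M3_MAX = 448.0  # largest representable value in the e4m3 FP8 format

def quantize_fp8_sketch(x: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Hypothetical sketch of per-tensor FP8 quantization with an inverted scale.

    Dividing every element by `scale` is replaced by one reciprocal plus
    per-element multiplies, mirroring the first optimization listed in #5396.
    """
    inv_scale = 1.0 / scale                          # one division in total
    y = (x * inv_scale).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX)
    return y.to(torch.float8_e4m3fn)                 # requires PyTorch >= 2.1
```

The speedup reported above comes from the memory-bandwidth-oriented CUDA implementation, not from this Python form, which only shows the arithmetic rearrangement.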
Kevin H. Luu
8b82a89997
[ci] Add AMD, Neuron, Intel tests for AWS CI and turn off default soft fail for GPU tests ( #5464 )
...
Signed-off-by: kevin <kevin@anyscale.com>
2024-06-12 14:00:18 -07:00
Li, Jiang
c3c2903e72
[Bugfix] Add device assertion to TorchSDPA ( #5402 )
2024-06-12 12:58:53 -07:00
Woosuk Kwon
1a8bfd92d5
[Hardware] Initial TPU integration ( #5292 )
2024-06-12 11:53:03 -07:00
SangBin Cho
847cdcca1c
[CI] Upgrade codespell version. ( #5381 )
2024-06-12 10:06:14 -07:00
Simon Mo
e3c12bf6d2
Revert "[CI/Build] Add is_quant_method_supported to control quantization test configurations" ( #5463 )
2024-06-12 10:03:24 -07:00
Michael Goin
3dd6853bc8
[CI/Build] Add is_quant_method_supported to control quantization test configurations ( #5253 )
2024-06-12 09:58:02 -07:00
youkaichao
8f89d72090
[Doc] add common case for long waiting time ( #5430 )
2024-06-11 11:12:13 -07:00
Nick Hill
99dac099ab
[Core][Doc] Default to multiprocessing for single-node distributed case ( #5230 )
...
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2024-06-11 11:10:41 -07:00
youkaichao
c4bd03c7c5
[Core][Distributed] add same-node detection ( #5369 )
2024-06-11 10:53:59 -07:00
sasha0552
dcbf4286af
[Frontend] Customizable RoPE theta ( #5197 )
2024-06-11 10:42:26 -07:00
Ali Panahi
00e6a2dc53
[Bugfix] fix lora_dtype value type in arg_utils.py ( #5398 )
2024-06-11 10:40:23 -07:00
Junichi Sato
2e02311a1b
[Bugfix] Fix MultiprocessingGPUExecutor.check_health when world_size == 1 ( #5254 )
2024-06-11 10:38:07 -07:00
Cade Daniel
89ec06c33b
[Docs] [Spec decode] Fix docs error in code example ( #5427 )
2024-06-11 10:31:56 -07:00
Kuntai Du
9fde251bf0
[Doc] Add an automatic prefix caching section in vllm documentation ( #5324 )
...
Co-authored-by: simon-mo <simon.mo@hey.com>
2024-06-11 10:24:59 -07:00
Cade Daniel
4c2ffb28ff
[Speculative decoding] Initial spec decode docs ( #5400 )
2024-06-11 10:15:40 -07:00
SangBin Cho
246598a6b1
[CI] docfix ( #5410 )
...
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: ywang96 <ywang@roblox.com>
2024-06-11 01:28:50 -07:00
Woosuk Kwon
8bab4959be
[Misc] Remove VLLM_BUILD_WITH_NEURON env variable ( #5389 )
2024-06-11 00:37:56 -07:00
Roger Wang
3c4cebf751
[Doc][Typo] Fixing Missing Comma ( #5403 )
2024-06-11 00:20:28 -07:00
youkaichao
d8f31f2f8b
[Doc] add debugging tips ( #5409 )
2024-06-10 23:21:43 -07:00
Cyrus Leung
640052b069
[Bugfix][Frontend] Cleanup "fix chat logprobs" ( #5026 )
2024-06-10 22:36:46 -07:00
maor-ps
351d5e7b82
[Bugfix] OpenAI entrypoint limits logprobs while ignoring server defined --max-logprobs ( #5312 )
...
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-06-11 10:30:31 +08:00
Nick Hill
a008629807
[Misc] Various simplifications and typing fixes ( #5368 )
2024-06-11 10:29:02 +08:00
Kevin H. Luu
76477a93b7
[ci] Fix Buildkite agent path ( #5392 )
...
Signed-off-by: kevin <kevin@anyscale.com>
2024-06-10 18:58:07 -07:00
Michael Goin
77c87beb06
[Doc] Add documentation for FP8 W8A8 ( #5388 )
2024-06-10 18:55:12 -06:00
Simon Mo
114332b88e
Bump version to v0.5.0 ( #5384 )
2024-06-10 15:56:06 -07:00
Woosuk Kwon
cb77ad836f
[Docs] Alphabetically sort sponsors ( #5386 )
2024-06-10 15:17:19 -05:00
Roger Wang
856c990041
[Docs] Add Docs on Limitations of VLM Support ( #5383 )
2024-06-10 09:53:50 -07:00
Kevin H. Luu
c5602f0baa
[ci] Mount buildkite agent on Docker container to upload benchmark results ( #5330 )
...
Signed-off-by: kevin <kevin@anyscale.com>
2024-06-10 09:22:34 -07:00
Kevin H. Luu
f7f9c5f97b
[ci] Use small_cpu_queue for doc build ( #5331 )
...
Signed-off-by: kevin <kevin@anyscale.com>
2024-06-10 09:21:11 -07:00
Cyrus Leung
2c0d933594
[Bugfix] Fix LLaVA-NeXT ( #5380 )
2024-06-10 15:38:47 +00:00
Itay Etelis
774d1035e4
[Feature][Frontend]: Continued stream_options implementation also in CompletionRequest ( #5319 )
2024-06-10 14:22:09 +00:00
Cyrus Leung
6b29d6fe70
[Model] Initial support for LLaVA-NeXT ( #4199 )
...
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-06-10 12:47:15 +00:00
Cyrus Leung
0bfa1c4f13
[Misc] Improve error message when LoRA parsing fails ( #5194 )
2024-06-10 19:38:49 +08:00
youkaichao
c81da5f56d
[misc][typo] fix typo ( #5372 )
2024-06-10 09:51:02 +00:00
Roger Wang
68bc81703e
[Frontend][Misc] Enforce Pixel Values as Input Type for VLMs in API Server ( #5374 )
2024-06-10 09:13:39 +00:00
Dipika Sikka
5884c2b454
[Misc] Update to comply with the new compressed-tensors config ( #5350 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-06-10 03:49:46 +00:00
Bla_ckB
45f92c00cf
[Bugfix] Fix KeyError: 1 When Using LoRA adapters ( #5164 )
2024-06-09 16:23:14 -07:00
bnellnm
5467ac3196
[Kernel][Misc] Use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops ( #5047 )
2024-06-09 16:23:30 -04:00
youkaichao
5d7e3d0176
[misc][ci/test] fix flaky test in test_sharded_state_loader.py ( #5361 )
...
[misc][ci/test] fix flaky test in tests/test_sharded_state_loader.py (#5361 )
2024-06-09 03:50:14 +00:00
youkaichao
0373e1837e
[Core][CUDA Graph] add output buffer for cudagraph ( #5074 )
...
[Core][CUDA Graph] add output buffer for cudagraph to reduce memory footprint (#5074 )
2024-06-08 19:14:43 -07:00
Michael Goin
c09dade2a2
[Misc][Breaking] Change FP8 checkpoint format from act_scale -> input_scale ( #5353 )
2024-06-08 13:54:05 -04:00
youkaichao
8ea5e44a43
[CI/Test] improve robustness of test (vllm_runner) ( #5357 )
...
[CI/Test] improve robustness of test by replacing del with context manager (vllm_runner) (#5357 )
2024-06-08 08:59:20 +00:00
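To illustrate the pattern referenced in #5357 above (replacing `del`-based cleanup with a context manager so resources are released even when a test fails), here is a minimal sketch; `FakeRunner` and its methods are hypothetical stand-ins rather than vLLM's actual `vllm_runner` fixture:

```python
from contextlib import contextmanager

class FakeRunner:
    """Hypothetical stand-in for a test runner holding GPU state."""
    def __init__(self, model: str):
        self.model = model       # imagine model weights loaded onto the GPU here

    def generate(self, prompt: str) -> str:
        return f"{self.model}: {prompt}"

    def close(self) -> None:
        pass                     # imagine freeing GPU memory here

@contextmanager
def runner(model: str):
    r = FakeRunner(model)
    try:
        yield r
    finally:
        # Runs even if the test body raises; relying on `del` only drops a
        # reference and gives no such guarantee about when cleanup happens.
        r.close()

def test_generate():
    with runner("facebook/opt-125m") as r:
        assert "hello" in r.generate("hello")
```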