squall/vllm

Author	SHA1	Message	Date
youkaichao	f5bb85b435	[Core][Distributed] improve p2p cache generation (#5528 )	2024-06-14 14:47:45 -07:00
Woosuk Kwon	28c145eb57	[Bugfix] Fix typo in Pallas backend (#5558 )	2024-06-14 14:40:09 -07:00
Thomas Parnell	e2afb03c92	[Bugfix] Enable loading FP8 checkpoints for gpt_bigcode models (#5460 ) Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>	2024-06-14 20:28:11 +00:00
Sanger Steel	6e2527a7cb	[Doc] Update documentation on Tensorizer (#5471 )	2024-06-14 11:27:57 -07:00
Simon Mo	cdab68dcdb	[Docs] Add ZhenFund as a Sponsor (#5548 )	2024-06-14 11:17:21 -07:00
youkaichao	d1c3d7d139	[misc][distributed] fix benign error in `is_in_the_same_node` (#5512 )	2024-06-14 10:59:28 -07:00
Cyrus Leung	77490c6f2f	[Core] Remove duplicate processing in async engine (#5525 )	2024-06-14 10:04:42 -07:00
youkaichao	48f589e18b	[mis] fix flaky test of test_cuda_device_count_stateless (#5546 )	2024-06-14 10:02:23 -07:00
Tyler Michael Smith	348616ac4b	[Kernel] Suppress mma.sp warning on CUDA 12.5 and later (#5401 )	2024-06-14 10:02:00 -07:00
Robert Shaw	15985680e2	[ Misc ] Rs/compressed tensors cleanup (#5432 ) Co-authored-by: mgoin <michael@neuralmagic.com> Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com>	2024-06-14 10:01:46 -07:00
Allen.Dou	d74674bbd9	[Misc] Fix arg names (#5524 )	2024-06-14 09:47:44 -07:00
Tyler Michael Smith	703475f6c2	[Kernel] Fix CUTLASS 3.x custom broadcast load epilogue (#5516 )	2024-06-14 09:30:15 -07:00
Cyrus Leung	d47af2bc02	[CI/Build] Disable LLaVA-NeXT CPU test (#5529 )	2024-06-14 09:27:30 -07:00
Kuntai Du	319ad7f1d3	[CI/Build][Misc] Add CI that benchmarks vllm performance on those PRs with `perf-benchmarks` label (#5073 ) Co-authored-by: simon-mo <simon.mo@hey.com>	2024-06-13 22:36:20 -07:00
Simon Mo	0f0d8bc065	bump version to v0.5.0.post1 (#5522 )	2024-06-13 19:42:06 -07:00
Allen.Dou	55d6361b13	[Misc] Fix arg names in quantizer script (#5507 )	2024-06-13 19:02:53 -07:00
Jie Fu (傅杰)	cd9c0d65d9	[Hardware][Intel] Support CPU inference with AVX2 ISA (#5452 )	2024-06-13 17:22:24 -06:00
Antoni Baum	50eed24d25	Add `cuda_device_count_stateless` (#5473 )	2024-06-13 16:06:49 -07:00
Tyler Michael Smith	e38042d4af	[Kernel] Disable CUTLASS kernels for fp8 (#5505 )	2024-06-13 13:38:05 -07:00
Tyler Michael Smith	33e3b37242	[CI/Build] Disable test_fp8.py (#5508 )	2024-06-13 13:37:48 -07:00
youkaichao	1696efe6c9	[misc] fix format.sh (#5511 )	2024-06-13 12:09:16 -07:00
Antoni Baum	6b0511a57b	Revert "[Core] Remove unnecessary copies in flash attn backend" (#5478 )	2024-06-13 11:22:50 -07:00
Antoni Baum	a8fda4f661	Seperate dev requirements into lint and test (#5474 )	2024-06-13 11:22:41 -07:00
Cody Yu	30299a41fa	[MISC] Remove FP8 warning (#5472 ) Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>	2024-06-13 11:22:30 -07:00
Tyler Michael Smith	85657b5607	[Kernel] Factor out epilogues from cutlass kernels (#5391 ) Co-authored-by: Michael Goin <michael@neuralmagic.com> Co-authored-by: youkaichao <youkaichao@gmail.com> Co-authored-by: zifeitong <zifei.tong@parasail.io> Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>	2024-06-13 11:22:19 -07:00
Cyrus Leung	0ce7b952f8	[Doc] Update LLaVA docs (#5437 ) Co-authored-by: Roger Wang <ywang@roblox.com>	2024-06-13 11:22:07 -07:00
Cyrus Leung	39873476f8	[CI/Build] Simplify OpenAI server setup in tests (#5100 )	2024-06-13 11:21:53 -07:00
Cyrus Leung	03dccc886e	[Misc] Add vLLM version getter to utils (#5098 )	2024-06-13 11:21:39 -07:00
Woosuk Kwon	a65634d3ae	[Docs] Add 4th meetup slides (#5509 )	2024-06-13 10:18:26 -07:00
Li, Jiang	80aa7e91fc	[Hardware][Intel] Optimize CPU backend and add more performance tips (#4971 ) Co-authored-by: Jianan Gu <jianan.gu@intel.com>	2024-06-13 09:33:14 -07:00
wenyujin333	bd43973522	[Kernel] Tune Qwen2MoE kernel configurations with tp2,4 (#5497 ) Tune Qwen2-57B-A14B configs based on #4921. Throughput performance command: `python benchmarks/benchmark_throughput.py --model=Qwen/Qwen2-57B-A14B-Instruct --input-len 1000 --output-len 50 -tp 2`. A100 GPU benchmark (no config vs. w/ PR): tp=2: 10.53 requests/s, 11058.17 tokens/s vs. 12.47 requests/s, 13088.57 tokens/s; tp=4: 17.77 requests/s, 18662.95 tokens/s vs. 20.20 requests/s, 21212.32 tokens/s	2024-06-13 09:01:10 -07:00
Michael Goin	23ec72fa03	[CI/Build][REDO] Add is_quant_method_supported to control quantization test configurations (#5466 )	2024-06-13 15:18:08 +00:00
Dipika Sikka	c2637a613b	[Kernel] `w4a16` support for `compressed-tensors` (#5385 ) Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>	2024-06-13 10:19:56 -04:00
Wang, Yi	88407532e7	[Bugfix]if the content is started with ":"(response of ping), client should i… (#5303 ) Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> Co-authored-by: Roger Wang <ywang@roblox.com>	2024-06-12 20:16:41 -07:00
Kevin H. Luu	916d219d62	[ci] Use sccache to build images (#5419 ) Signed-off-by: kevin <kevin@anyscale.com>	2024-06-12 17:58:12 -07:00
youkaichao	ea3890a5f0	[Core][Distributed] code deduplication in tp&pp with coordinator(#5293 ) [Core][Distributed] add coordinator to reduce code duplication in tp and pp (#5293)	2024-06-12 17:27:08 -07:00
Isotr0py	2135cacb45	[Bugfix] Fix wrong multi_modal_input format for CPU runner (#5451 )	2024-06-12 16:20:18 -07:00
Michael Goin	7d19de2e9c	[Frontend] Add "input speed" to tqdm postfix alongside output speed (#5425 )	2024-06-12 18:42:12 -04:00
Michael Goin	94a07bbdd8	[Bugfix] Fix typo in scheduler.py (requeset -> request) (#5470 )	2024-06-12 21:59:44 +00:00
Cyrus Leung	b8d4dfff9c	[Doc] Update debug docs (#5438 )	2024-06-12 14:49:31 -07:00
youkaichao	622d45128c	[misc] add hint for AttributeError (#5462 )	2024-06-12 21:46:35 +00:00
Travis Johnson	51602eefd3	[Frontend] [Core] Support for sharded tensorized models (#4990 ) Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com> Co-authored-by: Sanger Steel <sangersteel@gmail.com> Co-authored-by: Roger Wang <ywang@roblox.com>	2024-06-12 14:13:52 -07:00
Arthur Kim	5cc50a531f	[Bugfix] TYPE_CHECKING for MultiModalData (#5444 )	2024-06-12 14:08:52 -07:00
Cody Yu	5985e3427d	[Kernel] Vectorized FP8 quantize kernel (#5396 ) Inspired by #5146, this PR improves the FP8 quantize kernel by vectorizing data transfer to better utilize memory bandwidth. A microbenchmark shows that the improved kernel can achieve a 1.0x-1.5x speedup (especially when the hidden size is large). In detail, we applied 3 optimizations: (1) Use inverted scale so that most divisions are changed to multiplications. (2) Unroll the loop by 4 times to improve ILP. (3) Use vectorized 4 to transfer data between HBM and SRAM.	2024-06-12 14:07:26 -07:00
Kevin H. Luu	8b82a89997	[ci] Add AMD, Neuron, Intel tests for AWS CI and turn off default soft fail for GPU tests (#5464 ) Signed-off-by: kevin <kevin@anyscale.com>	2024-06-12 14:00:18 -07:00
Li, Jiang	c3c2903e72	[Bugfix] Add device assertion to TorchSDPA (#5402 )	2024-06-12 12:58:53 -07:00
Woosuk Kwon	1a8bfd92d5	[Hardware] Initial TPU integration (#5292 )	2024-06-12 11:53:03 -07:00
SangBin Cho	847cdcca1c	[CI] Upgrade codespell version. (#5381 )	2024-06-12 10:06:14 -07:00
Simon Mo	e3c12bf6d2	Revert "[CI/Build] Add `is_quant_method_supported` to control quantization test configurations" (#5463 )	2024-06-12 10:03:24 -07:00
Michael Goin	3dd6853bc8	[CI/Build] Add `is_quant_method_supported` to control quantization test configurations (#5253 )	2024-06-12 09:58:02 -07:00

1804 Commits