daquexian
99ded1e1c4
[Doc] Remove comments incorrectly copied from another project ( #6286 )
2024-07-10 17:05:26 -07:00
Baoyuan Qi
d3a245138a
[Bugfix] Fix the 'and' logic check in needs_scalar_to_array ( #6238 )
...
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
2024-07-09 23:43:24 +00:00
Avshalom Manevich
f7a8fa39d8
[Kernel] reloading fused_moe config on the last chunk ( #6210 )
2024-07-08 08:00:38 -07:00
Robert Shaw
abfe705a02
[ Misc ] Support Fp8 via llm-compressor ( #6110 )
...
Co-authored-by: Robert Shaw <rshaw@neuralmagic>
2024-07-07 20:42:11 +00:00
Robert Shaw
62963d129e
[ Misc ] Clean Up CompressedTensorsW8A8 ( #6113 )
2024-07-03 22:50:08 +00:00
Michael Goin
47f0954af0
[Kernel] Expand FP8 support to Ampere GPUs using FP8 Marlin ( #5975 )
2024-07-03 17:38:00 +00:00
youkaichao
482045ee77
[hardware][misc] introduce platform abstraction ( #6080 )
2024-07-02 20:12:22 -07:00
Qubitium-ModelCloud
ee93f4f92a
[CORE] Quantized lm-head Framework ( #4442 )
...
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
Co-authored-by: ZX <zx@lbx.dev>
2024-07-02 22:25:17 +00:00
Robert Shaw
7c008c51a9
[ Misc ] Refactor MoE to isolate Fp8 from Mixtral ( #5970 )
...
Co-authored-by: Robert Shaw <rshaw@neuralmagic>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-07-02 21:54:35 +00:00
Avshalom Manevich
12a59959ed
[Bugfix] Add a chunking mechanism to fused_moe to handle large inputs ( #6029 )
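A hedged illustration of the chunking idea in this fix (CHUNK_SIZE, run_kernel, and the call shape are assumptions, not vLLM's actual code): split an input whose token count exceeds the kernel's batch limit into fixed-size chunks, run fused_moe on each, and concatenate the results.

    import torch

    CHUNK_SIZE = 32768  # assumed per-chunk token budget, kept under the kernel's limit

    def fused_moe_chunked(hidden_states: torch.Tensor, run_kernel) -> torch.Tensor:
        outputs = []
        for start in range(0, hidden_states.shape[0], CHUNK_SIZE):
            chunk = hidden_states[start:start + CHUNK_SIZE]
            outputs.append(run_kernel(chunk))  # fused_moe on one chunk
        return torch.cat(outputs, dim=0)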
2024-07-01 21:08:29 +00:00
sroy745
80ca1e6a3a
[Speculative Decoding 2/2 ] Integrate typical acceptance sampler into Spec Decode Worker ( #5348 )
2024-07-01 00:33:05 -07:00
youkaichao
614aa51203
[misc][cuda] Use NVML to avoid accidental CUDA initialization ( #6007 )
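A minimal sketch of the NVML approach this commit refers to, assuming the pynvml package: device properties are queried through NVML rather than torch.cuda, so no CUDA context is created as a side effect in the parent process.

    import pynvml

    def get_device_capability(device_id: int = 0) -> tuple:
        # Query compute capability via NVML instead of torch.cuda,
        # avoiding accidental CUDA initialization.
        pynvml.nvmlInit()
        try:
            handle = pynvml.nvmlDeviceGetHandleByIndex(device_id)
            return pynvml.nvmlDeviceGetCudaComputeCapability(handle)
        finally:
            pynvml.nvmlShutdown()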
2024-06-30 20:07:34 -07:00
Robert Shaw
af9ad46fca
[ Misc ] Refactor w8a8 to use process_weights_after_load (Simplify Weight Loading) ( #5940 )
...
Co-authored-by: Robert Shaw <rshaw@neuralmagic>
2024-06-30 23:06:27 +00:00
Dipika Sikka
7836fdcc11
[Misc] Fix get_min_capability ( #5971 )
2024-06-30 20:15:16 +00:00
Robert Shaw
8dbfcd35bf
[ CI/Build ] Added E2E Test For Compressed Tensors ( #5839 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com>
Co-authored-by: Robert Shaw <rshaw@neuralmagic>
2024-06-29 21:12:58 +08:00
Cody Yu
f7dac83d95
[Kernel] Raise an exception in MoE kernel if the batch size is larger than 65k ( #5939 )
2024-06-29 21:04:20 +08:00
wangding zeng
be0b3af9e0
Support Deepseek-V2 ( #4650 )
...
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
2024-06-28 13:24:57 -07:00
Robert Shaw
2cd402e169
[ Bugfix ] Enabling Loading Models With Fused QKV/MLP on Disk with FP8 ( #5921 )
...
Co-authored-by: Robert Shaw <rshaw@neuralmagic>
2024-06-28 18:43:49 +00:00
Robert Shaw
b185230744
[ Misc ] Remove fp8_shard_indexer from Col/Row Parallel Linear (Simplify Weight Loading) ( #5928 )
...
Co-authored-by: Robert Shaw <rshaw@neuralmagic>
2024-06-28 13:49:57 -04:00
Ilya Lavrenov
57f09a419c
[Hardware][Intel] OpenVINO vLLM backend ( #5379 )
2024-06-28 13:50:16 +00:00
Tyler Michael Smith
5932634409
Unmark fused_moe config json file as executable ( #5960 )
2024-06-28 06:36:12 -07:00
Divakar Verma
c3dde367f1
[Kernel][ROCm][AMD] fused_moe Triton configs v2 for mi300X ( #5932 )
2024-06-27 13:41:08 -07:00
Woosuk Kwon
79c92c7c8a
[Model] Add Gemma 2 ( #5908 )
2024-06-27 13:33:56 -07:00
Woosuk Kwon
6806998bf9
[Bugfix] Fix embedding to support 2D inputs ( #5829 )
2024-06-26 00:15:22 -07:00
Dipika Sikka
dd248f7675
[Misc] Update w4a16 compressed-tensors support to include w8a16 ( #5794 )
2024-06-25 19:23:35 +00:00
Roger Wang
bd620b01fb
[Kernel][CPU] Add QuickGELU to CPU ( #5717 )
2024-06-21 06:39:40 +00:00
Tyler Michael Smith
3f3b6b2150
[Bugfix] Fix the CUDA version check for FP8 support in the CUTLASS kernels ( #5715 )
2024-06-20 18:36:10 +00:00
Roger Wang
ad137cd111
[Model] Port over CLIPVisionModel for VLMs ( #5591 )
2024-06-20 11:52:09 +00:00
Dipika Sikka
4a30d7e3cc
[Misc] Add per channel support for static activation quantization; update w8a8 schemes to share base classes ( #5650 )
2024-06-19 18:06:44 -04:00
Shukant Pal
59a1eb59c9
[Bugfix] Fix Phi-3 Long RoPE scaling implementation ( #5628 )
2024-06-19 01:46:38 +00:00
Dipika Sikka
95db455e7f
[Misc] Add channel-wise quantization support for w8a8 dynamic per token activation quantization ( #5542 )
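As a hedged aside on what dynamic per-token activation quantization means here (a textbook sketch, not vLLM's kernel): each token row gets its own scale, computed at runtime from its max absolute value, before rounding to int8.

    import torch

    def dynamic_per_token_quant(x: torch.Tensor):
        # One scale per token (row) so int8 covers roughly [-127, 127] per row.
        scales = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127.0
        q = torch.round(x / scales).clamp(-127, 127).to(torch.int8)
        return q, scales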
2024-06-18 12:45:05 -04:00
sroy745
fa9e385229
[Speculative Decoding 1/2 ] Add typical acceptance sampling as one of the sampling techniques in the verifier ( #5131 )
2024-06-17 21:29:09 -05:00
Kunshang Ji
728c4c8a06
[Hardware][Intel GPU] Add Intel GPU(XPU) inference backend ( #3814 )
...
Co-authored-by: Jiang Li <jiang1.li@intel.com>
Co-authored-by: Abhilash Majumder <abhilash.majumder@intel.com>
Co-authored-by: Abhilash Majumder <30946547+abhilash1910@users.noreply.github.com>
2024-06-17 11:01:25 -07:00
Dipika Sikka
890d8d960b
[Kernel] compressed-tensors marlin 24 support ( #5435 )
2024-06-17 12:32:48 -04:00
Amit Garg
9333fb8eb9
[Model] Rename Phi3 rope scaling type ( #5595 )
2024-06-17 12:04:14 -04:00
Cyrus Leung
0e9164b40a
[mypy] Enable type checking for test directory ( #5017 )
2024-06-15 04:45:31 +00:00
Robert Shaw
15985680e2
[ Misc ] Rs/compressed tensors cleanup ( #5432 )
...
Co-authored-by: mgoin <michael@neuralmagic.com>
Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com>
2024-06-14 10:01:46 -07:00
Tyler Michael Smith
703475f6c2
[Kernel] Fix CUTLASS 3.x custom broadcast load epilogue ( #5516 )
2024-06-14 09:30:15 -07:00
Tyler Michael Smith
e38042d4af
[Kernel] Disable CUTLASS kernels for fp8 ( #5505 )
2024-06-13 13:38:05 -07:00
Tyler Michael Smith
85657b5607
[Kernel] Factor out epilogues from cutlass kernels ( #5391 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: zifeitong <zifei.tong@parasail.io>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
2024-06-13 11:22:19 -07:00
wenyujin333
bd43973522
[Kernel] Tune Qwen2MoE kernel configurations with tp2,4 ( #5497 )
...
Tune Qwen2-57B-A14B configs based on #4921.
Throughput performance, A100 GPU
command: python benchmarks/benchmark_throughput.py --model=Qwen/Qwen2-57B-A14B-Instruct --input-len 1000 --output-len 50 -tp 2

benchmark | no config                            | w/ PR
tp=2      | 10.53 requests/s, 11058.17 tokens/s  | 12.47 requests/s, 13088.57 tokens/s
tp=4      | 17.77 requests/s, 18662.95 tokens/s  | 20.20 requests/s, 21212.32 tokens/s
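For context, a hedged sketch of how tuned fused_moe configs like these are consumed: the JSON files are keyed by batch size M, with Triton block sizes as values; the loader below is illustrative, not vLLM's exact code.

    import json

    def load_moe_config(path: str, m: int) -> dict:
        # Pick the tuned Triton config whose batch-size key is closest to m.
        with open(path) as f:
            configs = {int(k): v for k, v in json.load(f).items()}
        best_m = min(configs, key=lambda key: abs(key - m))
        return configs[best_m]  # e.g. {"BLOCK_SIZE_M": 64, ...}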
2024-06-13 09:01:10 -07:00
Dipika Sikka
c2637a613b
[Kernel] w4a16 support for compressed-tensors ( #5385 )
...
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
2024-06-13 10:19:56 -04:00
Woosuk Kwon
1a8bfd92d5
[Hardware] Initial TPU integration ( #5292 )
2024-06-12 11:53:03 -07:00
Nick Hill
a008629807
[Misc] Various simplifications and typing fixes ( #5368 )
2024-06-11 10:29:02 +08:00
Dipika Sikka
5884c2b454
[Misc] Update to comply with the new compressed-tensors config ( #5350 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-06-10 03:49:46 +00:00
bnellnm
5467ac3196
[Kernel][Misc] Use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops ( #5047 )
2024-06-09 16:23:30 -04:00
Michael Goin
c09dade2a2
[Misc][Breaking] Change FP8 checkpoint format from act_scale -> input_scale ( #5353 )
2024-06-08 13:54:05 -04:00
Cheng Li
e69ded7d1c
[Bug Fix] Fix the support check for FP8 CUTLASS ( #5352 )
...
Bug description:
With torch 2.4.0.dev20240603+cu121, cutlass_fp8_supported returns False, and the (capability, version) pair computed before the comparison is (90, 11111111112).
This PR fixes the support check for FP8 CUTLASS (cutlass_fp8_supported), which was introduced in https://github.com/vllm-project/vllm/pull/5183.
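The garbled version value above suggests the version parsing itself went wrong. A hedged sketch of the kind of check cutlass_fp8_supported performs (thresholds and parsing below are illustrative assumptions, not vLLM's exact code):

    import torch

    def cutlass_fp8_supported() -> bool:
        # FP8 CUTLASS kernels need a recent GPU and CUDA toolkit; encode both
        # as small integers so a (90, 121)-style comparison stays sane.
        if not torch.cuda.is_available():
            return False
        major, minor = torch.cuda.get_device_capability()
        capability = major * 10 + minor                 # e.g. (9, 0) -> 90
        cuda_major, cuda_minor = map(int, torch.version.cuda.split(".")[:2])
        version = cuda_major * 10 + cuda_minor          # e.g. "12.1" -> 121
        return capability >= 89 and version >= 121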
2024-06-08 00:42:05 +00:00
Dipika Sikka
ca3ea51bde
[Kernel] Dynamic Per-Token Activation Quantization ( #5037 )
...
Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-06-07 09:36:26 -07:00
Tyler Michael Smith
8d75fe48ca
[Kernel] Switch fp8 layers to use the CUTLASS kernels ( #5183 )
...
Switch from torch._scaled_mm to vLLM's CUTLASS FP8 kernels when supported, as we are seeing a 5-15% improvement in end-to-end performance on neuralmagic/Meta-Llama-3-8B-Instruct-FP8.
See https://docs.google.com/spreadsheets/d/1GiAnmzyGHgZ6zL_LDSTm35Bdrt4A8AaFEurDlISYYA4/ for some quick e2e benchmarks and #5144 for comparisons across different GEMM sizes.
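A minimal sketch of the dispatch described above, assuming a cutlass_scaled_mm binding for vLLM's kernel (the helper name and signatures here are assumptions, not the PR's exact code): prefer the CUTLASS FP8 GEMM when the platform supports it, otherwise fall back to torch._scaled_mm.

    import torch

    def fp8_gemm(a, b, scale_a, scale_b, out_dtype, use_cutlass: bool):
        if use_cutlass:
            # hypothetical binding to vLLM's CUTLASS FP8 kernel
            from vllm import _custom_ops as ops
            return ops.cutlass_scaled_mm(a, b, scale_a=scale_a,
                                         scale_b=scale_b, out_dtype=out_dtype)
        out = torch._scaled_mm(a, b, scale_a=scale_a, scale_b=scale_b,
                               out_dtype=out_dtype)
        # some torch versions return (output, amax); keep just the output
        return out[0] if isinstance(out, tuple) else out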
2024-06-07 08:42:35 +00:00