SangBin Cho
3521ba4f25
[Core][Model runner refactoring 1/N] Refactor attn metadata term ( #4518 )
2024-05-03 10:20:12 -07:00
alexm-nm
7038e8b803
[Kernel] Support running GPTQ 8-bit models in Marlin ( #4533 )
2024-05-02 12:56:22 -04:00
Robert Shaw
73c8d677e5
[Kernel] Marlin Expansion: Support AutoGPTQ Models with Marlin ( #3922 )
...
Co-authored-by: alexm <alexm@neuralmagic.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
2024-04-29 09:35:34 -07:00
Austin Veselka
eefeb16464
[Kernel] Full Tensor Parallelism for LoRA Layers ( #3524 )
...
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2024-04-27 00:03:48 -07:00
Philipp Moritz
12628d3c78
[Kernel] Optimize FP8 support for MoE kernel / Mixtral via static scales ( #4343 )
...
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-04-27 04:49:59 +00:00
alexm-nm
aae08249ac
[Bugfix] Fix marlin kernel crash on H100 ( #4218 )
...
This PR addresses the Marlin kernel crash on H100 that was reported in neuralmagic#187.
The crash was caused by the inline PTX assembly that introduced async_copy with streaming behavior. The fix is to use the more standard async_copy PTX (without the fractional L2 "evict_first" policy). There is no performance difference between the standard async_copy PTX and the previous version.
2024-04-24 10:35:01 -07:00
Woosuk Kwon
468d761b32
[Misc] Reduce supported Punica dtypes ( #4304 )
2024-04-23 18:54:33 -07:00
Philipp Moritz
eace8bf0b9
[Kernel] FP8 support for MoE kernel / Mixtral ( #4244 )
...
This PR is the first step towards fixing https://github.com/vllm-project/vllm/pull/3208
It implements dynamic per-tensor scaling (see https://github.com/vllm-project/vllm/pull/4118 ), so users do not need to compute activation scales on a calibration dataset, nor do they need to convert their model checkpoints; it is enough to specify the `quantization="fp8"` argument. You can try out the PR like this:
```python
from vllm import LLM, SamplingParams
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(model="mistralai/Mixtral-8x7B-Instruct-v0.1", tensor_parallel_size=2, quantization="fp8")
outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
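For intuition, dynamic per-tensor scaling derives the FP8 scale from each tensor's current maximum magnitude at runtime instead of from calibration data. A minimal PyTorch-style sketch of the idea (the function name `fp8_dynamic_quantize` is illustrative only, not the kernel's actual code path; it assumes a PyTorch build that provides `torch.float8_e4m3fn`):
```python
import torch

def fp8_dynamic_quantize(x: torch.Tensor):
    """Illustrative dynamic per-tensor FP8 (e4m3) quantization.

    The scale comes from the tensor's current max magnitude,
    so no calibration dataset is needed.
    """
    finfo = torch.finfo(torch.float8_e4m3fn)
    # Scale so that the largest magnitude maps to the FP8 max value.
    scale = x.abs().max().clamp(min=1e-12) / finfo.max
    x_fp8 = (x / scale).clamp(finfo.min, finfo.max).to(torch.float8_e4m3fn)
    # Keep the scale alongside the FP8 tensor: x is approximately x_fp8.float() * scale.
    return x_fp8, scale
```
Conceptually, the scales travel with the FP8 tensors and are folded back in after the FP8 matmul, which is why no checkpoint conversion is required.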
**Performance**: For this PR the focus is on keeping the code clean (while still getting reasonable performance); there is a set of follow-up optimizations that significantly improve performance (similar to the numbers in https://github.com/vllm-project/vllm/pull/3954 ), which we will submit as a separate PR. With this PR, the results are as follows:
(benchmark results screenshot, 2024-04-21)
**Accuracy**: MMLU accuracy with this PR on `mistralai/Mixtral-8x7B-v0.1` is as follows:
```
| Groups |Version|Filter|n-shot|Metric|Value | |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu |N/A |none | 0|acc |0.7018|± |0.0036|
| - humanities |N/A |none | 5|acc |0.6472|± |0.0065|
| - other |N/A |none | 5|acc |0.7673|± |0.0072|
| - social_sciences|N/A |none | 5|acc |0.8099|± |0.0070|
| - stem |N/A |none | 5|acc |0.6131|± |0.0083|
```
This compares favorably with the FP16 results, which are:
```
| Groups |Version|Filter|n-shot|Metric|Value | |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu |N/A |none | 0|acc |0.7020|± |0.1313|
| - humanities |N/A |none | 5|acc |0.6425|± |0.1349|
| - other |N/A |none | 5|acc |0.7744|± |0.1038|
| - social_sciences|N/A |none | 5|acc |0.8131|± |0.0695|
| - stem |N/A |none | 5|acc |0.6108|± |0.1383|
```
Happy hacking!
2024-04-24 01:18:23 +00:00
James Fleming
2b7949c1c2
AQLM CUDA support ( #3287 )
...
Co-authored-by: mgoin <michael@neuralmagic.com>
2024-04-23 13:59:33 -04:00
Shoichi Uchinami
a53222544c
[Kernel] Add punica dimension for Swallow-MS-7B LoRA ( #4134 )
2024-04-17 10:02:45 -07:00
Jee Li
989ae2538d
[Kernel] Add punica dimension for Baichuan-13B ( #4053 )
2024-04-13 07:55:05 -07:00
Antoni Baum
1e96c3341a
Add extra punica sizes to support bigger vocabs ( #4015 )
2024-04-11 22:18:57 +00:00
Antoni Baum
a10d3056da
[Core] Set linear_weights directly on the layer ( #3977 )
2024-04-11 16:35:51 -04:00
fuchen.ljl
08ccee1e83
punica fix-bgmv-kernel-640 ( #4007 )
2024-04-11 08:59:26 -07:00
Matt Wong
59a6abf3c9
[Hotfix][CI/Build][Kernel] CUDA 11.8 does not support layernorm optimizations ( #3782 )
2024-04-08 14:31:02 -07:00
Woosuk Kwon
498eb5cfa3
[Bugfix] Add kv_scale input parameter to CPU backend ( #3840 )
2024-04-04 04:33:08 +00:00
Adrian Abeyta
2ff767b513
Enable scaled FP8 (e4m3fn) KV cache on ROCm (AMD GPU) ( #3290 )
...
Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Co-authored-by: HaiShaw <hixiao@gmail.com>
Co-authored-by: AdrianAbeyta <Adrian.Abeyta@amd.com>
Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com>
Co-authored-by: root <root@gt-pla-u18-08.pla.dcgpu>
Co-authored-by: mawong-amd <156021403+mawong-amd@users.noreply.github.com>
Co-authored-by: ttbachyinsda <ttbachyinsda@outlook.com>
Co-authored-by: guofangze <guofangze@kuaishou.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: jacobthebanana <50071502+jacobthebanana@users.noreply.github.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-04-03 14:15:55 -07:00
bigPYJ1151
0e3f06fe9c
[Hardware][Intel] Add CPU inference backend ( #3634 )
...
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
Co-authored-by: Yuan Zhou <yuan.zhou@intel.com>
2024-04-01 22:07:30 -07:00
mawong-amd
b6d103542c
[Kernel] Layernorm performance optimization ( #3662 )
2024-03-30 14:26:38 -07:00
Jee Li
566b57c5c4
[Kernel] support non-zero cuda devices in punica kernels ( #3636 )
2024-03-27 00:37:42 +00:00
Jee Li
8af890a865
Enable more models to inference based on LoRA ( #3382 )
...
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2024-03-25 18:09:31 -07:00
Hanzhi Zhou
f721096d48
[BugFix] Some fixes for custom allreduce kernels ( #2760 )
2024-03-21 23:02:58 -07:00
Woosuk Kwon
9101d832e6
[Bugfix] Make moe_align_block_size AMD-compatible ( #3470 )
2024-03-18 11:26:24 -07:00
Simon Mo
8e67598aa6
[Misc] fix line length for entire codebase ( #3444 )
2024-03-16 00:36:29 -07:00
akhoroshev
78b6c4845a
Dynamically configure shared memory size for moe_align_block_size_kernel ( #3376 )
2024-03-14 18:18:07 -07:00
Terry
7e9bd08f60
Add batched RoPE kernel ( #3095 )
2024-03-13 13:45:26 -07:00
Or Sharir
ae0ccb4017
Add missing kernel for CodeLlama-34B on A/H100 (no tensor parallelism) when using Multi-LoRA. ( #3350 )
2024-03-13 12:18:25 -07:00
Woosuk Kwon
602358f8a8
Add kernel for GeGLU with approximate GELU ( #3337 )
2024-03-12 22:06:17 -07:00
kliuae
c9415c19d3
[ROCm] Fix warp and lane calculation in blockReduceSum ( #3321 )
2024-03-11 13:14:07 -07:00
Douglas Lehr
e4a28e5316
[ROCM] Fix blockReduceSum to use correct warp counts for ROCm and CUDA ( #3262 )
2024-03-10 15:27:45 -07:00
Terry
0bba88df03
Enhance lora tests with more layer and rank variations ( #3243 )
2024-03-09 17:14:16 -08:00
whyiug
c59e120c55
Feature add lora support for Qwen2 ( #3177 )
2024-03-07 21:58:24 -08:00
Robert Shaw
c0c2335ce0
Integrate Marlin Kernels for Int4 GPTQ inference ( #2497 )
...
Co-authored-by: Robert Shaw <114415538+rib-2@users.noreply.github.com>
Co-authored-by: alexm <alexm@neuralmagic.com>
2024-03-01 12:47:51 -08:00
CHU Tianxiang
01a5d18a53
Add Support for 2/3/8-bit GPTQ Quantization Models ( #2330 )
2024-02-28 21:52:23 -08:00
Woosuk Kwon
929b4f2973
Add LoRA support for Gemma ( #3050 )
2024-02-28 13:03:28 -08:00
Woosuk Kwon
d6e4a130b0
[Minor] Remove gather_cached_kv kernel ( #3043 )
2024-02-26 15:00:54 -08:00
Woosuk Kwon
fd5dcc5c81
Optimize GeGLU layer in Gemma ( #2975 )
2024-02-21 20:17:52 -08:00
Rex
563836496a
Refactor 2 awq gemm kernels into m16nXk32 ( #2723 )
...
Co-authored-by: Chunan Zeng <chunanzeng@Chunans-Air.attlocal.net>
2024-02-12 11:02:17 -08:00
Woosuk Kwon
f0d4e14557
Add fused top-K softmax kernel for MoE ( #2769 )
2024-02-05 17:38:02 -08:00
zhaoyang-star
923797fea4
Fix compile error when using rocm ( #2648 )
2024-02-01 09:35:09 -08:00
Philipp Moritz
ab40644669
Fused MOE for Mixtral ( #2542 )
...
Co-authored-by: chen shen <scv119@gmail.com>
2024-01-29 22:43:37 -08:00
wangding zeng
5d60def02c
DeepseekMoE support with Fused MoE kernel ( #2453 )
...
Co-authored-by: roy <jasonailu87@gmail.com>
2024-01-29 21:19:48 -08:00
Hanzhi Zhou
1b20639a43
No repeated IPC open ( #2642 )
2024-01-29 10:46:29 -08:00
zhaoyang-star
9090bf02e7
Support FP8-E5M2 KV Cache ( #2279 )
...
Co-authored-by: zhaoyang <zhao.yang16@zte.com.cn>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2024-01-28 16:43:54 -08:00
Woosuk Kwon
f8ecb84c02
Speed up Punica compilation ( #2632 )
2024-01-27 17:46:56 -08:00
Hanzhi Zhou
380170038e
Implement custom all reduce kernels ( #2192 )
2024-01-27 12:46:35 -08:00
Casper
beb89f68b4
AWQ: Up to 2.66x higher throughput ( #2566 )
2024-01-26 23:53:17 -08:00
Hongxia Yang
6b7de1a030
[ROCm] add support to ROCm 6.0 and MI300 ( #2274 )
2024-01-26 12:41:10 -08:00
Vladimir
5265631d15
use a correct device when creating OptionalCUDAGuard ( #2583 )
2024-01-25 23:48:17 -08:00
Antoni Baum
9b945daaf1
[Experimental] Add multi-LoRA support ( #1804 )
...
Co-authored-by: Chen Shen <scv119@gmail.com>
Co-authored-by: Shreyas Krishnaswamy <shrekris@anyscale.com>
Co-authored-by: Avnish Narayan <avnish@anyscale.com>
2024-01-23 15:26:37 -08:00
Woosuk Kwon
6ef00b03a2
Enable CUDA graph for GPTQ & SqueezeLLM ( #2318 )
2024-01-03 09:52:29 -08:00
Jee Li
77af974b40
[FIX] Support non-zero CUDA devices in custom kernels ( #1959 )
2024-01-02 19:09:59 -08:00
kliuae
1b7c791d60
[ROCm] Fixes for GPTQ on ROCm ( #2180 )
2023-12-18 10:41:04 -08:00
Woosuk Kwon
76a7983b23
[BugFix] Fix RoPE kernel on long sequences ( #2164 )
2023-12-17 17:09:10 -08:00
CHU Tianxiang
0fbfc4b81b
Add GPTQ support ( #916 )
2023-12-15 03:04:22 -08:00
Mingcan Xiang
614856da25
Avoid multiple redefinition ( #1817 )
2023-12-14 09:35:58 -08:00
wbn
dacaf5a400
Replace head_mapping params with num_kv_heads to attention kernel. ( #1997 )
...
Co-authored-by: wangguoya <wangguoya@baidu.com>
Co-authored-by: Yang Zhao <zhaoyangstar@foxmail.com>
2023-12-10 10:12:53 -08:00
TJian
6ccc0bfffb
Merge EmbeddedLLM/vllm-rocm into vLLM main ( #1836 )
...
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
Co-authored-by: Amir Balwel <amoooori04@gmail.com>
Co-authored-by: root <kuanfu.liu@akirakan.com>
Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com>
Co-authored-by: kuanfu <kuanfu.liu@embeddedllm.com>
Co-authored-by: miloice <17350011+kliuae@users.noreply.github.com>
2023-12-07 23:16:52 -08:00
Yanming W
e0c6f556e8
[Build] Avoid building too many extensions ( #1624 )
2023-11-23 16:31:19 -08:00
ljss
e1054247ba
[Optimization] Implement fused add rmsnorm ( #1667 )
2023-11-18 18:18:02 -08:00
Antoni Baum
9f669a9a7c
Support YaRN models ( #1264 )
...
Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
Co-authored-by: Viktor Ferenczi <viktor@ferenczi.eu>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2023-11-03 14:12:48 -07:00
Woosuk Kwon
0ce8647dc5
Fix integer overflows in attention & cache ops ( #1514 )
2023-10-31 15:19:30 -07:00
chooper1
1f24755bf8
Support SqueezeLLM ( #1326 )
...
Co-authored-by: squeeze-ai-lab <squeezeailab.bair@gmail.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2023-10-21 23:14:59 -07:00
Woosuk Kwon
c1376e0f82
Change scheduler & input tensor shape ( #1381 )
2023-10-16 17:48:42 -07:00
Woosuk Kwon
928de46888
Implement PagedAttention V2 ( #1348 )
2023-10-16 00:59:57 -07:00
Woosuk Kwon
29678cd213
Minor fix on AWQ kernel launch ( #1356 )
2023-10-15 21:53:56 -07:00
CHU Tianxiang
980dd4a2c4
Fix overflow in awq kernel ( #1295 )
...
Co-authored-by: 楚天翔 <tianxiang.ctx@alibaba-inc.com>
2023-10-11 00:19:53 -07:00
twaka
8285736840
workaround of AWQ for Turing GPUs ( #1252 )
2023-10-10 19:48:16 -07:00
Liang
ebe4d1db3a
Fix boundary check in paged attention kernel ( #1241 )
2023-10-01 11:35:06 -07:00
Antoni Baum
cf5cb1e33e
Allocate more shared memory to attention kernel ( #1154 )
2023-09-26 22:27:13 -07:00
Woosuk Kwon
2b1c116b5a
Add minimum capability requirement for AWQ ( #1064 )
2023-09-18 12:02:01 -07:00
Woosuk Kwon
e3e79e9e8a
Implement AWQ quantization support for LLaMA ( #1032 )
...
Co-authored-by: Robert Irvine <robert@seamlessml.com>
Co-authored-by: root <rirv938@gmail.com>
Co-authored-by: Casper <casperbh.96@gmail.com>
Co-authored-by: julian-q <julianhquevedo@gmail.com>
2023-09-16 00:03:37 -07:00
Zhuohan Li
db09d4ad83
[FIX] Fix Alibi implementation in PagedAttention kernel ( #945 )
...
* [FIX] Fix Alibi implementation in PagedAttention kernel
* Fix test_attention
* Fix
---------
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: Oliver-ss <yuansongwx@outlook.com>
2023-09-07 15:53:14 -07:00
Woosuk Kwon
320a622ec4
[BugFix] Implement RoPE for GPT-J ( #941 )
2023-09-06 11:54:33 +09:00
Woosuk Kwon
bf87484efa
[BugFix] Fix NaN errors in paged attention kernel ( #936 )
2023-09-04 09:20:06 +09:00
Woosuk Kwon
8ce9c50d40
Avoid compiling kernels for double data type ( #933 )
2023-09-02 14:59:47 +09:00
Woosuk Kwon
d64bf1646c
Implement approximate GELU kernels ( #828 )
2023-08-23 07:43:21 +09:00
Dean Leitersdorf
79af7e96a0
[OPTIMIZATION] Optimizes the single_query_cached_kv_attention kernel ( #420 )
2023-08-04 10:57:29 -07:00
Zhuohan Li
1b0bd0fe8a
Add Falcon support (new) ( #592 )
2023-08-02 14:04:39 -07:00
Zhuohan Li
6fc2a38b11
Add support for LLaMA-2 ( #505 )
2023-07-20 11:38:27 -07:00
Zhuohan Li
96853af5a8
Optimize MQA Kernel ( #452 )
2023-07-14 20:06:40 -04:00
Andre Slavescu
c894836108
[Model] Add support for GPT-J ( #226 )
...
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2023-07-08 17:55:16 -07:00
Woosuk Kwon
404422f42e
[Model] Add support for MPT ( #334 )
2023-07-03 16:47:53 -07:00
Woosuk Kwon
e41f06702c
Add support for BLOOM ( #331 )
2023-07-03 13:12:35 -07:00
Woosuk Kwon
0b98ba15c7
Change the name to vLLM ( #150 )
2023-06-17 03:07:40 -07:00
Woosuk Kwon
e38074b1e6
Support FP32 ( #141 )
2023-06-07 00:40:21 -07:00
Woosuk Kwon
d721168449
Improve setup script & Add a guard for bfloat16 kernels ( #130 )
2023-05-27 00:59:32 -07:00
Woosuk Kwon
667ba3995c
Add copyright headers to source files adapted from FT ( #104 )
2023-05-14 22:19:19 -07:00
Woosuk Kwon
130d5fd8c7
Fix a bug in attention kernel ( #68 )
2023-05-04 02:56:09 -07:00
Woosuk Kwon
e070829ae8
Support bfloat16 data type ( #54 )
2023-05-03 14:09:44 -07:00
Woosuk Kwon
436e523bf1
Refactor attention kernels ( #53 )
2023-05-03 13:40:13 -07:00
Woosuk Kwon
a96d63c21d
Add support for GPT-NeoX (Pythia) ( #50 )
2023-04-28 00:32:10 -07:00
Woosuk Kwon
0f4b32199e
Support various block sizes & Change default block size to 16 ( #38 )
2023-04-15 09:03:24 -07:00
Siyuan (Ryans) Zhuang
e3cec88aa5
Memcpy kernel for flash attention ( #29 )
...
* optimize
* add benchmark
* add assert
* add test
2023-04-10 18:22:49 -07:00
Woosuk Kwon
b9926f7f66
Support block size 32 ( #35 )
2023-04-09 23:07:18 -07:00
Woosuk Kwon
c267b1a02c
Add query stride to multi_query_cached_kv_attention & Add kernel benchmark script ( #27 )
...
* Add query stride to multi_query_cached_kv_attention
* Add kernel benchmark script
2023-04-08 13:36:09 -07:00
Woosuk Kwon
0f40557af6
Implement block copy kernel to optimize beam search ( #32 )
2023-04-07 17:45:07 -07:00
Siyuan (Ryans) Zhuang
21b3671bbc
Basic attention kernel that supports cached KV + (multi-)prompts ( #24 )
2023-04-04 20:34:46 -07:00
Woosuk Kwon
897cb2ae28
Optimize data movement ( #20 )
2023-04-02 00:30:17 -07:00
Woosuk Kwon
09e9245478
Add custom kernel for RMS normalization ( #16 )
2023-04-01 00:51:22 +08:00
Woosuk Kwon
88c0268a18
Implement custom kernel for LLaMA rotary embedding ( #14 )
2023-03-30 11:04:21 -07:00
Woosuk Kwon
cfae35b861
Add miscellaneous updates ( #8 )
2023-03-13 13:48:38 -07:00
Woosuk Kwon
1a7eb7da61
Support beam search & parallel generation ( #7 )
2023-03-10 09:58:21 -08:00
Woosuk Kwon
0deacbce6e
Implement single_query_cached_kv_attention kernel ( #3 )
2023-03-01 15:02:19 -08:00
Woosuk Kwon
c413c41cda
Add reshape_and_cache op
2023-02-18 19:22:57 +00:00
Woosuk Kwon
ffad4e1e03
cache_kernel -> cache_kernels
2023-02-16 20:05:45 +00:00
Woosuk Kwon
6d2f74efb3
Remove redundant fn
2023-02-16 09:24:42 +00:00
Woosuk Kwon
6f058c7ba8
Implement cache ops
2023-02-16 07:47:03 +00:00