SangBin Cho
3521ba4f25
[Core][Model runner refactoring 1/N] Refactor attn metadata term ( #4518 )
2024-05-03 10:20:12 -07:00
alexm-nm
7038e8b803
[Kernel] Support running GPTQ 8-bit models in Marlin ( #4533 )
2024-05-02 12:56:22 -04:00
Robert Shaw
73c8d677e5
[Kernel] Marlin Expansion: Support AutoGPTQ Models with Marlin ( #3922 )
...
Co-authored-by: alexm <alexm@neuralmagic.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
2024-04-29 09:35:34 -07:00
Austin Veselka
eefeb16464
[Kernel] Full Tensor Parallelism for LoRA Layers ( #3524 )
...
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2024-04-27 00:03:48 -07:00
Philipp Moritz
12628d3c78
[Kernel] Optimize FP8 support for MoE kernel / Mixtral via static scales ( #4343 )
...
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-04-27 04:49:59 +00:00
alexm-nm
aae08249ac
[Bugfix] Fix marlin kernel crash on H100 ( #4218 )
...
This PR addresses the Marlin kernel crash on H100 that was reported in neuralmagic#187.
The crash was caused by the inline PTX assembly that introduced async_copy with streaming behavior. The fix is to use the more standard async_copy PTX (without the fractional L2 "evict_first" policy). There is no performance difference between the standard async_copy PTX and the previous version.
2024-04-24 10:35:01 -07:00
Woosuk Kwon
468d761b32
[Misc] Reduce supported Punica dtypes ( #4304 )
2024-04-23 18:54:33 -07:00
Philipp Moritz
eace8bf0b9
[Kernel] FP8 support for MoE kernel / Mixtral ( #4244 )
...
This PR is the first step towards fixing https://github.com/vllm-project/vllm/pull/3208
It implements dynamic per-tensor scaling (see https://github.com/vllm-project/vllm/pull/4118 ), so users do not need to compute activation scales on a calibration dataset, nor do they need to convert their model checkpoints; it is enough to specify the `quantization="fp8"` argument. You can try out the PR like this:
```python
from vllm import LLM, SamplingParams
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(model="mistralai/Mixtral-8x7B-Instruct-v0.1", tensor_parallel_size=2, quantization="fp8")
outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
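For intuition, dynamic per-tensor scaling derives the FP8 scale from each tensor's current maximum magnitude at runtime instead of from calibration data. A minimal PyTorch-style sketch of the idea (the function name `fp8_dynamic_quantize` is illustrative only, not the kernel's actual code path; it assumes a PyTorch build that provides `torch.float8_e4m3fn`):
```python
import torch

def fp8_dynamic_quantize(x: torch.Tensor):
    """Illustrative dynamic per-tensor FP8 (e4m3) quantization.

    The scale comes from the tensor's current max magnitude,
    so no calibration dataset is needed.
    """
    finfo = torch.finfo(torch.float8_e4m3fn)
    # Scale so that the largest magnitude maps to the FP8 max value.
    scale = x.abs().max().clamp(min=1e-12) / finfo.max
    x_fp8 = (x / scale).clamp(finfo.min, finfo.max).to(torch.float8_e4m3fn)
    # Keep the scale alongside the FP8 tensor: x is approximately x_fp8.float() * scale.
    return x_fp8, scale
```
Conceptually, the scales travel with the FP8 tensors and are folded back in after the FP8 matmul, which is why no checkpoint conversion is required.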
**Performance**: For this PR the focus is on keeping the code clean (while still getting reasonable performance); there is a set of follow-up optimizations that significantly improve performance (similar to the numbers in https://github.com/vllm-project/vllm/pull/3954 ), which we will submit as a separate PR. With this PR, the results are as follows:
(benchmark results screenshot, 2024-04-21)
**Accuracy**: MMLU accuracy with this PR on `mistralai/Mixtral-8x7B-v0.1` is as follows:
```
| Groups |Version|Filter|n-shot|Metric|Value | |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu |N/A |none | 0|acc |0.7018|± |0.0036|
| - humanities |N/A |none | 5|acc |0.6472|± |0.0065|
| - other |N/A |none | 5|acc |0.7673|± |0.0072|
| - social_sciences|N/A |none | 5|acc |0.8099|± |0.0070|
| - stem |N/A |none | 5|acc |0.6131|± |0.0083|
```
This compares favorably with the FP16 results, which are:
```
| Groups |Version|Filter|n-shot|Metric|Value | |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu |N/A |none | 0|acc |0.7020|± |0.1313|
| - humanities |N/A |none | 5|acc |0.6425|± |0.1349|
| - other |N/A |none | 5|acc |0.7744|± |0.1038|
| - social_sciences|N/A |none | 5|acc |0.8131|± |0.0695|
| - stem |N/A |none | 5|acc |0.6108|± |0.1383|
```
Happy hacking!
2024-04-24 01:18:23 +00:00
James Fleming
2b7949c1c2
AQLM CUDA support ( #3287 )
...
Co-authored-by: mgoin <michael@neuralmagic.com>
2024-04-23 13:59:33 -04:00
Shoichi Uchinami
a53222544c
[Kernel] Add punica dimension for Swallow-MS-7B LoRA ( #4134 )
2024-04-17 10:02:45 -07:00
Jee Li
989ae2538d
[Kernel] Add punica dimension for Baichuan-13B ( #4053 )
2024-04-13 07:55:05 -07:00
Antoni Baum
1e96c3341a
Add extra punica sizes to support bigger vocabs ( #4015 )
2024-04-11 22:18:57 +00:00
Antoni Baum
a10d3056da
[Core] Set linear_weights directly on the layer ( #3977 )
2024-04-11 16:35:51 -04:00
fuchen.ljl
08ccee1e83
punica fix-bgmv-kernel-640 ( #4007 )
2024-04-11 08:59:26 -07:00
Matt Wong
59a6abf3c9
[Hotfix][CI/Build][Kernel] CUDA 11.8 does not support layernorm optimizations ( #3782 )
2024-04-08 14:31:02 -07:00
Woosuk Kwon
498eb5cfa3
[Bugfix] Add kv_scale input parameter to CPU backend ( #3840 )
2024-04-04 04:33:08 +00:00
Adrian Abeyta
2ff767b513
Enable scaled FP8 (e4m3fn) KV cache on ROCm (AMD GPU) ( #3290 )
...
Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Co-authored-by: HaiShaw <hixiao@gmail.com>
Co-authored-by: AdrianAbeyta <Adrian.Abeyta@amd.com>
Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com>
Co-authored-by: root <root@gt-pla-u18-08.pla.dcgpu>
Co-authored-by: mawong-amd <156021403+mawong-amd@users.noreply.github.com>
Co-authored-by: ttbachyinsda <ttbachyinsda@outlook.com>
Co-authored-by: guofangze <guofangze@kuaishou.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: jacobthebanana <50071502+jacobthebanana@users.noreply.github.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-04-03 14:15:55 -07:00
bigPYJ1151
0e3f06fe9c
[Hardware][Intel] Add CPU inference backend ( #3634 )
...
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
Co-authored-by: Yuan Zhou <yuan.zhou@intel.com>
2024-04-01 22:07:30 -07:00
mawong-amd
b6d103542c
[Kernel] Layernorm performance optimization ( #3662 )
2024-03-30 14:26:38 -07:00
Jee Li
566b57c5c4
[Kernel] support non-zero cuda devices in punica kernels ( #3636 )
2024-03-27 00:37:42 +00:00
Jee Li
8af890a865
Enable more models to inference based on LoRA ( #3382 )
...
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2024-03-25 18:09:31 -07:00
Hanzhi Zhou
f721096d48
[BugFix] Some fixes for custom allreduce kernels ( #2760 )
2024-03-21 23:02:58 -07:00
Woosuk Kwon
9101d832e6
[Bugfix] Make moe_align_block_size AMD-compatible ( #3470 )
2024-03-18 11:26:24 -07:00
Simon Mo
8e67598aa6
[Misc] fix line length for entire codebase ( #3444 )
2024-03-16 00:36:29 -07:00
akhoroshev
78b6c4845a
Dynamically configure shared memory size for moe_align_block_size_kernel ( #3376 )
2024-03-14 18:18:07 -07:00
Terry
7e9bd08f60
Add batched RoPE kernel ( #3095 )
2024-03-13 13:45:26 -07:00
Or Sharir
ae0ccb4017
Add missing kernel for CodeLlama-34B on A/H100 (no tensor parallelism) when using Multi-LoRA. ( #3350 )
2024-03-13 12:18:25 -07:00
Woosuk Kwon
602358f8a8
Add kernel for GeGLU with approximate GELU ( #3337 )
2024-03-12 22:06:17 -07:00
kliuae
c9415c19d3
[ROCm] Fix warp and lane calculation in blockReduceSum ( #3321 )
2024-03-11 13:14:07 -07:00
Douglas Lehr
e4a28e5316
[ROCM] Fix blockReduceSum to use correct warp counts for ROCm and CUDA ( #3262 )
2024-03-10 15:27:45 -07:00
Terry
0bba88df03
Enhance lora tests with more layer and rank variations ( #3243 )
2024-03-09 17:14:16 -08:00
whyiug
c59e120c55
Feature add lora support for Qwen2 ( #3177 )
2024-03-07 21:58:24 -08:00
Robert Shaw
c0c2335ce0
Integrate Marlin Kernels for Int4 GPTQ inference ( #2497 )
...
Co-authored-by: Robert Shaw <114415538+rib-2@users.noreply.github.com>
Co-authored-by: alexm <alexm@neuralmagic.com>
2024-03-01 12:47:51 -08:00
CHU Tianxiang
01a5d18a53
Add Support for 2/3/8-bit GPTQ Quantization Models ( #2330 )
2024-02-28 21:52:23 -08:00
Woosuk Kwon
929b4f2973
Add LoRA support for Gemma ( #3050 )
2024-02-28 13:03:28 -08:00
Woosuk Kwon
d6e4a130b0
[Minor] Remove gather_cached_kv kernel ( #3043 )
2024-02-26 15:00:54 -08:00
Woosuk Kwon
fd5dcc5c81
Optimize GeGLU layer in Gemma ( #2975 )
2024-02-21 20:17:52 -08:00
Rex
563836496a
Refactor 2 awq gemm kernels into m16nXk32 ( #2723 )
...
Co-authored-by: Chunan Zeng <chunanzeng@Chunans-Air.attlocal.net>
2024-02-12 11:02:17 -08:00
Woosuk Kwon
f0d4e14557
Add fused top-K softmax kernel for MoE ( #2769 )
2024-02-05 17:38:02 -08:00
zhaoyang-star
923797fea4
Fix compile error when using rocm ( #2648 )
2024-02-01 09:35:09 -08:00
Philipp Moritz
ab40644669
Fused MOE for Mixtral ( #2542 )
...
Co-authored-by: chen shen <scv119@gmail.com>
2024-01-29 22:43:37 -08:00
wangding zeng
5d60def02c
DeepseekMoE support with Fused MoE kernel ( #2453 )
...
Co-authored-by: roy <jasonailu87@gmail.com>
2024-01-29 21:19:48 -08:00
Hanzhi Zhou
1b20639a43
No repeated IPC open ( #2642 )
2024-01-29 10:46:29 -08:00
zhaoyang-star
9090bf02e7
Support FP8-E5M2 KV Cache ( #2279 )
...
Co-authored-by: zhaoyang <zhao.yang16@zte.com.cn>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2024-01-28 16:43:54 -08:00
Woosuk Kwon
f8ecb84c02
Speed up Punica compilation ( #2632 )
2024-01-27 17:46:56 -08:00
Hanzhi Zhou
380170038e
Implement custom all reduce kernels ( #2192 )
2024-01-27 12:46:35 -08:00
Casper
beb89f68b4
AWQ: Up to 2.66x higher throughput ( #2566 )
2024-01-26 23:53:17 -08:00
Hongxia Yang
6b7de1a030
[ROCm] add support to ROCm 6.0 and MI300 ( #2274 )
2024-01-26 12:41:10 -08:00
Vladimir
5265631d15
use a correct device when creating OptionalCUDAGuard ( #2583 )
2024-01-25 23:48:17 -08:00
Antoni Baum
9b945daaf1
[Experimental] Add multi-LoRA support ( #1804 )
...
Co-authored-by: Chen Shen <scv119@gmail.com>
Co-authored-by: Shreyas Krishnaswamy <shrekris@anyscale.com>
Co-authored-by: Avnish Narayan <avnish@anyscale.com>
2024-01-23 15:26:37 -08:00
Woosuk Kwon
6ef00b03a2
Enable CUDA graph for GPTQ & SqueezeLLM ( #2318 )
2024-01-03 09:52:29 -08:00
Jee Li
77af974b40
[FIX] Support non-zero CUDA devices in custom kernels ( #1959 )
2024-01-02 19:09:59 -08:00
kliuae
1b7c791d60
[ROCm] Fixes for GPTQ on ROCm ( #2180 )
2023-12-18 10:41:04 -08:00
Woosuk Kwon
76a7983b23
[BugFix] Fix RoPE kernel on long sequences ( #2164 )
2023-12-17 17:09:10 -08:00
CHU Tianxiang
0fbfc4b81b
Add GPTQ support ( #916 )
2023-12-15 03:04:22 -08:00
Mingcan Xiang
614856da25
Avoid multiple redefinition ( #1817 )
2023-12-14 09:35:58 -08:00
wbn
dacaf5a400
Replace head_mapping params with num_kv_heads to attention kernel. ( #1997 )
...
Co-authored-by: wangguoya <wangguoya@baidu.com>
Co-authored-by: Yang Zhao <zhaoyangstar@foxmail.com>
2023-12-10 10:12:53 -08:00
TJian
6ccc0bfffb
Merge EmbeddedLLM/vllm-rocm into vLLM main ( #1836 )
...
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
Co-authored-by: Amir Balwel <amoooori04@gmail.com>
Co-authored-by: root <kuanfu.liu@akirakan.com>
Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com>
Co-authored-by: kuanfu <kuanfu.liu@embeddedllm.com>
Co-authored-by: miloice <17350011+kliuae@users.noreply.github.com>
2023-12-07 23:16:52 -08:00
Yanming W
e0c6f556e8
[Build] Avoid building too many extensions ( #1624 )
2023-11-23 16:31:19 -08:00
ljss
e1054247ba
[Optimization] Implement fused add rmsnorm ( #1667 )
2023-11-18 18:18:02 -08:00
Antoni Baum
9f669a9a7c
Support YaRN models ( #1264 )
...
Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
Co-authored-by: Viktor Ferenczi <viktor@ferenczi.eu>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2023-11-03 14:12:48 -07:00
Woosuk Kwon
0ce8647dc5
Fix integer overflows in attention & cache ops ( #1514 )
2023-10-31 15:19:30 -07:00
chooper1
1f24755bf8
Support SqueezeLLM ( #1326 )
...
Co-authored-by: squeeze-ai-lab <squeezeailab.bair@gmail.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2023-10-21 23:14:59 -07:00
Woosuk Kwon
c1376e0f82
Change scheduler & input tensor shape ( #1381 )
2023-10-16 17:48:42 -07:00
Woosuk Kwon
928de46888
Implement PagedAttention V2 ( #1348 )
2023-10-16 00:59:57 -07:00
Woosuk Kwon
29678cd213
Minor fix on AWQ kernel launch ( #1356 )
2023-10-15 21:53:56 -07:00
CHU Tianxiang
980dd4a2c4
Fix overflow in awq kernel ( #1295 )
...
Co-authored-by: 楚天翔 <tianxiang.ctx@alibaba-inc.com>
2023-10-11 00:19:53 -07:00
twaka
8285736840
workaround of AWQ for Turing GPUs ( #1252 )
2023-10-10 19:48:16 -07:00
Liang
ebe4d1db3a
Fix boundary check in paged attention kernel ( #1241 )
2023-10-01 11:35:06 -07:00
Antoni Baum
cf5cb1e33e
Allocate more shared memory to attention kernel ( #1154 )
2023-09-26 22:27:13 -07:00
Woosuk Kwon
2b1c116b5a
Add minimum capability requirement for AWQ ( #1064 )
2023-09-18 12:02:01 -07:00
Woosuk Kwon
e3e79e9e8a
Implement AWQ quantization support for LLaMA ( #1032 )
...
Co-authored-by: Robert Irvine <robert@seamlessml.com>
Co-authored-by: root <rirv938@gmail.com>
Co-authored-by: Casper <casperbh.96@gmail.com>
Co-authored-by: julian-q <julianhquevedo@gmail.com>
2023-09-16 00:03:37 -07:00
Zhuohan Li
db09d4ad83
[FIX] Fix Alibi implementation in PagedAttention kernel ( #945 )
...
* [FIX] Fix Alibi implementation in PagedAttention kernel
* Fix test_attention
* Fix
---------
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: Oliver-ss <yuansongwx@outlook.com>
2023-09-07 15:53:14 -07:00
Woosuk Kwon
320a622ec4
[BugFix] Implement RoPE for GPT-J ( #941 )
2023-09-06 11:54:33 +09:00
Woosuk Kwon
bf87484efa
[BugFix] Fix NaN errors in paged attention kernel ( #936 )
2023-09-04 09:20:06 +09:00
Woosuk Kwon
8ce9c50d40
Avoid compiling kernels for double data type ( #933 )
2023-09-02 14:59:47 +09:00
Woosuk Kwon
d64bf1646c
Implement approximate GELU kernels ( #828 )
2023-08-23 07:43:21 +09:00
Dean Leitersdorf
79af7e96a0
[OPTIMIZATION] Optimizes the single_query_cached_kv_attention kernel ( #420 )
2023-08-04 10:57:29 -07:00
Zhuohan Li
1b0bd0fe8a
Add Falcon support (new) ( #592 )
2023-08-02 14:04:39 -07:00
Zhuohan Li
6fc2a38b11
Add support for LLaMA-2 ( #505 )
2023-07-20 11:38:27 -07:00
Zhuohan Li
96853af5a8
Optimize MQA Kernel ( #452 )
2023-07-14 20:06:40 -04:00
Andre Slavescu
c894836108
[Model] Add support for GPT-J ( #226 )
...
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2023-07-08 17:55:16 -07:00
Woosuk Kwon
404422f42e
[Model] Add support for MPT ( #334 )
2023-07-03 16:47:53 -07:00
Woosuk Kwon
e41f06702c
Add support for BLOOM ( #331 )
2023-07-03 13:12:35 -07:00
Woosuk Kwon
0b98ba15c7
Change the name to vLLM ( #150 )
2023-06-17 03:07:40 -07:00
Woosuk Kwon
e38074b1e6
Support FP32 ( #141 )
2023-06-07 00:40:21 -07:00
Woosuk Kwon
d721168449
Improve setup script & Add a guard for bfloat16 kernels ( #130 )
2023-05-27 00:59:32 -07:00
Woosuk Kwon
667ba3995c
Add copyright headers to source files adapted from FT ( #104 )
2023-05-14 22:19:19 -07:00
Woosuk Kwon
130d5fd8c7
Fix a bug in attention kernel ( #68 )
2023-05-04 02:56:09 -07:00
Woosuk Kwon
e070829ae8
Support bfloat16 data type ( #54 )
2023-05-03 14:09:44 -07:00
Woosuk Kwon
436e523bf1
Refactor attention kernels ( #53 )
2023-05-03 13:40:13 -07:00
Woosuk Kwon
a96d63c21d
Add support for GPT-NeoX (Pythia) ( #50 )
2023-04-28 00:32:10 -07:00
Woosuk Kwon
0f4b32199e
Support various block sizes & Change default block size to 16 ( #38 )
2023-04-15 09:03:24 -07:00
Siyuan (Ryans) Zhuang
e3cec88aa5
Memcpy kernel for flash attention ( #29 )
...
* optimize
* add benchmark
* add assert
* add test
2023-04-10 18:22:49 -07:00
Woosuk Kwon
b9926f7f66
Support block size 32 ( #35 )
2023-04-09 23:07:18 -07:00
Woosuk Kwon
c267b1a02c
Add query stride to multi_query_cached_kv_attention & Add kernel benchmark script ( #27 )
...
* Add query stride to multi_query_cached_kv_attention
* Add kernel benchmark script
2023-04-08 13:36:09 -07:00
Woosuk Kwon
0f40557af6
Implement block copy kernel to optimize beam search ( #32 )
2023-04-07 17:45:07 -07:00
Siyuan (Ryans) Zhuang
21b3671bbc
Basic attention kernel that supports cached KV + (multi-)prompts ( #24 )
2023-04-04 20:34:46 -07:00
Woosuk Kwon
897cb2ae28
Optimize data movement ( #20 )
2023-04-02 00:30:17 -07:00
Woosuk Kwon
09e9245478
Add custom kernel for RMS normalization ( #16 )
2023-04-01 00:51:22 +08:00
Woosuk Kwon
88c0268a18
Implement custom kernel for LLaMA rotary embedding ( #14 )
2023-03-30 11:04:21 -07:00
Woosuk Kwon
cfae35b861
Add miscellaneous updates ( #8 )
2023-03-13 13:48:38 -07:00
Woosuk Kwon
1a7eb7da61
Support beam search & parallel generation ( #7 )
2023-03-10 09:58:21 -08:00
Woosuk Kwon
0deacbce6e
Implement single_query_cached_kv_attention kernel ( #3 )
2023-03-01 15:02:19 -08:00
Woosuk Kwon
c413c41cda
Add reshape_and_cache op
2023-02-18 19:22:57 +00:00
Woosuk Kwon
ffad4e1e03
cache_kernel -> cache_kernels
2023-02-16 20:05:45 +00:00
Woosuk Kwon
6d2f74efb3
Remove redundant fn
2023-02-16 09:24:42 +00:00
Woosuk Kwon
6f058c7ba8
Implement cache ops
2023-02-16 07:47:03 +00:00