vllm/csrc
Philipp Moritz eace8bf0b9
[Kernel] FP8 support for MoE kernel / Mixtral (#4244)
This PR is the first step towards fixing https://github.com/vllm-project/vllm/pull/3208

It implements dynamic per-tensor scaling (see https://github.com/vllm-project/vllm/pull/4118), so users neither need to compute activation scales on a calibration dataset nor convert their model checkpoints; it is enough to specify the `quantization="fp8"` argument. You can try out the PR like this:

```python
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model="mistralai/Mixtral-8x7B-Instruct-v0.1", tensor_parallel_size=2, quantization="fp8")

outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
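To make the "dynamic per-tensor scaling" idea concrete, here is a minimal PyTorch sketch of the general technique (an illustration, not the kernel code in this PR; `dynamic_per_tensor_fp8_quant` is a hypothetical helper name): the scale is recomputed from each tensor's own maximum magnitude at runtime, which is why no calibration dataset or checkpoint conversion is required.

```python
import torch

def dynamic_per_tensor_fp8_quant(x: torch.Tensor):
    # FP8 e4m3 has a maximum representable magnitude of 448.
    finfo = torch.finfo(torch.float8_e4m3fn)
    # Derive the scale from this tensor's own max magnitude (no calibration data needed).
    scale = x.abs().max().float().clamp(min=1e-12) / finfo.max
    x_fp8 = (x.float() / scale).clamp(min=finfo.min, max=finfo.max).to(torch.float8_e4m3fn)
    return x_fp8, scale  # downstream kernels dequantize by multiplying with `scale`

# The scale is computed on the fly for whatever activations show up at runtime.
x = torch.randn(16, 4096, dtype=torch.float16)
x_fp8, scale = dynamic_per_tensor_fp8_quant(x)
print(x_fp8.dtype, scale.item())
```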

**Performance**: For this PR, the focus is on keeping the code clean while still achieving reasonable performance. A number of optimizations that significantly improve performance (similar to the numbers in https://github.com/vllm-project/vllm/pull/3954) will be submitted in a follow-up PR. With this PR, the results are as follows:

<img width="725" alt="Screenshot 2024-04-21 at 1 31 50 PM" src="https://github.com/vllm-project/vllm/assets/113316/d8fe1118-07a0-4d4e-8530-37a77d465a03">


**Accuracy**: The MMLU accuracy of `mistralai/Mixtral-8x7B-v0.1` with this PR is as follows:

```
|      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu              |N/A    |none  |     0|acc   |0.7018|±  |0.0036|
| - humanities     |N/A    |none  |     5|acc   |0.6472|±  |0.0065|
| - other          |N/A    |none  |     5|acc   |0.7673|±  |0.0072|
| - social_sciences|N/A    |none  |     5|acc   |0.8099|±  |0.0070|
| - stem           |N/A    |none  |     5|acc   |0.6131|±  |0.0083|
```
This compares favorably with the FP16 results, which are:
```
|      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu              |N/A    |none  |     0|acc   |0.7020|±  |0.1313|
| - humanities     |N/A    |none  |     5|acc   |0.6425|±  |0.1349|
| - other          |N/A    |none  |     5|acc   |0.7744|±  |0.1038|
| - social_sciences|N/A    |none  |     5|acc   |0.8131|±  |0.0695|
| - stem           |N/A    |none  |     5|acc   |0.6108|±  |0.1383|
```
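For context, MMLU tables in this format come from lm-evaluation-harness; a minimal sketch of how such numbers could be reproduced with its Python API follows (the exact harness arguments used for this PR are an assumption, including whether `quantization=fp8` is passed through to vLLM this way):

```python
from lm_eval import simple_evaluate

# Hypothetical invocation; adjust model_args to match your setup.
results = simple_evaluate(
    model="vllm",
    model_args="pretrained=mistralai/Mixtral-8x7B-v0.1,tensor_parallel_size=2,quantization=fp8",
    tasks=["mmlu"],
    num_fewshot=5,
)
print(results["results"])
```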

Happy hacking!
2024-04-24 01:18:23 +00:00
| Name | Last commit | Date |
|------|-------------|------|
| attention | Enable scaled FP8 (e4m3fn) KV cache on ROCm (AMD GPU) (#3290) | 2024-04-03 14:15:55 -07:00 |
| cpu | [Bugfix] Add kv_scale input parameter to CPU backend (#3840) | 2024-04-04 04:33:08 +00:00 |
| moe | Add fused top-K softmax kernel for MoE (#2769) | 2024-02-05 17:38:02 -08:00 |
| punica | [Kernel] Add punica dimension for Swallow-MS-7B LoRA (#4134) | 2024-04-17 10:02:45 -07:00 |
| quantization | [Kernel] FP8 support for MoE kernel / Mixtral (#4244) | 2024-04-24 01:18:23 +00:00 |
| activation_kernels.cu | Add kernel for GeGLU with approximate GELU (#3337) | 2024-03-12 22:06:17 -07:00 |
| cache_kernels.cu | Enable scaled FP8 (e4m3fn) KV cache on ROCm (AMD GPU) (#3290) | 2024-04-03 14:15:55 -07:00 |
| cache.h | Enable scaled FP8 (e4m3fn) KV cache on ROCm (AMD GPU) (#3290) | 2024-04-03 14:15:55 -07:00 |
| cuda_compat.h | [ROCM] Fix blockReduceSum to use correct warp counts for ROCm and CUDA (#3262) | 2024-03-10 15:27:45 -07:00 |
| cuda_utils_kernels.cu | [ROCm] add support to ROCm 6.0 and MI300 (#2274) | 2024-01-26 12:41:10 -08:00 |
| cuda_utils.h | [ROCm] add support to ROCm 6.0 and MI300 (#2274) | 2024-01-26 12:41:10 -08:00 |
| custom_all_reduce_test.cu | [BugFix] Some fixes for custom allreduce kernels (#2760) | 2024-03-21 23:02:58 -07:00 |
| custom_all_reduce.cu | [BugFix] Some fixes for custom allreduce kernels (#2760) | 2024-03-21 23:02:58 -07:00 |
| custom_all_reduce.cuh | [BugFix] Some fixes for custom allreduce kernels (#2760) | 2024-03-21 23:02:58 -07:00 |
| dispatch_utils.h | DeepseekMoE support with Fused MoE kernel (#2453) | 2024-01-29 21:19:48 -08:00 |
| layernorm_kernels.cu | [Hotfix][CI/Build][Kernel] CUDA 11.8 does not support layernorm optimizations (#3782) | 2024-04-08 14:31:02 -07:00 |
| moe_align_block_size_kernels.cu | [Bugfix] Make moe_align_block_size AMD-compatible (#3470) | 2024-03-18 11:26:24 -07:00 |
| ops.h | [Kernel] FP8 support for MoE kernel / Mixtral (#4244) | 2024-04-24 01:18:23 +00:00 |
| pos_encoding_kernels.cu | Add batched RoPE kernel (#3095) | 2024-03-13 13:45:26 -07:00 |
| pybind.cpp | [Kernel] FP8 support for MoE kernel / Mixtral (#4244) | 2024-04-24 01:18:23 +00:00 |
| reduction_utils.cuh | [Kernel] Layernorm performance optimization (#3662) | 2024-03-30 14:26:38 -07:00 |