vllm/csrc
Philipp Moritz a98187cf72
[Kernel] Make static FP8 scaling more robust (#4570)
Previously, FP8 static scaling only worked if the scales overestimated the maxima of all activation tensors encountered during computation. However, this will not always be the case, even if the scales were calibrated very carefully. For example, with the activation scales in my checkpoint

https://huggingface.co/pcmoritz/Mixtral-8x7B-v0.1-fp8-act-scale

(which was calibrated on https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k), I get the following near-random performance on MMLU:

|      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu              |N/A    |none  |     0|acc   |0.2295|±  |0.0035|
| - humanities     |N/A    |none  |     5|acc   |0.2421|±  |0.0062|
| - other          |N/A    |none  |     5|acc   |0.2398|±  |0.0076|
| - social_sciences|N/A    |none  |     5|acc   |0.2171|±  |0.0074|
| - stem           |N/A    |none  |     5|acc   |0.2125|±  |0.0073|

With the fix in this PR, which clamps the scaled activations to the range [-std::numeric_limits<c10::Float8_e4m3fn>::max(), std::numeric_limits<c10::Float8_e4m3fn>::max()] so that out-of-range values saturate instead of turning into NaNs, the performance is:

|      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu              |N/A    |none  |     0|acc   |0.7008|±  |0.0036|
| - humanities     |N/A    |none  |     5|acc   |0.6453|±  |0.0065|
| - other          |N/A    |none  |     5|acc   |0.7692|±  |0.0072|
| - social_sciences|N/A    |none  |     5|acc   |0.8083|±  |0.0070|
| - stem           |N/A    |none  |     5|acc   |0.6115|±  |0.0083|

This is not perfect yet, but it gets very close to the FP16 / dynamic activation scaling performance.
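
To make the idea concrete, here is a minimal sketch of the clamping step, assuming a plain float input and a single static scale; the helper and kernel names (`scaled_fp8_convert`, `static_scaled_fp8_quant_sketch`) are illustrative, not the exact code in this PR:

```cuda
#include <c10/util/Float8_e4m3fn.h>
#include <limits>

// Clamp the scaled value into the representable FP8 E4M3 range before the
// narrowing conversion, so activations that exceed the calibrated scale
// saturate to +/- max instead of overflowing to NaN.
__device__ __forceinline__ c10::Float8_e4m3fn scaled_fp8_convert(
    const float val, const float scale) {
  const float fp8_max =
      static_cast<float>(std::numeric_limits<c10::Float8_e4m3fn>::max());
  const float x = val / scale;
  const float clamped = fminf(fmaxf(x, -fp8_max), fp8_max);
  return static_cast<c10::Float8_e4m3fn>(clamped);
}

// Illustrative elementwise quantization kernel using the helper above.
__global__ void static_scaled_fp8_quant_sketch(
    c10::Float8_e4m3fn* __restrict__ out, const float* __restrict__ in,
    const float scale, const int num_elems) {
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < num_elems) {
    out[i] = scaled_fp8_convert(in[i], scale);
  }
}
```

Saturating at the finite FP8 maximum, rather than letting the conversion overflow, is what keeps a slightly-too-small static scale from poisoning downstream matmuls with NaNs.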
2024-05-06 17:39:28 -07:00

| Name | Last commit | Date |
|------|-------------|------|
| attention | [Core][Model runner refactoring 1/N] Refactor attn metadata term (#4518) | 2024-05-03 10:20:12 -07:00 |
| cpu | [Core][Model runner refactoring 1/N] Refactor attn metadata term (#4518) | 2024-05-03 10:20:12 -07:00 |
| moe | Add fused top-K softmax kernel for MoE (#2769) | 2024-02-05 17:38:02 -08:00 |
| punica | [Kernel] Full Tensor Parallelism for LoRA Layers (#3524) | 2024-04-27 00:03:48 -07:00 |
| quantization | [Kernel] Make static FP8 scaling more robust (#4570) | 2024-05-06 17:39:28 -07:00 |
| activation_kernels.cu | Add kernel for GeGLU with approximate GELU (#3337) | 2024-03-12 22:06:17 -07:00 |
| cache_kernels.cu | [Kernel] Use flashinfer for decoding (#4353) | 2024-05-03 15:51:27 -07:00 |
| cache.h | [Kernel] Use flashinfer for decoding (#4353) | 2024-05-03 15:51:27 -07:00 |
| cuda_compat.h | [ROCM] Fix blockReduceSum to use correct warp counts for ROCm and CUDA (#3262) | 2024-03-10 15:27:45 -07:00 |
| cuda_utils_kernels.cu | [ROCm] add support to ROCm 6.0 and MI300 (#2274) | 2024-01-26 12:41:10 -08:00 |
| cuda_utils.h | [ROCm] add support to ROCm 6.0 and MI300 (#2274) | 2024-01-26 12:41:10 -08:00 |
| custom_all_reduce_test.cu | [BugFix] Some fixes for custom allreduce kernels (#2760) | 2024-03-21 23:02:58 -07:00 |
| custom_all_reduce.cu | [BugFix] Some fixes for custom allreduce kernels (#2760) | 2024-03-21 23:02:58 -07:00 |
| custom_all_reduce.cuh | [BugFix] Some fixes for custom allreduce kernels (#2760) | 2024-03-21 23:02:58 -07:00 |
| dispatch_utils.h | DeepseekMoE support with Fused MoE kernel (#2453) | 2024-01-29 21:19:48 -08:00 |
| layernorm_kernels.cu | [Hotfix][CI/Build][Kernel] CUDA 11.8 does not support layernorm optimizations (#3782) | 2024-04-08 14:31:02 -07:00 |
| moe_align_block_size_kernels.cu | [Bugfix] Make moe_align_block_size AMD-compatible (#3470) | 2024-03-18 11:26:24 -07:00 |
| ops.h | [Core][Model runner refactoring 1/N] Refactor attn metadata term (#4518) | 2024-05-03 10:20:12 -07:00 |
| pos_encoding_kernels.cu | Add batched RoPE kernel (#3095) | 2024-03-13 13:45:26 -07:00 |
| pybind.cpp | [Kernel] Use flashinfer for decoding (#4353) | 2024-05-03 15:51:27 -07:00 |
| reduction_utils.cuh | [Kernel] Layernorm performance optimization (#3662) | 2024-03-30 14:26:38 -07:00 |