vllm/fp8 at 1696efe6c91a82e1aca5b49f4bc7899802115981 - vllm


Cody Yu 5985e3427d [Kernel] Vectorized FP8 quantize kernel (#5396)		2024-06-12 14:07:26 -07:00

Inspired by #5146, this PR improves the FP8 quantize kernel by vectorizing data transfer to better utilize memory bandwidth. A microbenchmark shows that the improved kernel achieves a 1.0x-1.5x speedup, with the largest gains when the hidden size is large. In detail, we applied three optimizations:

- Use an inverted scale so that most divisions become multiplications.
- Unroll the loop 4 times to improve ILP.
- Use 4-element vectorized accesses to transfer data between HBM and SRAM.
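The optimizations named in the commit message can be illustrated with a small host-side C++ sketch. This is not the kernel from `common.cu`: it runs on the CPU, and `Float4`, `scale_and_clamp`, and `quantize_sketch` are hypothetical names chosen for illustration. It assumes the target is FP8 E4M3, whose largest finite magnitude is 448, and it stops at the scaled, saturated float value rather than narrowing to an 8-bit encoding, since that step depends on a device-side FP8 type.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Largest finite magnitude of FP8 E4M3 (assumed target format).
constexpr float kFp8E4m3Max = 448.0f;

// Hypothetical 4-wide pack, standing in for a 4-element vectorized load/store.
struct Float4 {
  float x, y, z, w;
};

// Multiply by the precomputed reciprocal, then saturate to the FP8 range.
inline float scale_and_clamp(float v, float inv_scale) {
  return std::max(-kFp8E4m3Max, std::min(v * inv_scale, kFp8E4m3Max));
}

std::vector<float> quantize_sketch(const std::vector<float>& in, float scale) {
  assert(scale > 0.0f);
  // Inverted scale: one division here, multiplications for every element.
  const float inv_scale = 1.0f / scale;
  std::vector<float> out(in.size());

  // Main loop: handle four elements per iteration (unrolled, 4-wide pack).
  const std::size_t n4 = in.size() / 4;
  for (std::size_t i = 0; i < n4; ++i) {
    const Float4 v{in[4 * i], in[4 * i + 1], in[4 * i + 2], in[4 * i + 3]};
    const Float4 q{scale_and_clamp(v.x, inv_scale),
                   scale_and_clamp(v.y, inv_scale),
                   scale_and_clamp(v.z, inv_scale),
                   scale_and_clamp(v.w, inv_scale)};
    out[4 * i] = q.x;
    out[4 * i + 1] = q.y;
    out[4 * i + 2] = q.z;
    out[4 * i + 3] = q.w;
  }

  // Tail loop: elements left over when the size is not a multiple of 4.
  for (std::size_t i = 4 * n4; i < in.size(); ++i) {
    out[i] = scale_and_clamp(in[i], inv_scale);
  }
  return out;
}
```

On a GPU, the 4-wide pack would map to a single wide memory transaction per thread, which is where the bandwidth benefit described in the commit message would come from. The CPU version above only mirrors the control flow.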
..
amd	[CI/Build] Enforce style for C++ and CUDA code with `clang-format` (#4722)	2024-05-22 07:18:41 +00:00
nvidia	[CI/Build] Enforce style for C++ and CUDA code with `clang-format` (#4722)	2024-05-22 07:18:41 +00:00
common.cu	[Kernel] Vectorized FP8 quantize kernel (#5396)	2024-06-12 14:07:26 -07:00