Inspired by #5146, this PR improves the FP8 quantize kernel by vectorizing data transfer to better utilize memory bandwidth. A microbenchmark shows that the improved kernel achieves a 1.0x-1.5x speedup, especially when the hidden size is large. In detail, we applied 3 optimizations (a sketch combining them follows this list):

- Use the inverted scale so that most divisions become multiplications.
- Unroll the loop by a factor of 4 to improve ILP.
- Use 4-wide vectorized loads/stores to transfer data between HBM and SRAM.
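For illustration, here is a minimal CUDA sketch of how the three optimizations might fit together in a quantize kernel. The names (`scaled_fp8_quant_vec4`, `fp8x4`) are hypothetical, and the actual kernel in this PR may differ (for example, it may saturate values to the FP8 range before conversion):

```cuda
#include <cuda_fp8.h>
#include <cstdint>

// Hypothetical 4-wide output type so one 32-bit store writes 4 FP8 values.
struct __align__(4) fp8x4 {
  __nv_fp8_e4m3 x, y, z, w;
};

__global__ void scaled_fp8_quant_vec4(fp8x4* __restrict__ out,
                                      const float4* __restrict__ in,
                                      const float* __restrict__ scale,
                                      int64_t num_vecs) {
  // Optimization 1: read the scale once and invert it, turning the
  // per-element division into a multiplication.
  const float scale_inv = 1.0f / (*scale);

  const int64_t stride = (int64_t)blockDim.x * gridDim.x;
  for (int64_t i = (int64_t)blockIdx.x * blockDim.x + threadIdx.x;
       i < num_vecs; i += stride) {
    // Optimization 3: one vectorized 128-bit load moves 4 floats at a time.
    const float4 v = in[i];

    // Optimization 2: the 4 conversions are independent, so the unrolled
    // body exposes instruction-level parallelism.
    fp8x4 q;
    q.x = __nv_fp8_e4m3(v.x * scale_inv);
    q.y = __nv_fp8_e4m3(v.y * scale_inv);
    q.z = __nv_fp8_e4m3(v.z * scale_inv);
    q.w = __nv_fp8_e4m3(v.w * scale_inv);

    // One 32-bit store writes all 4 quantized values back to HBM.
    out[i] = q;
  }
}
```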