Follow-on to #4332 to enable FP8 checkpoint loading for Mixtral; supersedes #4436.

This PR enables the following checkpoint loading features for Mixtral:

- Supports loading FP8 checkpoints for Mixtral, such as the "nm-testing/Mixtral-8x7B-Instruct-v0.1-FP8" test model
- Supports static or dynamic activation quantization with static weight quantization (all per-tensor)
- Supports different scales for each expert weight
- Supports FP8 in the QKV layer

Notes:

- The expert gate/router always runs at half/full precision for now.
- If the weight scales differ across the separate Q, K, and V weights, they are re-quantized using `layer.weight_scale.max()` so a single GEMM can be used for performance (see the sketch below).
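The re-quantization step can be illustrated roughly as follows. This is a minimal sketch, not vLLM's actual implementation: the function name `requantize_qkv_to_max_scale` and its signature are hypothetical, and it assumes FP8 E4M3 weights with one per-tensor scale per Q/K/V shard.

```python
# Sketch: re-quantize separate Q/K/V FP8 shards to a single shared scale
# (the max of the per-shard scales) so the fused QKV projection can run as
# one FP8 GEMM. Names are illustrative, not vLLM's API.
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max


def requantize_qkv_to_max_scale(
    qkv_weight_fp8: torch.Tensor,   # fused FP8 weight, rows = q_size + k_size + v_size
    shard_scales: torch.Tensor,     # per-shard weight scales, shape [3]
    shard_sizes: list[int],         # output rows of Q, K, and V respectively
) -> tuple[torch.Tensor, torch.Tensor]:
    """Re-quantize each shard from its own scale to the shared max scale."""
    max_scale = shard_scales.max()
    start = 0
    for shard_id, size in enumerate(shard_sizes):
        shard = qkv_weight_fp8[start:start + size]
        # Dequantize with the shard's original scale, then re-quantize with
        # the shared max scale, clamping to the FP8 representable range.
        dequant = shard.to(torch.float32) * shard_scales[shard_id]
        requant = (dequant / max_scale).clamp(-FP8_MAX, FP8_MAX)
        qkv_weight_fp8[start:start + size] = requant.to(torch.float8_e4m3fn)
        start += size
    return qkv_weight_fp8, max_scale
```

Using the max scale keeps all three shards representable after re-quantization (values only shrink, never overflow), at the cost of slightly coarser resolution for the shards that originally had smaller scales.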