Follow-on to #4332 to enable FP8 checkpoint loading for Mixtral; supersedes #4436.

This PR enables the following checkpoint loading features for Mixtral:

- Supports loading FP8 checkpoints for Mixtral, such as the "nm-testing/Mixtral-8x7B-Instruct-v0.1-FP8" test model
- Supports static or dynamic activation quantization with static weight quantization (all per-tensor)
- Supports different scales for each expert weight
- Supports FP8 in the QKV layer

Notes:

- The expert gate/router always runs at half/full precision for now.
- If the weight scales differ across the separate Q, K, and V weights, they are re-quantized using `layer.weight_scale.max()` so a single GEMM can be used for performance (see the sketch below).
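The re-quantization step can be illustrated roughly as follows. This is a minimal sketch, not vLLM's actual implementation: the function name `requantize_qkv_to_max_scale` and its signature are hypothetical, and it assumes FP8 E4M3 weights with one per-tensor scale per Q/K/V shard.

```python
# Sketch: re-quantize separate Q/K/V FP8 shards to a single shared scale
# (the max of the per-shard scales) so the fused QKV projection can run as
# one FP8 GEMM. Names are illustrative, not vLLM's API.
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max


def requantize_qkv_to_max_scale(
    qkv_weight_fp8: torch.Tensor,   # fused FP8 weight, rows = q_size + k_size + v_size
    shard_scales: torch.Tensor,     # per-shard weight scales, shape [3]
    shard_sizes: list[int],         # output rows of Q, K, and V respectively
) -> tuple[torch.Tensor, torch.Tensor]:
    """Re-quantize each shard from its own scale to the shared max scale."""
    max_scale = shard_scales.max()
    start = 0
    for shard_id, size in enumerate(shard_sizes):
        shard = qkv_weight_fp8[start:start + size]
        # Dequantize with the shard's original scale, then re-quantize with
        # the shared max scale, clamping to the FP8 representable range.
        dequant = shard.to(torch.float32) * shard_scales[shard_id]
        requant = (dequant / max_scale).clamp(-FP8_MAX, FP8_MAX)
        qkv_weight_fp8[start:start + size] = requant.to(torch.float8_e4m3fn)
        start += size
    return qkv_weight_fp8, max_scale
```

Using the max scale keeps all three shards representable after re-quantization (values only shrink, never overflow), at the cost of slightly coarser resolution for the shards that originally had smaller scales.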