This PR fixes the Marlin kernel crash on H100 that was reported in neuralmagic#187. The crash was caused by the inline PTX assembly that gave async_copy its streaming behavior. The fix is to use the more standard PTX for async_copy, without the fractional L2 cache policy for "evict_first". There is no performance difference between the standard async_copy PTX and the previous version.
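For illustration, here is a minimal sketch of what a "standard" async copy looks like in inline PTX: a plain `cp.async.cg.shared.global` with no L2 cache-policy operand. The helper names (`cp_async16`, `cp_async_commit`, `cp_async_wait_all`) and the toy kernel are assumptions made for this example, not the code changed in this PR.

```cuda
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: asynchronously copy 16 bytes from global to shared
// memory using the plain cp.async form (no L2 cache-hint operand).
__device__ inline void cp_async16(void* smem_ptr, const void* glob_ptr) {
  const int BYTES = 16;
  uint32_t smem = static_cast<uint32_t>(__cvta_generic_to_shared(smem_ptr));
  asm volatile("cp.async.cg.shared.global [%0], [%1], %2;\n"
               :: "r"(smem), "l"(glob_ptr), "n"(BYTES)
               : "memory");
}

// Close the current group of outstanding async copies.
__device__ inline void cp_async_commit() {
  asm volatile("cp.async.commit_group;\n" ::: "memory");
}

// Block until every committed group has completed.
__device__ inline void cp_async_wait_all() {
  asm volatile("cp.async.wait_group 0;\n" ::: "memory");
}

// Toy kernel: each thread stages one int4 (16 bytes) through shared memory.
__global__ void copy_kernel(const int4* in, int4* out) {
  __shared__ int4 tile[32];
  cp_async16(&tile[threadIdx.x], &in[threadIdx.x]);
  cp_async_commit();
  cp_async_wait_all();
  __syncthreads();
  out[threadIdx.x] = tile[threadIdx.x];
}

int main() {
  const int n = 32;
  int4 h_in[n], h_out[n];
  for (int i = 0; i < n; ++i) h_in[i] = make_int4(i, i + 1, i + 2, i + 3);

  int4 *d_in, *d_out;
  cudaMalloc(&d_in, sizeof(h_in));
  cudaMalloc(&d_out, sizeof(h_out));
  cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);

  copy_kernel<<<1, n>>>(d_in, d_out);
  cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);

  // Verify that the staged data round-tripped unchanged.
  bool ok = true;
  for (int i = 0; i < n; ++i)
    ok = ok && h_out[i].x == h_in[i].x && h_out[i].w == h_in[i].w;
  printf("%s\n", ok ? "copy ok" : "copy mismatch");

  cudaFree(d_in);
  cudaFree(d_out);
  return ok ? 0 : 1;
}
```

The `cp.async` instructions require compute capability 8.0 or newer, so the sketch has to be built for such a target (for example `nvcc -arch=sm_80`). Both pointers must be 16-byte aligned, which `int4` storage and `cudaMalloc` allocations satisfy here.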