When `seqlen=8136`, `smem_sz = 48840`, and launching the kernel fails with an `invalid argument` CUDA error. `48840 < 48 * 1024` (49152), so the request is nominally under the 48 KB limit, yet it apparently still exceeds it somehow. Tested on an A100.

| Name | | |
|---|---|---|
| cuda_bf16_fallbacks.cuh | ||
| cuda_bf16_wrapper.h | ||
| decoder_masked_multihead_attention_template.hpp | ||
| decoder_masked_multihead_attention_utils.h | ||
| decoder_masked_multihead_attention.cu | ||
| decoder_masked_multihead_attention.h | ||
| ft_attention.cpp | ||
| README.md | ||
| setup.py | ||
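
One possible explanation for the `invalid argument` error described at the top of this listing (not confirmed by the source) is that the default 48 KB per-block limit covers static `__shared__` arrays and the dynamic request together, so a dynamic `smem_sz` just under the limit leaves very little room. The sketch below only does the arithmetic. The helper names and the 1024-byte static figure are hypothetical; the kernel's real static usage is not stated here.

```cpp
#include <cassert>
#include <cstddef>

// Default per-block shared memory budget when the kernel has not opted in
// to a larger dynamic allocation (48 KB).
constexpr std::size_t kDefaultSmemLimit = 48 * 1024;  // 49152 bytes

// Bytes left for static __shared__ arrays once the dynamic request is
// counted. Returns 0 if the dynamic request alone exceeds the budget.
constexpr std::size_t static_headroom(std::size_t dynamic_bytes) {
    return dynamic_bytes > kDefaultSmemLimit ? 0
                                             : kDefaultSmemLimit - dynamic_bytes;
}

// True if static + dynamic shared memory fit in the default budget.
constexpr bool fits_default_limit(std::size_t dynamic_bytes,
                                  std::size_t static_bytes) {
    return dynamic_bytes + static_bytes <= kDefaultSmemLimit;
}

// Checks for the reported case (smem_sz = 48840 at seqlen = 8136).
inline void check_reported_case() {
    const std::size_t smem_sz = 48840;             // value from the report
    const std::size_t hypothetical_static = 1024;  // ASSUMPTION, not measured

    assert(static_headroom(smem_sz) == 312);       // 49152 - 48840
    assert(fits_default_limit(smem_sz, 0));        // dynamic alone fits
    assert(!fits_default_limit(smem_sz, hypothetical_static));
}
```

With only 312 bytes of headroom, even a small static buffer would push the total over the default budget. If that is what happens here, the kernel's actual static usage can be read from the `sharedSizeBytes` field filled in by `cudaFuncGetAttributes`, and a larger dynamic allocation can be requested with `cudaFuncSetAttribute` and `cudaFuncAttributeMaxDynamicSharedMemorySize` before the launch.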