flash-attention/csrc/ft_attention
dan_the_3rd c3f2a632aa
[ft_attention] Fix for seqlen=8136 (#488)
When seqlen=8136, `smem_sz = 48840`, and launching the kernel returns an `invalid argument` CUDA error.

`48840 < 48 * 1024` (= 49152), but the request apparently still exceeds some limit.
Tested on A100
2023-08-28 10:00:22 -07:00
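One possible explanation for the puzzle in the commit message, offered here as an assumption rather than something the commit confirms: CUDA's default 48 KiB per-block budget covers static and dynamic shared memory together, so a dynamic request of 48840 bytes leaves very little room for any `__shared__` arrays declared inside the kernel. The sketch below only does the arithmetic; the helper names and the static-usage figures in the usage note are hypothetical and were not measured from this kernel.

```cpp
#include <cassert>
#include <cstddef>

// Default per-block shared-memory budget when the kernel has not opted in
// to a larger dynamic allocation: 48 KiB.
constexpr std::size_t kDefaultSmemLimit = 48 * 1024;  // 49152 bytes

// Bytes left for static (__shared__) arrays under the default budget,
// given the dynamic shared-memory size requested at launch.
constexpr std::size_t static_headroom(std::size_t dynamic_bytes) {
    return dynamic_bytes >= kDefaultSmemLimit
               ? 0
               : kDefaultSmemLimit - dynamic_bytes;
}

// True when static + dynamic usage together exceed the default budget.
// `static_bytes` is a hypothetical input: the kernel's real static usage
// is not stated in the commit message.
constexpr bool exceeds_default_limit(std::size_t dynamic_bytes,
                                     std::size_t static_bytes) {
    return dynamic_bytes + static_bytes > kDefaultSmemLimit;
}
```

With `smem_sz = 48840`, `static_headroom(48840)` is 312 bytes, so under this assumption any kernel with more than 312 bytes of static shared memory would go over the default budget even though the dynamic request alone is below 48 KiB. Kernels that need more than the default generally opt in through `cudaFuncSetAttribute` with `cudaFuncAttributeMaxDynamicSharedMemorySize`.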
| File | Last commit | Date |
| --- | --- | --- |
| cuda_bf16_fallbacks.cuh | [Gen] Add kernel from FasterTransformer for benchmarking | 2023-01-03 17:37:43 -08:00 |
| cuda_bf16_wrapper.h | [Gen] Add kernel from FasterTransformer for benchmarking | 2023-01-03 17:37:43 -08:00 |
| decoder_masked_multihead_attention_template.hpp | [FT] Implement MQA/GQA | 2023-07-22 23:47:01 -07:00 |
| decoder_masked_multihead_attention_utils.h | [FT] rotary_cos/sin should have shape (dim) instead of (seqlen, dim) | 2023-07-03 09:41:04 -07:00 |
| decoder_masked_multihead_attention.cu | [ft_attention] Fix for seqlen=8136 (#488) | 2023-08-28 10:00:22 -07:00 |
| decoder_masked_multihead_attention.h | [FT] Implement MQA/GQA | 2023-07-22 23:47:01 -07:00 |
| ft_attention.cpp | [FT] Implement MQA/GQA | 2023-07-22 23:47:01 -07:00 |
| README.md | [Gen] Add kernel from FasterTransformer for benchmarking | 2023-01-03 17:37:43 -08:00 |
| setup.py | Support H100 for other CUDA extensions | 2023-03-15 16:59:27 -07:00 |

Attention kernel from FasterTransformer

This CUDA extension wraps the single-query attention kernel from FasterTransformer v5.2.1 for benchmarking purposes.

cd csrc/ft_attention && pip install .