Files in `csrc/ft_attention` (last commit shown):

| File | Last commit | Date |
| --- | --- | --- |
| cuda_bf16_fallbacks.cuh | [Gen] Add kernel from FasterTransformer for benchmarking | 2023-01-03 |
| cuda_bf16_wrapper.h | [Gen] Add kernel from FasterTransformer for benchmarking | 2023-01-03 |
| decoder_masked_multihead_attention_template.hpp | [FT] Implement MQA/GQA | 2023-07-22 |
| decoder_masked_multihead_attention_utils.h | [FT] rotary_cos/sin should have shape (dim) instead of (seqlen, dim) | 2023-07-03 |
| decoder_masked_multihead_attention.cu | [ft_attention] Fix for seqlen=8136 (#488) | 2023-08-28 |
| decoder_masked_multihead_attention.h | [FT] Implement MQA/GQA | 2023-07-22 |
| ft_attention.cpp | Implement rotary embedding in flash_attn_with_kvcache | 2023-09-16 |
| README.md | [Gen] Don't use ft_attention, use flash_attn_with_kvcache instead | 2023-09-18 |
| setup.py | Make nvcc threads configurable via environment variable (#885) | 2024-03-13 |

Attention kernel from FasterTransformer

This CUDA extension wraps the single-query attention kernel from [FasterTransformer](https://github.com/NVIDIA/FasterTransformer) v5.2.1 for benchmarking purposes.
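
To make the operation concrete, here is a small pure-PyTorch reference of the single-query attention the kernel computes: one new query token per sequence attending over a key/value cache. The tensor names and shapes below are illustrative only and do not reflect the extension's actual calling convention (see ft_attention.cpp for that).

```python
# Pure-PyTorch reference of single-query ("decoding step") attention over a
# KV cache -- the operation the wrapped FasterTransformer kernel implements.
# Shapes and names are illustrative, not the extension's calling convention.
import math
import torch

batch, nheads, headdim, cache_len = 2, 16, 64, 512

# One new query token per sequence; keys/values for all previous tokens.
q = torch.randn(batch, nheads, headdim)                   # (B, H, D)
k_cache = torch.randn(batch, nheads, cache_len, headdim)  # (B, H, L, D)
v_cache = torch.randn(batch, nheads, cache_len, headdim)  # (B, H, L, D)

scores = torch.einsum("bhd,bhld->bhl", q, k_cache) / math.sqrt(headdim)
probs = torch.softmax(scores, dim=-1)
out = torch.einsum("bhl,bhld->bhd", probs, v_cache)       # (B, H, D)
print(out.shape)  # torch.Size([2, 16, 64])
```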

```sh
cd csrc/ft_attention && pip install .
```

As of 2023-09-17, this extension is no longer used in the FlashAttention repo: FlashAttention now implements `flash_attn_with_kvcache`, which provides all the features of this `ft_attention` kernel (and more).
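
For comparison, here is a minimal usage sketch of the recommended replacement. The sizes are illustrative, and the full argument list (rotary embeddings, appending new keys/values to the cache in place, etc.) is described in the docstring in flash_attn/flash_attn_interface.py.

```python
# Minimal sketch of flash_attn_with_kvcache for a single decoding step.
# Sizes are illustrative; requires a CUDA GPU and fp16/bf16 tensors.
import torch
from flash_attn import flash_attn_with_kvcache

batch, nheads, headdim = 2, 16, 64
cache_len, used_len = 512, 300  # allocated cache length vs. tokens already filled

q = torch.randn(batch, 1, nheads, headdim, device="cuda", dtype=torch.float16)
k_cache = torch.randn(batch, cache_len, nheads, headdim, device="cuda", dtype=torch.float16)
v_cache = torch.randn(batch, cache_len, nheads, headdim, device="cuda", dtype=torch.float16)
cache_seqlens = torch.full((batch,), used_len, dtype=torch.int32, device="cuda")

# Attend the single new query token to the first `cache_seqlens` cached positions.
out = flash_attn_with_kvcache(q, k_cache, v_cache, cache_seqlens=cache_seqlens, causal=True)
print(out.shape)  # (batch, 1, nheads, headdim)
```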