flash-attention

Sanghun Cho e4f726fc44 Support alibi, by Sanghun Cho from Kakao Brain	2023-12-19 22:56:06 -08:00

* hard-code alibi in fwd
* use params.h as hun_heads
* hard-code alibi in bwd
* add alibi on/off option
* compute alibi_start, ratio outside of kernels
* fix minor merge conflict
* add test_alibi.py
* change apply_alibi() location before masking
* add alibi in splitkv kernel
* fix backward func # of returns
* add out-of-bound check in apply_alibi()
* update test_alibi.py
* update test_alibi.py for kvcache
* simplify alibi parameter interface
* fix performance issue by computing alibi outside of branch
* update test_flash_attn_varlen_func() for left padding
* implement alibi_slopes (b, nh) loading
* optimize apply_alibi() a bit
* update test cases for alibi_slopes loading
* reflect stylistic comments
* disable "seqlenq_ngroups_swapped" when using alibi

Co-authored-by: monk.detective <monk.detective@kakaobrain.com>
cutlass@44c704eae8	Update cutlass to 3.2.2	2023-11-19 21:43:48 -08:00
flash_attn	Support alibi, by Sanghun Cho from Kakao Brain	2023-12-19 22:56:06 -08:00
ft_attention	[Gen] Don't use ft_attention, use flash_attn_with_kvcache instead	2023-09-18 15:29:06 -07:00
fused_dense_lib	[FusedDense] Allocate lt_workspace on input device	2023-05-30 14:17:26 -07:00
fused_softmax	Add Megatron attention implementation for benchmarking	2022-10-23 23:04:16 -07:00
layer_norm	Fix random state for dropout_layer_norm (#315)	2023-07-23 15:05:13 -07:00
rotary	Support H100 for other CUDA extensions	2023-03-15 16:59:27 -07:00
xentropy	[CE] Implement CrossEntropyLoss in Triton	2023-09-15 20:05:28 -07:00