flash-attention/csrc
Latest commit 557781933d by Tri Dao, 2022-11-05 16:26:17 -07:00
    Parallelize CUDA bwd along seqlen_k instead of seqlen_q

    This is faster since we only need to do atomic adds on dq, instead of
    atomic adds on both dk and dv.
flash_attn      Parallelize CUDA bwd along seqlen_k instead of seqlen_q    2022-11-05 16:26:17 -07:00
fused_softmax   Add Megatron attention implementation for benchmarking     2022-10-23 23:04:16 -07:00
rotary          Implement rotary embedding in CUDA                         2022-11-04 22:42:01 -07:00
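
The commit message above reasons about which gradient tensors need atomic
accumulation. The following is a minimal CUDA sketch of that memory-access
pattern only; all names are hypothetical and it is not the actual
FlashAttention kernel (which works on tiles in shared memory and recomputes
the attention probabilities). When each thread block owns a slice of
seqlen_k and loops over the queries, it is the sole writer of its dk/dv
rows, but its partial dq contributions overlap with every other key block
and must be combined with atomicAdd.

    // Sketch only: one block per key position, one thread per feature.
    // Real kernels tile both sequence dimensions and recompute P and dS;
    // here they are placeholders so only the write pattern is shown.
    __global__ void attn_bwd_dkdv_parallel_sketch(
            const float* __restrict__ q,    // [seqlen_q, d]
            const float* __restrict__ k,    // [seqlen_k, d]
            const float* __restrict__ dout, // [seqlen_q, d]
            float* __restrict__ dq,         // [seqlen_q, d] - atomic adds
            float* __restrict__ dk,         // [seqlen_k, d] - plain stores
            float* __restrict__ dv,         // [seqlen_k, d] - plain stores
            int seqlen_q, int seqlen_k, int d) {
        int kn = blockIdx.x;   // key position owned by this block
        int c  = threadIdx.x;  // feature index
        if (kn >= seqlen_k || c >= d) return;

        float dk_acc = 0.f, dv_acc = 0.f;
        for (int qm = 0; qm < seqlen_q; ++qm) {
            float p  = 1.0f;  // placeholder for attention prob P[qm][kn]
            float dp = 1.0f;  // placeholder for softmax gradient dS[qm][kn]
            dv_acc += p  * dout[qm * d + c];
            dk_acc += dp * q[qm * d + c];
            // Every key block contributes to the same dq rows -> atomics.
            atomicAdd(&dq[qm * d + c], dp * k[kn * d + c]);
        }
        // This block is the only writer for its key position: no atomics.
        dk[kn * d + c] = dk_acc;
        dv[kn * d + c] = dv_acc;
    }

    int main() {
        const int seqlen_q = 128, seqlen_k = 256, d = 64;
        float *q, *k, *dout, *dq, *dk, *dv;
        cudaMalloc(&q,    seqlen_q * d * sizeof(float));
        cudaMalloc(&k,    seqlen_k * d * sizeof(float));
        cudaMalloc(&dout, seqlen_q * d * sizeof(float));
        cudaMalloc(&dq,   seqlen_q * d * sizeof(float));
        cudaMalloc(&dk,   seqlen_k * d * sizeof(float));
        cudaMalloc(&dv,   seqlen_k * d * sizeof(float));
        cudaMemset(dq, 0, seqlen_q * d * sizeof(float)); // dq is accumulated
        attn_bwd_dkdv_parallel_sketch<<<seqlen_k, d>>>(
            q, k, dout, dq, dk, dv, seqlen_q, seqlen_k, d);
        cudaDeviceSynchronize();
        return 0;
    }

With the opposite choice, parallelizing along seqlen_q, each block would own
a slice of queries and write dq directly, but both dk and dv would then need
atomic adds, which is the extra cost the commit avoids.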