flash-attention/csrc
Latest commit 557781933d by Tri Dao, 2022-11-05 16:26:17 -07:00
    Parallelize CUDA bwd along seqlen_k instead of seqlen_q

    This is faster since we only need to do atomic adds on dq, instead of
    atomic adds on both dk and dv.
flash_attn      Parallelize CUDA bwd along seqlen_k instead of seqlen_q    2022-11-05 16:26:17 -07:00
fused_softmax   Add Megatron attention implementation for benchmarking     2022-10-23 23:04:16 -07:00
rotary          Implement rotary embedding in CUDA                         2022-11-04 22:42:01 -07:00
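
The commit message above reasons about which gradient tensors need atomic
accumulation. The following is a minimal CUDA sketch of that memory-access
pattern only; all names are hypothetical and it is not the actual
FlashAttention kernel (which works on tiles in shared memory and recomputes
the attention probabilities). When each thread block owns a slice of
seqlen_k and loops over the queries, it is the sole writer of its dk/dv
rows, but its partial dq contributions overlap with every other key block
and must be combined with atomicAdd.

    // Sketch only: one block per key position, one thread per feature.
    // Real kernels tile both sequence dimensions and recompute P and dS;
    // here they are placeholders so only the write pattern is shown.
    __global__ void attn_bwd_dkdv_parallel_sketch(
            const float* __restrict__ q,    // [seqlen_q, d]
            const float* __restrict__ k,    // [seqlen_k, d]
            const float* __restrict__ dout, // [seqlen_q, d]
            float* __restrict__ dq,         // [seqlen_q, d] - atomic adds
            float* __restrict__ dk,         // [seqlen_k, d] - plain stores
            float* __restrict__ dv,         // [seqlen_k, d] - plain stores
            int seqlen_q, int seqlen_k, int d) {
        int kn = blockIdx.x;   // key position owned by this block
        int c  = threadIdx.x;  // feature index
        if (kn >= seqlen_k || c >= d) return;

        float dk_acc = 0.f, dv_acc = 0.f;
        for (int qm = 0; qm < seqlen_q; ++qm) {
            float p  = 1.0f;  // placeholder for attention prob P[qm][kn]
            float dp = 1.0f;  // placeholder for softmax gradient dS[qm][kn]
            dv_acc += p  * dout[qm * d + c];
            dk_acc += dp * q[qm * d + c];
            // Every key block contributes to the same dq rows -> atomics.
            atomicAdd(&dq[qm * d + c], dp * k[kn * d + c]);
        }
        // This block is the only writer for its key position: no atomics.
        dk[kn * d + c] = dk_acc;
        dv[kn * d + c] = dv_acc;
    }

    int main() {
        const int seqlen_q = 128, seqlen_k = 256, d = 64;
        float *q, *k, *dout, *dq, *dk, *dv;
        cudaMalloc(&q,    seqlen_q * d * sizeof(float));
        cudaMalloc(&k,    seqlen_k * d * sizeof(float));
        cudaMalloc(&dout, seqlen_q * d * sizeof(float));
        cudaMalloc(&dq,   seqlen_q * d * sizeof(float));
        cudaMalloc(&dk,   seqlen_k * d * sizeof(float));
        cudaMalloc(&dv,   seqlen_k * d * sizeof(float));
        cudaMemset(dq, 0, seqlen_q * d * sizeof(float)); // dq is accumulated
        attn_bwd_dkdv_parallel_sketch<<<seqlen_k, d>>>(
            q, k, dout, dq, dk, dv, seqlen_q, seqlen_k, d);
        cudaDeviceSynchronize();
        return 0;
    }

With the opposite choice, parallelizing along seqlen_q, each block would own
a slice of queries and write dq directly, but both dk and dv would then need
atomic adds, which is the extra cost the commit avoids.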