flash-attention/tests
Latest commit: Tri Dao, 557781933d, 2022-11-05 16:26:17 -07:00
Parallelize CUDA bwd along seqlen_k instead of seqlen_q
This is faster since we only need to do atomic adds on dq, instead of atomic adds on both dk and dv.
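The commit message motivates the parallelization choice: when the backward pass is parallelized over key blocks, each worker owns its slice of dk and dv and only the shared dq needs atomic accumulation. Below is a minimal NumPy sketch (hypothetical, not the repo's CUDA kernel; the block size, variable names, and the explicit precomputation of the softmax matrix P and of D = rowsum(dO * O) are assumptions for illustration) showing which accumulations would require atomic adds under this scheme.

```python
import numpy as np

np.random.seed(0)
seqlen_q, seqlen_k, d, blk = 8, 12, 4, 4   # illustrative sizes, assumed

Q = np.random.randn(seqlen_q, d)
K = np.random.randn(seqlen_k, d)
V = np.random.randn(seqlen_k, d)
dO = np.random.randn(seqlen_q, d)

# Reference forward/backward for O = softmax(Q K^T) V.
S = Q @ K.T
P = np.exp(S - S.max(axis=-1, keepdims=True))
P /= P.sum(axis=-1, keepdims=True)
O = P @ V
D = (dO * O).sum(axis=-1, keepdims=True)        # rowsum(dO * O)
dS_ref = P * (dO @ V.T - D)
dQ_ref, dK_ref, dV_ref = dS_ref @ K, dS_ref.T @ Q, P.T @ dO

# Block-wise backward, "parallelized" over key blocks j: the body of the
# outer loop is what each independent worker (CUDA thread block) would run.
dQ, dK, dV = np.zeros_like(Q), np.zeros_like(K), np.zeros_like(V)
for j in range(0, seqlen_k, blk):               # one worker per key block
    Kj, Vj = K[j:j+blk], V[j:j+blk]
    for i in range(0, seqlen_q, blk):           # sweep over query blocks
        Pij = P[i:i+blk, j:j+blk]
        dSij = Pij * (dO[i:i+blk] @ Vj.T - D[i:i+blk])
        dK[j:j+blk] += dSij.T @ Q[i:i+blk]      # exclusive to worker j: plain add
        dV[j:j+blk] += Pij.T @ dO[i:i+blk]      # exclusive to worker j: plain add
        dQ[i:i+blk] += dSij @ Kj                # shared across workers: atomic add in CUDA

assert np.allclose(dQ, dQ_ref)
assert np.allclose(dK, dK_ref)
assert np.allclose(dV, dV_ref)
```

Parallelizing over query blocks would flip the ownership: dq would be exclusive per worker, but both dk and dv would be shared and need atomic accumulation, which is the extra cost this commit avoids.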
test_flash_attn.py   Parallelize CUDA bwd along seqlen_k instead of seqlen_q   2022-11-05 16:26:17 -07:00
test_rotary.py       Implement rotary embedding in CUDA                        2022-11-04 22:42:01 -07:00