flash-attention/tests
Latest commit: Tri Dao, 557781933d, 2022-11-05 16:26:17 -07:00
Parallelize CUDA bwd along seqlen_k instead of seqlen_q
This is faster since we only need to do atomic adds on dq, instead of atomic adds on both dk and dv.
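The commit message motivates the parallelization choice: when the backward pass is parallelized over key blocks, each worker owns its slice of dk and dv and only the shared dq needs atomic accumulation. Below is a minimal NumPy sketch (hypothetical, not the repo's CUDA kernel; the block size, variable names, and the explicit precomputation of the softmax matrix P and of D = rowsum(dO * O) are assumptions for illustration) showing which accumulations would require atomic adds under this scheme.

```python
import numpy as np

np.random.seed(0)
seqlen_q, seqlen_k, d, blk = 8, 12, 4, 4   # illustrative sizes, assumed

Q = np.random.randn(seqlen_q, d)
K = np.random.randn(seqlen_k, d)
V = np.random.randn(seqlen_k, d)
dO = np.random.randn(seqlen_q, d)

# Reference forward/backward for O = softmax(Q K^T) V.
S = Q @ K.T
P = np.exp(S - S.max(axis=-1, keepdims=True))
P /= P.sum(axis=-1, keepdims=True)
O = P @ V
D = (dO * O).sum(axis=-1, keepdims=True)        # rowsum(dO * O)
dS_ref = P * (dO @ V.T - D)
dQ_ref, dK_ref, dV_ref = dS_ref @ K, dS_ref.T @ Q, P.T @ dO

# Block-wise backward, "parallelized" over key blocks j: the body of the
# outer loop is what each independent worker (CUDA thread block) would run.
dQ, dK, dV = np.zeros_like(Q), np.zeros_like(K), np.zeros_like(V)
for j in range(0, seqlen_k, blk):               # one worker per key block
    Kj, Vj = K[j:j+blk], V[j:j+blk]
    for i in range(0, seqlen_q, blk):           # sweep over query blocks
        Pij = P[i:i+blk, j:j+blk]
        dSij = Pij * (dO[i:i+blk] @ Vj.T - D[i:i+blk])
        dK[j:j+blk] += dSij.T @ Q[i:i+blk]      # exclusive to worker j: plain add
        dV[j:j+blk] += Pij.T @ dO[i:i+blk]      # exclusive to worker j: plain add
        dQ[i:i+blk] += dSij @ Kj                # shared across workers: atomic add in CUDA

assert np.allclose(dQ, dQ_ref)
assert np.allclose(dK, dK_ref)
assert np.allclose(dV, dV_ref)
```

Parallelizing over query blocks would flip the ownership: dq would be exclusive per worker, but both dk and dv would be shared and need atomic accumulation, which is the extra cost this commit avoids.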
test_flash_attn.py   Parallelize CUDA bwd along seqlen_k instead of seqlen_q   2022-11-05 16:26:17 -07:00
test_rotary.py       Implement rotary embedding in CUDA                        2022-11-04 22:42:01 -07:00