flash-attention/csrc/flash_attn/src/fmha
Latest commit: 557781933d by Tri Dao, 2022-11-05 16:26:17 -07:00
Parallelize CUDA bwd along seqlen_k instead of seqlen_q. This is faster since we only need to do atomic adds on dq, instead of atomic adds on both dk and dv.
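For intuition, here is a minimal, hypothetical CUDA sketch of the trade-off that commit describes: when each thread block owns a tile of K/V rows (parallelizing along seqlen_k), dk and dv can be accumulated locally, and only dq, which every K/V block contributes to, needs atomicAdd. The kernel name, signature, and the simplification to plain float with a precomputed dS are illustrative only, not the repository's actual fmha kernels.

// Minimal sketch (hypothetical, not the repo's fmha kernels): backward pass
// parallelized over seqlen_k. Each block owns one K/V row, so dk (and,
// analogously, dv) accumulates locally without contention; only dq needs
// atomicAdd, since every K/V block contributes to it.
#include <cuda_runtime.h>

__global__ void bwd_parallel_over_seqlen_k(
    const float* __restrict__ q,   // [seqlen_q, d]
    const float* __restrict__ k,   // [seqlen_k, d]
    const float* __restrict__ ds,  // [seqlen_q, seqlen_k], dS after the softmax backward
    float* __restrict__ dq,        // [seqlen_q, d]  shared across blocks -> atomics
    float* __restrict__ dk,        // [seqlen_k, d]  owned by one block   -> no atomics
    int seqlen_q, int seqlen_k, int d)
{
    int kv = blockIdx.x;           // this block owns K/V row `kv`
    if (kv >= seqlen_k) return;

    for (int dim = threadIdx.x; dim < d; dim += blockDim.x) {
        float dk_acc = 0.f;        // local accumulator: only this block writes dk[kv]
        for (int qi = 0; qi < seqlen_q; ++qi) {
            float s = ds[qi * seqlen_k + kv];
            dk_acc += s * q[qi * d + dim];                      // dK[kv] += dS^T Q
            atomicAdd(&dq[qi * d + dim], s * k[kv * d + dim]);  // dQ[qi] += dS K (many blocks write)
        }
        dk[kv * d + dim] = dk_acc;
    }
}

Parallelizing along seqlen_q instead would flip the roles: dq would be block-local, but every Q block would touch all of dk and dv, requiring atomic adds on both, which is what the commit avoids.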
File              Last commit                                                      Date
gemm.h            Refactor gemm_cl to template on either __half or __nv_bfloat16  2022-07-09 23:18:26 -07:00
gmem_tile.h       Parallelize CUDA bwd along seqlen_k instead of seqlen_q         2022-11-05 16:26:17 -07:00
kernel_traits.h   Split bwd on the seqlen_q dimension                             2022-10-23 11:35:15 -07:00
mask.h            Rework dropout to decouple forward and backward                 2022-10-21 12:04:27 -07:00
smem_tile.h       Rework dropout to decouple forward and backward                 2022-10-21 12:04:27 -07:00
softmax.h         Rework dropout to decouple forward and backward                 2022-10-21 12:04:27 -07:00
utils.h           Refactor to template on __half, implement bf16 util functions   2022-07-09 23:18:26 -07:00
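As context for the gemm.h and utils.h entries above, the dtype refactor they mention amounts to templating the GEMM and utility helpers on the element type so the same code instantiates for __half and __nv_bfloat16. The traits struct and member names below are a hypothetical sketch of that pattern, not the actual types in these headers.

#include <cuda_fp16.h>
#include <cuda_bf16.h>

// Hypothetical traits template: one parameter selects __half or __nv_bfloat16,
// and each specialization maps to the matching packed type and intrinsics.
template <typename Elt> struct ElemTraits;

template <> struct ElemTraits<__half> {
    using Packed = __half2;
    static __device__ Packed mul2(Packed a, Packed b) { return __hmul2(a, b); }
    static __device__ float2 to_float2(Packed v)      { return __half22float2(v); }
};

template <> struct ElemTraits<__nv_bfloat16> {
    using Packed = __nv_bfloat162;
    static __device__ Packed mul2(Packed a, Packed b) { return __hmul2(a, b); }  // bf16 overload, sm_80+
    static __device__ float2 to_float2(Packed v)      { return __bfloat1622float2(v); }
};

// Code templated on Elt (e.g. a GEMM fragment op or a util function) then
// compiles unchanged for both element types.
template <typename Elt>
__device__ float2 mul_as_float2(typename ElemTraits<Elt>::Packed a,
                                typename ElemTraits<Elt>::Packed b) {
    return ElemTraits<Elt>::to_float2(ElemTraits<Elt>::mul2(a, b));
}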