flash-attention/csrc/flash_attn/src/fmha
Latest commit: 557781933d by Tri Dao, 2022-11-05 16:26:17 -07:00
Parallelize CUDA bwd along seqlen_k instead of seqlen_q. This is faster since we only need to do atomic adds on dq, instead of atomic adds on both dk and dv.
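For intuition, here is a minimal, hypothetical CUDA sketch of the trade-off that commit describes: when each thread block owns a tile of K/V rows (parallelizing along seqlen_k), dk and dv can be accumulated locally, and only dq, which every K/V block contributes to, needs atomicAdd. The kernel name, signature, and the simplification to plain float with a precomputed dS are illustrative only, not the repository's actual fmha kernels.

// Minimal sketch (hypothetical, not the repo's fmha kernels): backward pass
// parallelized over seqlen_k. Each block owns one K/V row, so dk (and,
// analogously, dv) accumulates locally without contention; only dq needs
// atomicAdd, since every K/V block contributes to it.
#include <cuda_runtime.h>

__global__ void bwd_parallel_over_seqlen_k(
    const float* __restrict__ q,   // [seqlen_q, d]
    const float* __restrict__ k,   // [seqlen_k, d]
    const float* __restrict__ ds,  // [seqlen_q, seqlen_k], dS after the softmax backward
    float* __restrict__ dq,        // [seqlen_q, d]  shared across blocks -> atomics
    float* __restrict__ dk,        // [seqlen_k, d]  owned by one block   -> no atomics
    int seqlen_q, int seqlen_k, int d)
{
    int kv = blockIdx.x;           // this block owns K/V row `kv`
    if (kv >= seqlen_k) return;

    for (int dim = threadIdx.x; dim < d; dim += blockDim.x) {
        float dk_acc = 0.f;        // local accumulator: only this block writes dk[kv]
        for (int qi = 0; qi < seqlen_q; ++qi) {
            float s = ds[qi * seqlen_k + kv];
            dk_acc += s * q[qi * d + dim];                      // dK[kv] += dS^T Q
            atomicAdd(&dq[qi * d + dim], s * k[kv * d + dim]);  // dQ[qi] += dS K (many blocks write)
        }
        dk[kv * d + dim] = dk_acc;
    }
}

Parallelizing along seqlen_q instead would flip the roles: dq would be block-local, but every Q block would touch all of dk and dv, requiring atomic adds on both, which is what the commit avoids.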
File              Last commit                                                      Date
gemm.h            Refactor gemm_cl to template on either __half or __nv_bfloat16  2022-07-09 23:18:26 -07:00
gmem_tile.h       Parallelize CUDA bwd along seqlen_k instead of seqlen_q         2022-11-05 16:26:17 -07:00
kernel_traits.h   Split bwd on the seqlen_q dimension                             2022-10-23 11:35:15 -07:00
mask.h            Rework dropout to decouple forward and backward                 2022-10-21 12:04:27 -07:00
smem_tile.h       Rework dropout to decouple forward and backward                 2022-10-21 12:04:27 -07:00
softmax.h         Rework dropout to decouple forward and backward                 2022-10-21 12:04:27 -07:00
utils.h           Refactor to template on __half, implement bf16 util functions   2022-07-09 23:18:26 -07:00
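As context for the gemm.h and utils.h entries above, the dtype refactor they mention amounts to templating the GEMM and utility helpers on the element type so the same code instantiates for __half and __nv_bfloat16. The traits struct and member names below are a hypothetical sketch of that pattern, not the actual types in these headers.

#include <cuda_fp16.h>
#include <cuda_bf16.h>

// Hypothetical traits template: one parameter selects __half or __nv_bfloat16,
// and each specialization maps to the matching packed type and intrinsics.
template <typename Elt> struct ElemTraits;

template <> struct ElemTraits<__half> {
    using Packed = __half2;
    static __device__ Packed mul2(Packed a, Packed b) { return __hmul2(a, b); }
    static __device__ float2 to_float2(Packed v)      { return __half22float2(v); }
};

template <> struct ElemTraits<__nv_bfloat16> {
    using Packed = __nv_bfloat162;
    static __device__ Packed mul2(Packed a, Packed b) { return __hmul2(a, b); }  // bf16 overload, sm_80+
    static __device__ float2 to_float2(Packed v)      { return __bfloat1622float2(v); }
};

// Code templated on Elt (e.g. a GEMM fragment op or a util function) then
// compiles unchanged for both element types.
template <typename Elt>
__device__ float2 mul_as_float2(typename ElemTraits<Elt>::Packed a,
                                typename ElemTraits<Elt>::Packed b) {
    return ElemTraits<Elt>::to_float2(ElemTraits<Elt>::mul2(a, b));
}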