flash-attention

Tri Dao 5b838a8bef Apply dropout scaling to dQ and dK instead of to V (in bwd)

Theoretically this might have lower numerical error since the scaling is in
fp32 instead of fp16 (not sure, I haven't thought too carefully about it).
However, in practice, the numerical errors seem about the same.

2022-07-03 17:53:37 -07:00
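
The two placements of the dropout scale mentioned in this commit can be compared with a small dense reference. This is only a sketch in plain Python, not the CUDA kernel: it assumes the forward pass is O = (mask * softmax(Q K^T) / (1 - p_drop)) V, omits the softmax scaling factor, and runs in double precision, so it illustrates the algebraic equivalence only and says nothing about fp16 versus fp32 error. The function and parameter names (`attention_bwd`, `scale_v`) are hypothetical and do not come from the repository.

```python
import math
import random

def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def softmax_rows(s):
    out = []
    for row in s:
        m = max(row)
        e = [math.exp(x - m) for x in row]
        z = sum(e)
        out.append([x / z for x in e])
    return out

def attention_bwd(q, k, v, d_out, mask, p_drop, scale_v):
    """Dense reference backward pass returning (dQ, dK).

    scale_v=True  : fold 1 / (1 - p_drop) into V before computing dP.
    scale_v=False : leave V alone and scale dQ and dK at the end.
    """
    rp = 1.0 / (1.0 - p_drop)
    p = softmax_rows(matmul(q, transpose(k)))
    v_used = [[x * rp for x in row] for row in v] if scale_v else v
    # Gradient w.r.t. the softmax output; dropped entries get zero gradient.
    d_p = [[m * g for m, g in zip(mrow, grow)]
           for mrow, grow in zip(mask, matmul(d_out, transpose(v_used)))]
    # Softmax backward: dS = P * (dP - rowsum(P * dP)).
    d_s = []
    for prow, grow in zip(p, d_p):
        dot = sum(a * b for a, b in zip(prow, grow))
        d_s.append([a * (b - dot) for a, b in zip(prow, grow)])
    d_q = matmul(d_s, k)
    d_k = matmul(transpose(d_s), q)
    if not scale_v:
        d_q = [[x * rp for x in row] for row in d_q]
        d_k = [[x * rp for x in row] for row in d_k]
    return d_q, d_k

def max_abs_diff(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
n, d, p_drop = 4, 3, 0.25
q, k, v, d_out = (rand_matrix(n, d) for _ in range(4))
mask = [[1.0 if random.random() >= p_drop else 0.0 for _ in range(n)]
        for _ in range(n)]

dq_a, dk_a = attention_bwd(q, k, v, d_out, mask, p_drop, scale_v=True)
dq_b, dk_b = attention_bwd(q, k, v, d_out, mask, p_drop, scale_v=False)
print(max_abs_diff(dq_a, dq_b), max_abs_diff(dk_a, dk_b))
```

Because dS is linear in dP, and dP is linear in V, moving the constant factor from V to dQ and dK leaves the exact result unchanged. Any difference in practice comes only from where the rounding happens, which is the point the commit message raises.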
flash_attn	Apply dropout scaling to dQ and dK instead of to V (in bwd)	2022-07-03 17:53:37 -07:00
