Our implementation uses Apex's FMHA code as a starting point.
We thank Young-jun Ko for the in-depth explanation of his FMHA implementation and for his thoughtful answers to our questions about CUDA.