* base version
* restructure pipelines, add special fp8 epilogue
* add variants
* add fp8 causal and modify dynamic tile scheduler
* better causal schedule
* maintain two schedules for non causal and causal
* removing macros
* fix regression
* clean up unneeded methods and variants
* fix mistake with NumProducerThreads
* use seqlen traits
* add fp8 .cu files and benchmark script
* fix merge issue
* fix merge issue
* fix merge issue
* remove duplicate code
* fix regression with varseqlen
* move varseqlen init in constexpr
* fix test script
* more constexpr on varseqlen and add max offset
* add back test cases
* adding files for fp8 changes.
* removed contiguous check.
* enable all tests except odd-seq-lengths, which currently crash.
* undid clang formatting.
* change to correct tile size for headdim=128.
* fixed odd-seq-len-k.
* minor formatting.
* minor reformatting.
---------
Co-authored-by: Tri Dao <tridao@users.noreply.github.com>