* base version
* restructure pipelines, add special fp8 epilogue
* add variants
* add fp8 causal and modify dynamic tile scheduler
* better causal schedule
* maintain two schedules for non-causal and causal
* remove macros
* fix regression
* clean up unneeded methods and variants
* fix mistake with NumProducerThreads
* use seqlen traits
* add fp8 .cu files and benchmark script
* fix merge issue
* remove duplicate code
* fix regression with varseqlen
* move varseqlen init in constexpr
* fix test script
* more constexpr on varseqlen and add max offset
* add back test cases