* update ck
* update ck
* update ck again
* update ck
* use pointer as seed and offset
* update CK
* Remove useless "else"
* Fix page-attn block table out-of-bounds read
---------
Co-authored-by: Po Yen, Chen <PoYen.Chen@amd.com>
* Add custom ops for compatibility with PT Compile
* Add support for varlen functions too
* Add version checks for pytorch API
* Fix PT compile interfaces so it works e2e
* Make sure PT < 2.4 runs fine
* Fix python mistake
* Fix all the autograd magic issues
* typo on head_dim
* Fix deterministic test failures, remove unneeded detach() calls
* remove test requires_grad
* Resolve all the pytorch versioning issues
* C++ and python refactor to improve padding management for torch.compile()
* Add improvements suggested by @anijain2305
* Integrate ck branch of ck_tile/fa_bwd_opt
* Assume dq and q share the same stride
* update ck
* Integrate more stride of dq_acc
* Revert fwd dropout
* Fix parameter order
* Integrate ck with more stride
* update the limit of hdim of bwd
* Check argument
* Add test_flash_attn_causal
* Support unpad lse
* Add test_flash_attn_varlen_causal, test_flash_attn_race_condition, test_flash_attn_bwd_overflow, test_flash_attn_bwd_transpose, test_flash_attn_bwd_varlen_overflow, test_flash_attn_deterministic, test_flash_attn_varlen_deterministic
* Fix stride and Kn0
* Fix CK sync issue
* Fix typo
* Update CK for changing of fmha_fwd_args
* Add kvcache tmp
* Add kvcache
* Fix comment
* Sync behavior with ck
* Update CK to develop
* remove large test case
* Add kvcache test
* Fix page_block_size in arg
* Minor fix
* Fix stride error
* Update seqlen of kvcache before splitkv
* Fix compile error
* Fix bug when hdim is not a multiple of 8
* Fit ck arg
* support adaptive num_splits
* add more tests
* Refine test tolerance
* update CK
* Move override_num_splits_if_necessary into cpp
* update ck
* Update ck
* Support different flags for different versions of HIP
* Remove coerce-illegal because it is not required in FA
* Update ck to fix scratch memory
* Add coerce-illegal in some version
* Add compile flag for rtn rounding
* remove redundant init
* Use env var to switch rounding mode
* update ck