* Release 3.3.0
Adds support for mixed precision GEMMs On Hopper and Ampere
Adds support for < 16B aligned GEMMs on Hopper
Enhancements to EVT
Enhancements to Python interface
Enhancements to Sub-byte type handling in CuTe
Several other bug-fixes and performance improvements.
* minor doc update
* [WIP] GEMM StreamK w/ Fused Epilogue
* Adds Gemm Streamk with Fused Epilogue kernel level struct.
* Mostly based on Gemm with Fused Epilogue,
* Requires a new epilogue
* Work in progress
* [WIP] StreamK support for GemmUniversalWithBroadcast
* Just based off of how StreamK is allowed in GemmUniversal
* Untested and a work in progress
* Minor fixes
* [WIP] It compiles!
It is almost certainly incorrect, but we're past getting the templates
to match, so checkpointing.
* Correction to reference kernel
* Fix typo
* Added MSE measurement
* Switch back to reference kernel + host for loop
Still WIP. Now we're getting even a larger MSE, but it's both on
basic Split-K and Stream-K.
* Fix typos
* Fix broadcast vector + requested changes
* Comment typo
* Small int option and more
* Fix incorrect condition on source needed
* Requested changes
* I think I got it?
* Bias vector should be stride 0
* Two source added!
* Typos
* Merge examples
* Bring back vector row offset
Just to ensure consistency with universal gemm with fused epilogue
* Base arguments and params structs for StreamK
* StreamK epilogue with broadcast now inherits the original
* undo params_streamk_base.h
---------
Co-authored-by: Ali Hassani <ahassanijr@gmail.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>