* Release 3.3.0
Adds support for mixed precision GEMMs On Hopper and Ampere
Adds support for < 16B aligned GEMMs on Hopper
Enhancements to EVT
Enhancements to Python interface
Enhancements to Sub-byte type handling in CuTe
Several other bug-fixes and performance improvements.
* minor doc update
Work around a likely GCC 8.x issue with fold expressions
and generic lambdas.
Only use the work-around when the host compiler is GCC 8.x.
This avoids any concerns about the work-around possibly
hindering inlining for a critical CuTe function (product).
Users can experiment with the work-around for other compilers
or compiler versions by defining the following macro.
CUTE_FOLD_GENERIC_LAMBDA_WORKAROUND
Fixes https://github.com/NVIDIA/cutlass/issues/788
Co-authored-by: Mark Hoemmen <mhoemmen@nvidia.com>