* Removed trivial copy constructors on parameter classes to enable device-side launch of CUTLASS kernels
* Added SFINAE to the `TensorRef(NonConstTensorRef const&)` constructor to avoid making it a copy-constructor for device code
* std => platform
* fix affine2
* really fix affine2
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>