* Support parallel split K mode for profiling
Signed-off-by: Peter Han <fujun.han@iluvatar.ai>
* Parallel Split K support
1. find gemm kernel by preference key
2. switch m and n for reduction kernel
Signed-off-by: Peter Han <fujun.han@iluvatar.ai>
* parallel split K for fp16 gemm
* add one missing file
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
CUTLASS 2.1 contributes:
- BLAS-style host-side API added to CUTLASS Library
- Planar Complex GEMM kernels targeting Volta and Turing Tensor Cores
- Minor enhancements and bug fixes