* Split GEMM reference templates into multiple TUs for parallel compilation
* Remove old files
* Balance reference kernels more evenly across TUs
* Remove three newly added refcheck kernels and some unnecessary fp8 library instances to reduce library size
* Remove auto fp8 kernels
* Remove some redundant kernels
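
For reviewers unfamiliar with the TU split, the general pattern can be sketched as follows. This is a minimal illustration, not the code in this change: the function `reference_gemm`, the file names in the comments, and the choice of `float`/`double` are all hypothetical. The header declares the template and marks its instantiations `extern`, and each source file holds one explicit instantiation so the build system can compile them in parallel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// --- reference_gemm.hpp (hypothetical header) ---
// Naive row-major reference GEMM: C = A * B, with A (m x k) and B (k x n).
template <typename T>
std::vector<T> reference_gemm(const std::vector<T>& a, const std::vector<T>& b,
                              std::size_t m, std::size_t n, std::size_t k) {
    assert(a.size() == m * k && b.size() == k * n);
    std::vector<T> c(m * n, T(0));
    for (std::size_t i = 0; i < m; ++i)
        for (std::size_t j = 0; j < n; ++j)
            for (std::size_t p = 0; p < k; ++p)
                c[i * n + j] += a[i * k + p] * b[p * n + j];
    return c;
}

// Explicit instantiation declarations: files that include the header
// do not instantiate these specializations themselves.
extern template std::vector<float> reference_gemm<float>(
    const std::vector<float>&, const std::vector<float>&,
    std::size_t, std::size_t, std::size_t);
extern template std::vector<double> reference_gemm<double>(
    const std::vector<double>&, const std::vector<double>&,
    std::size_t, std::size_t, std::size_t);

// --- reference_gemm_f32.cpp (hypothetical TU 1) ---
// Explicit instantiation definition: the float code is generated only here.
template std::vector<float> reference_gemm<float>(
    const std::vector<float>&, const std::vector<float>&,
    std::size_t, std::size_t, std::size_t);

// --- reference_gemm_f64.cpp (hypothetical TU 2) ---
template std::vector<double> reference_gemm<double>(
    const std::vector<double>&, const std::vector<double>&,
    std::size_t, std::size_t, std::size_t);
```

The sketch keeps everything in one file so it compiles standalone. In a real split, each explicit instantiation definition would live in its own `.cpp`, and balancing means distributing the expensive instantiations so that no single TU dominates build time.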