* Enable shared memory intrinsics and ldmatrix PTX on Clang.
This commit adds preprocessor checks to enable the shared memory
intrinsics `__cvta_generic_to_shared` and `__nvvm_get_smem_pointer`, as
well as the `ldmatrix` PTX instructions, on Clang. Preventing these
intrinsics from being used is a significant latency regression on Clang.
* refine the macro
---------
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>