* Fix unrelated MSVC build warnings
* Fix use of isnan in functional.h
Correct the namespace qualification of isnan in functional.h
so that it invokes cutlass::isnan for half_t, instead of
converting half_t to float and invoking std::isnan on host
(or ::isnan on device).
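The difference between the two call paths can be illustrated with a minimal, self-contained sketch. Everything in it is hypothetical: `demo` stands in for the `cutlass` namespace, `demo::half_t` is a toy 16-bit type that only distinguishes NaN from non-NaN, and `conversion_count` exists purely to make the float round trip observable. It is not the code in functional.h.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>

namespace demo {  // hypothetical stand-in for the cutlass namespace

// Counts how often a half_t was widened to float (demo instrumentation only).
inline int& conversion_count() {
  static int count = 0;
  return count;
}

// Toy half-precision type: stores raw IEEE binary16 bits.
struct half_t {
  std::uint16_t bits;

  // NaN in binary16: exponent bits all ones and a non-zero mantissa.
  bool nan_bits() const {
    return (bits & 0x7C00u) == 0x7C00u && (bits & 0x03FFu) != 0u;
  }

  // Simplified widening conversion: preserves only NaN-ness, not the value.
  explicit operator float() const {
    ++conversion_count();
    return nan_bits() ? std::numeric_limits<float>::quiet_NaN() : 0.0f;
  }
};

// Type-specific overload, analogous in spirit to cutlass::isnan for half_t.
inline bool isnan(half_t h) { return h.nan_bits(); }

// "Before": widen to float, then call the standard-library function.
inline bool is_nan_via_float(half_t h) {
  return std::isnan(static_cast<float>(h));
}

// "After": the namespace-qualified call selects the half_t overload directly.
inline bool is_nan_qualified(half_t h) { return demo::isnan(h); }

}  // namespace demo
```

Both helpers agree on the result; only the qualified call avoids the conversion to float.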
Adds 128x256 tile shapes to FP16/BF16 and FP8 generators.
Also adds 1x1x1 clusters to all existing FP16/BF16/FP8 generators.
NOTE: it is important to set the kernel filter (--kernels /
CUTLASS_LIBRARY_KERNELS) to a non-empty string and to skip pruning in
order to get all of the new configurations.
If profiling exhaustively, the filters can be set to `*`.
Number of CUTLASS 3.X GEMMs before this commit: 2868
Number of CUTLASS 3.X GEMMs after this commit: 4016
Co-authored-by: Ali Hassani <ahassani@nvidia.com>
* Skip void-C kernels in the profiler when beta is nonzero
The CUTLASS profiler currently skips only the disposition for void-C
kernels when beta is nonzero, although it makes more sense to skip
running them in the first place.
Not all users are aware of void-C kernels (as far as I know they weren't
a thing in 2.X), and not everyone remembers to filter out void-C kernels
when running the profiler with a nonzero beta.
The easiest solution (and, as far as I can tell, the correct way of
handling this) is for `can_implement` to return `false` when beta is
nonzero (or whatever argument indicates an epilogue source) but the
kernel is void-C.
The profiler already includes functionality to skip running kernels that
fail `can_implement`.
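A minimal sketch of the check described above, under stated assumptions: `sketch::EpilogueArguments` and `sketch::CollectiveEpilogue` are hypothetical names, and a `void` source element type is used here as the marker for a void-C kernel. The real argument structures and the exact place where the check lives may differ.

```cpp
#include <cassert>
#include <type_traits>

namespace sketch {  // hypothetical names, not the CUTLASS API

// Hypothetical epilogue arguments; only beta matters for this check.
struct EpilogueArguments {
  float alpha = 1.0f;
  float beta = 0.0f;
};

// Hypothetical collective epilogue, parameterized on the source (C) element
// type. ElementC = void marks a void-C kernel that never reads C.
template <typename ElementC>
struct CollectiveEpilogue {
  static bool can_implement(EpilogueArguments const& args) {
    // A void-C kernel cannot honor a nonzero beta because there is no C
    // operand to scale, so report the problem as unsupported.
    if (std::is_void<ElementC>::value && args.beta != 0.0f) {
      return false;
    }
    return true;
  }
};

}  // namespace sketch
```

A caller that already skips kernels failing `can_implement` would then never launch a void-C kernel with a nonzero beta, while kernels with a real C operand are unaffected.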
* Move checks to collectives instead
---------
Co-authored-by: Ali Hassani <ahassani@nvidia.com>
* It seems that __cplusplus can be inconsistent with _MSVC_LANG when detecting the C++17 language version. See https://github.com/NVIDIA/cutlass/issues/1474. Added a switch to check _MSVC_LANG in addition to __cplusplus
* Fixed typo.
* Oops, another typo.
* Changed incorrect logic, ifndef to ifdef
* Define CUTLASS_CPLUSPLUS for language version testing
Co-authored-by: Mark Hoemmen <mhoemmen@users.noreply.github.com>
---------
Co-authored-by: Mark Hoemmen <mhoemmen@users.noreply.github.com>
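For context on the language-version change above: MSVC reports `__cplusplus` as `199711L` unless `/Zc:__cplusplus` is passed, while `_MSVC_LANG` follows the selected `/std:` level. A minimal sketch of a unified version macro follows; `SKETCH_CPLUSPLUS`, `kHasCpp17`, and `kLanguageVersion` are hypothetical names chosen for this example, and the actual definition in the library may differ.

```cpp
#include <cassert>

// Prefer _MSVC_LANG when the compiler defines it, because __cplusplus may
// stay at 199711L on MSVC regardless of the selected language standard.
#if defined(_MSVC_LANG)
#  define SKETCH_CPLUSPLUS _MSVC_LANG
#else
#  define SKETCH_CPLUSPLUS __cplusplus
#endif

// Feature gates then test the unified macro instead of __cplusplus.
#if SKETCH_CPLUSPLUS >= 201703L
constexpr bool kHasCpp17 = true;
#else
constexpr bool kHasCpp17 = false;
#endif

// Exposed as a constant so the selected value can be inspected at run time.
constexpr long kLanguageVersion = SKETCH_CPLUSPLUS;
```

Note the `defined(_MSVC_LANG)` test: checking for the macro's presence (rather than its absence) is what selects the MSVC-specific value.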
* add missing header for size_t in `numeric_types.h`
* make nvrtc happy
* add missing header for int types in `cutlass/arch/memory.h`
---------
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
* fix uint128 operator add for 64-bit hilo implementation
* add uint128 test for operator add
* make clang happy
---------
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
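As an illustration of what a correct 128-bit add over two 64-bit halves has to do, here is a minimal sketch. `sketch::uint128_hilo` and `sketch::add` are hypothetical names, and this is not the library's uint128 code; it only shows the carry propagation from the low half into the high half.

```cpp
#include <cassert>
#include <cstdint>

namespace sketch {  // hypothetical names, not the CUTLASS type

// 128-bit unsigned integer stored as two 64-bit halves.
struct uint128_hilo {
  std::uint64_t lo;
  std::uint64_t hi;
};

// Adds two 128-bit values, wrapping modulo 2^128.
inline uint128_hilo add(uint128_hilo a, uint128_hilo b) {
  uint128_hilo result;
  result.lo = a.lo + b.lo;  // unsigned addition wraps on overflow
  // If the low half wrapped, the sum is smaller than either operand.
  std::uint64_t carry = (result.lo < a.lo) ? 1u : 0u;
  result.hi = a.hi + b.hi + carry;
  return result;
}

}  // namespace sketch
```

A unit test for this kind of code should include at least one case where the low half overflows, since that is the only path on which the carry matters.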
* Allow per-column bias in EpilogueTensorBroadcast
EpilogueTensorBroadcast only supports per-row vector broadcast, because
the bias stride is hardcoded.
It can easily support both if the bias stride is made conditional, and
the original behavior is maintained by defaulting to per-row.
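The idea of a conditional bias stride can be sketched on the host as follows. All names here (`sketch::BiasStride`, `sketch::BiasBroadcast`, the `PerColumnBias` template flag) are hypothetical, and the row-major output layout is an assumption of the example; the actual epilogue expresses strides differently.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

namespace sketch {  // hypothetical names, not the CUTLASS epilogue

// Strides (in elements) applied to the (row, column) coordinate of an
// output element when indexing into the bias vector.
struct BiasStride {
  std::size_t row;
  std::size_t col;
};

// PerColumnBias defaults to false so existing users keep per-row behavior.
template <bool PerColumnBias = false>
struct BiasBroadcast {
  // Per-row:    bias[m] is broadcast along columns, stride (1, 0).
  // Per-column: bias[n] is broadcast along rows,    stride (0, 1).
  static constexpr BiasStride stride() {
    return PerColumnBias ? BiasStride{0, 1} : BiasStride{1, 0};
  }

  // Adds the broadcast bias to a row-major rows x cols output in place.
  static void apply(std::vector<float>& output, std::size_t rows,
                    std::size_t cols, std::vector<float> const& bias) {
    const BiasStride s = stride();
    for (std::size_t r = 0; r < rows; ++r) {
      for (std::size_t c = 0; c < cols; ++c) {
        output[r * cols + c] += bias[r * s.row + c * s.col];
      }
    }
  }
};

}  // namespace sketch
```

With the default template argument the bias vector has one entry per row; with `BiasBroadcast<true>` it has one entry per column.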
* Add unit test for EpilogueTensorBroadcast with per-col bias
---------
Co-authored-by: Ali Hassani <ahassanijr@gmail.com>
Co-authored-by: Ali Hassani <ali@hippoml.com>