Commit Graph

196 Commits

Author SHA1 Message Date
Gregory Meyer (gregjm)
f60786b536
Remove undefined behavior from default constructor of PredicatedTileAccessIteratorParams. (#1258)
Currently, the default constructor of
`PredicatedTileAccessIteratorParams` will invoke undefined behavior in
its invocation of the `initialize` function. Specifically, it will
attempt to read from the uninitialized variables
`desc.element_size_bits` and `desc.advance_rank`. This commit changes
the default constructors of both `*Params` and `*Desc` to
zero-initialize all uninitialized members.
2023-12-11 23:01:53 -05:00
Christian Sigg
e1483d5fa0
Collection of changes to fix clang build. (#1200)
* Remove unused variables

* Qualify calls to make_fragment_? from templated base class.

Fixes clang build error.

* Add missing `#include <cstdio>`

* Various changes to fix clang compile errors.

* More changes to fix clang build.

Remaining issues:

- `params` initializer of `CollectiveEpilogue`.
- `ops` initializer of `Sm90VisitorImplBase`.
- `__usAtomicCAS` needs to be added to clang upstream.

* Fix remaining clang build issues.

* Qualify `cute::rank()` calls.

* Qualify some more calls that are otherwise ambiguous between `cute` and `std` namespace.

* Double-escape special registers in inline asm.

* small change

---------

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-12-08 14:42:12 -05:00
Ali Hassani
f4a0216601
Fix bug in single source GEMM with residual + streamk (#1249)
Followup to #1224.

A change in the stream-k threadblock swizzle ctor since 3.3 breaks
single source GEMM with fused epilogue and stream-k. Multi-source was
already corrected.

Co-authored-by: Ali Hassani <ahassanijr@gmail.com>
2023-12-07 11:12:02 -05:00
Ali Hassani
a75b4ac483
Fix Stream-K reduce bug in epilogue with broadcast (#1224)
Co-authored-by: Ali Hassani <ahassanijr@gmail.com>
2023-12-05 15:35:41 -05:00
Pradeep Ramani
e9e30c2304
Updates and Bug fixes to CUTLASS 3.3 (#1232) 2023-12-05 09:50:49 -05:00
Haicheng Wu
4a1709e17e
Fixed illegal PTX syntax (#1225) 2023-12-01 12:29:48 -05:00
Christian Sigg
bef1fbcbe6
Add missing #include <cstdio> (#1197)
* Add missing `#include <cstdio>`

* move to non nvrtc part

---------

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-12-01 11:58:53 -05:00
Christian Sigg
2375a07d01
Qualify calls to make_fragment_? from templated base class. (#1196)
Fixes clang build error.
2023-12-01 09:52:57 -05:00
cyyever
10b850f9c7
Fix some sign conversion warnings (#1172)
* Fix sign conversion warnings

* Fix type conversion warnings

* Fix sign conversion warnings

* Change smem_size_ to constexpr

* clang warnings

* undo cast change

* one miss change

* missing part

---------

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-11-30 00:28:40 -05:00
Christian Sigg
99c4eebe3b
Explicitly cast blockIdx to uint3 (#1192)
This works around a clang issue where blockIdx is of a different type.
2023-11-30 00:26:23 -05:00
reed
eb01d5449d
fix cp.async L2 prefetch typo (#1187) 2023-11-28 16:58:04 -05:00
Sergey Klevtsov
b5d8a5d9cc
Allow SM90 pingpong kernel to use custom tile schedulers (#1194)
Co-authored-by: Sergey Klevtsov <sklevtsov@nvidia.com>
2023-11-15 13:45:17 -05:00
reed
6e60b9b17c
enable L2::128B prefetch for cp.async by default (#1177) 2023-11-13 13:30:13 -05:00
Changho Hwang
1ab6cc7b68
Fix std::abs overloading for bfloat16_t (#1179) 2023-11-13 13:29:45 -05:00
reed
39c6a83f23
fix missing return warning (#1173) 2023-11-03 22:42:59 -04:00
wang-y-z
557be3ab0e
Fix several typos (#1169)
Co-authored-by: isaacw <isaacw@nvidia.com>
2023-11-02 23:54:46 -04:00
Pradeep Ramani
c008b4aea8
CUTLASS 3.3.0 (#1167)
* Release 3.3.0

Adds support for mixed precision GEMMs On Hopper and Ampere
Adds support for < 16B aligned GEMMs on Hopper
Enhancements to EVT
Enhancements to Python interface
Enhancements to Sub-byte type handling in CuTe
Several other bug-fixes and performance improvements.

* minor doc update
2023-11-02 11:09:05 -04:00
reed
922fb5108b
clean the format (#1140) 2023-10-24 22:59:06 -04:00
cyyever
7a7796afae
Fix is_zero (#1147)
* Fix is_zero

* Use constexpr

* Add CUTLASS_PRAGMA_UNROLL to loops

* Avoid if branches in is_zero
2023-10-23 12:09:37 -04:00
reed
fa8dfe631f
fix missing return warning for repeat and axpby (#1124) 2023-10-12 00:05:45 -04:00
Krzysztof Lecki
4082fed85a
Add missing int64 and uint64 overloads for conj (#1127)
Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
2023-10-05 20:01:44 -04:00
Fabian Schuetze
5f13dcad78
set kIsHeavy member variables (#1012)
* set kIsHeavy member variables

* correct kIsHeavy value for Tanh

* set kIsHeavy=false for HardSwish

---------

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-10-04 12:38:36 -04:00
Kyle Gerard Felker
61a38f83dc
Add #include <limits> to platform.h (#1121)
Closes #1118
2023-10-02 21:41:25 -04:00
Manish Gupta
7d8317a63e
Support for Mixed Input TensorOp (#1084)
* Passing warp-level mixed input F16*(S8/U8) tests

* passing device-level mixed input F16*(S8/U8) tests

* add to profiler - I8 (111 TFLOPs), U (123 TFLOPs)

* fast numeric conversions (I8 = 132 TFLOPs, U8 = 148 TFLOPs)

* Speedup reference compilation (REVERT THIS COMMIT)

* wider_add.u32_packed_sub.f16x2 (I8 = 132TFLOP/s, U8 = 170 TFLOP/s)

* Improve s8->f16 cvt and support bf16*u8 @158 TFLOPs

* BF16 * S8 (142 TFLOPs)

* Handle mixed-input upcast on OperandA (Support [S8|U8]*[F16|BF16]

* rename OpMultiplyAddMixedInput to OpMultiplyAddMixedInputUpcast

* Add device-level test and profiler support for upcast on operand A

* Move shfl before the cvt and reduce #shfls by 1/2

* fix smem_usage calculation for mixed_input types

* uncomment the stuff (getting ready for merge)

* profiler changes and mixed-input reference

* mixed input reference are in a new file

* use platform instead of std

* comments and typo only

* Use CreateGemmOperator and delete CreateMixedInputGemmOperator

* copyright for new files

* rebase follow-up
2023-09-27 11:18:30 -04:00
xuhaoran
67ae8e0603
Change the position of minus sign in line1549 array.h (#1091)
when I use cutlass::epilogue:🧵:LinearCombinationSigmoid, I encounter the this error:
cutlass/include/cutlass/array.h(1549): error: no operator "-" matches these operands
Moving  operator "-" from line 1549 to 1548 can solve this error
2023-09-26 17:26:39 -04:00
ZCHNO
14f69bddc8
[fix] fix comparison operator for integer_subbyte (#1090) 2023-09-26 17:26:12 -04:00
ANIKET SHIVAM
90d3b0fb18
CUTLASS 3.2.1 (#1113)
* Updates for 3.2.1 release.

* Minor fix in gemm op profiler for raster order.

* Add scheduler mapping for raster order in the kernels.
2023-09-26 17:24:26 -04:00
reed
e0aaa3c3b3
fix GmmaDescriptor print format string error (#1102) 2023-09-19 23:27:58 -04:00
Driss Guessous
88c0d7c726
make only visible on device (#1071) 2023-09-07 13:00:46 -04:00
Aman Gupta Karmani
34fd98056b
fix cinttypes issue with STDC_FORMAT_MACROS (#1068)
* fix cinttypes issue with STDC_FORMAT_MACROS

* Update mma_sm90_desc.hpp

* Update mma_sm90_desc.hpp

---------

Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
2023-08-29 14:59:33 -04:00
reed
6673df0e48
fix typos (#1059) 2023-08-27 00:49:26 -04:00
Lufang Chen
7618e9bfd8
Fix numeric conversion warning (#1021)
* fix numeric conversion unused var

* update

---------

Co-authored-by: Lufang CHEN 陈橹方 <lufang.chen@nio.com>
2023-08-27 00:42:44 -04:00
ANIKET SHIVAM
a88c41cf8d
Updates for 3.2 release (#1065) 2023-08-25 23:05:46 -04:00
Allard Hendriksen
2a9fa23e06
Avoid cute::print compiler warnings with -Wformat-security (#1041)
Fixes issue #1040.
2023-08-18 14:38:27 -04:00
Haibin Lin
7e5ee8b7bf
[doc] fix: fix typos in the comment (#1049) 2023-08-16 11:39:25 -04:00
ANIKET SHIVAM
4575443d44
CUTLASS 3.2 (#1024)
* CUTLASS 3.2
2023-08-07 20:50:32 -04:00
Sophia Wisdom
d20f3a9542
spelling (#1007)
logicial -> logical
2023-07-20 14:41:11 -04:00
ChangyouSiom
e066ced33b
fix epilogue iterator error (#995)
* fix epilogue iterator error

* fix epilogue iterator error

---------

Co-authored-by: maxiao <maxiao@cowarobot.com>
2023-07-10 21:30:31 -04:00
Jack Kosaian
87349d3496
Add grouped b2b GEMM (#970) 2023-06-05 17:16:57 -04:00
Jack Kosaian
7dbf423763
Add conversion from ElementBias to ElementCompute (#961) 2023-05-26 23:08:36 -04:00
Aleksandar Samardžić
d3e72719b4
Add support for sparse GEMM with row broadcasted bias vector (#951) 2023-05-24 10:25:05 -04:00
ANIKET SHIVAM
f079619f5e
More updates for 3.1 (#958)
* Updates for 3.1

* Minor change

* doc link fix

* Minor updates
2023-05-24 10:17:16 -04:00
Ali Hassani
13f413493a
Stream-K with broadcast (#892)
* [WIP] GEMM StreamK w/ Fused Epilogue

* Adds Gemm Streamk with Fused Epilogue kernel level struct.
  * Mostly based on Gemm with Fused Epilogue,
  * Requires a new epilogue
  * Work in progress

* [WIP] StreamK support for GemmUniversalWithBroadcast

* Just based off of how StreamK is allowed in GemmUniversal
  * Untested and a work in progress

* Minor fixes

* [WIP] It compiles!

It is almost certainly incorrect, but we're past getting the templates
to match, so checkpointing.

* Correction to reference kernel

* Fix typo

* Added MSE measurement

* Switch back to reference kernel + host for loop

Still WIP. Now we're getting even a larger MSE, but it's both on
basic Split-K and Stream-K.

* Fix typos

* Fix broadcast vector + requested changes

* Comment typo

* Small int option and more

* Fix incorrect condition on source needed

* Requested changes

* I think I got it?

* Bias vector should be stride 0

* Two source added!

* Typos

* Merge examples

* Bring back vector row offset

Just to ensure consistency with universal gemm with fused epilogue

* Base arguments and params structs for StreamK

* StreamK epilogue with broadcast now inherits the original

* undo params_streamk_base.h

---------

Co-authored-by: Ali Hassani <ahassanijr@gmail.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-05-22 19:05:06 -04:00
wll
19c4a4815e
replace division with multiplication in GELU (#942) 2023-05-12 10:57:18 -04:00
Gregory Meyer (gregjm)
fcfbd23e26
Fix host compilation of cute::cast_smem_ptr_to_uint. (#940)
* Remove references to device-only intrinsics when compiling for host.

Currently, we attempt to use the `__device__`-only functions
`__cvta_generic_to_shared` and `__nvvm_get_smem_pointer` when compiling
`cute::cast_smem_ptr_to_uint` for the host on Clang. This results in a
compilation error, as expected. This commit changes the definition of
the `*_ACTIVATED` macros so that they are only true when `__CUDA_ARCH__`
is defined; that is, when compiling for the device.

Additionally, the declaration of `__nvvm_get_smem_pointer`
is currently only visible during the device compilation pass when
compiling with NVCC; this commit makes the declaration visible during
host compilation with the `__device__` annotation.

* Annotate cute::cast_smem_ptr_to_uint as device-only.

The implementation of `cute::cast_smem_ptr_to_uint` is currently an
unchecked failure on host code, and the only host implementation I can
think of -- casting a probably-64-bit pointer to 32 bits somehow --
doesn't make sense to implement. This commit marks this function as
device-only so that it can't be accidentally used on host code.

* small change

---------

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-05-10 00:06:54 -04:00
Gregory Meyer (gregjm)
b250faccd3
Make operator() const-correct and add missing static functions. (#936)
* Make operator() const-correct and add missing static functions.

Currently, `*Converter::operator()` requires a mutable object to invoke,
and there are missing `static result_type convert(source_type const &
source)` overloads for certain partial specializations of `*Converter`
objects. This commit makes `operator()` const-correct and adds missing
function overloads where appropriate.

* minor changes

* format

---------

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-05-09 16:33:01 -04:00
Janusz Lisiecki
24c8b7d8a2
Fix cuTE compilation with clang (#939)
- clang 1.14 complains about missing function from a host call:
  cutlass/include/cute/arch/util.hpp:106:32: error: no matching function for call to '__cvta_generic_to_shared'
  return static_cast<uint32_t>(__cvta_generic_to_shared(ptr));
- fixes this by defining CUTE_HOST_DEVICE for clang as well

Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
2023-05-09 09:51:45 -04:00
ANIKET SHIVAM
7c04f95415
Updates for 3.1 (#932) 2023-04-29 09:34:27 -04:00
Gregory Meyer (gregjm)
6f8596ce3f
Add missing #include directive to get access to cutlass::epilogue:🧵:ScaleType. (#925)
Currently, the `LinearCombinationClamp` header file is not standalone,
and must have the definition of `cutlass::epilogue:🧵:ScaleType`
already available when it is `#include`d.
2023-04-28 20:02:41 -04:00
Adnan Akhundov
fe2f491dd7
Get SM count with cudaDeviceGetAttribute in KernelHardwareInfo (#927) 2023-04-28 13:23:23 -04:00