Commit Graph

403 Commits

Author SHA1 Message Date
Sophia Wisdom
2d9a557427
torch.bfloat16 support in cutlass python (#1037)
* torch.bfloat16 support in cutlass python

* Update datatypes.py
2023-08-16 11:38:53 -04:00
ANIKET SHIVAM
4575443d44
CUTLASS 3.2 (#1024)
* CUTLASS 3.2
2023-08-07 20:50:32 -04:00
Xianyao Zhang
a0d787b746
Fix one publication (#1019) 2023-07-28 11:40:17 -04:00
Sophia Wisdom
d20f3a9542
spelling (#1007)
logicial -> logical
2023-07-20 14:41:11 -04:00
Tianqi Zhang (张天启)
8e85580859
fix layout bug (#1006) 2023-07-19 14:26:01 -04:00
dan_the_3rd
146d314057
Update fMHA kernels (#992)
* Update fMHA kernels

Upstream recent changes to fMHA that we did in xFormers.
Previous version in CUTLASS: facebookresearch/xformers@b6be33a
Updating to: facebookresearch/xformers@55a4798

* minor changes

* make var work

---------

Co-authored-by: danthe3rd <danthe3rd>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-07-12 22:30:46 -04:00
masahi
f679663224
Add RMS norm (#979) 2023-07-10 21:31:27 -04:00
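For readers unfamiliar with the op: RMS norm scales each element by the reciprocal root-mean-square of the vector, with a learned per-element gain. A minimal host-side C++ sketch of the standard formulation (illustrative only; the CUTLASS example fuses this into a device kernel):

```cpp
#include <cmath>
#include <cstddef>

// Standard RMS norm: y[i] = gamma[i] * x[i] / sqrt(mean(x^2) + eps).
// Reference sketch only; names and the eps default are illustrative.
void rms_norm(float const* x, float const* gamma, float* y,
              std::size_t n, float eps = 1e-6f) {
  float sum_sq = 0.0f;
  for (std::size_t i = 0; i < n; ++i) {
    sum_sq += x[i] * x[i];
  }
  float inv_rms = 1.0f / std::sqrt(sum_sq / static_cast<float>(n) + eps);
  for (std::size_t i = 0; i < n; ++i) {
    y[i] = gamma[i] * x[i] * inv_rms;
  }
}
```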
ChangyouSiom
e066ced33b
fix epilogue iterator error (#995)
* fix epilogue iterator error

* fix epilogue iterator error

---------

Co-authored-by: maxiao <maxiao@cowarobot.com>
2023-07-10 21:30:31 -04:00
Nathan Wang
9b923dd4c4
fix minor typos (#984) 2023-07-05 09:23:01 -04:00
q.yao
f6d42f2dd0
add library_dirs (#977) 2023-06-14 12:09:12 -04:00
ANIKET SHIVAM
473a67073e
Fix Int8 and TF32 generator (#976) 2023-06-12 12:32:52 -04:00
Jack Kosaian
87349d3496
Add grouped b2b GEMM (#970) 2023-06-05 17:16:57 -04:00
Vijay Thakkar
fde824af21
Update Hopper performance plot for CUTLASS 3.1 + CTK 12.1 (#967) 2023-06-01 14:52:40 -04:00
Jack Kosaian
7dbf423763
Add conversion from ElementBias to ElementCompute (#961) 2023-05-26 23:08:36 -04:00
Haicheng Wu
6f47420213
Update README.md 2023-05-24 12:40:31 -04:00
Haicheng Wu
4638250469
Update CHANGELOG.md 2023-05-24 12:39:42 -04:00
Haicheng Wu
7859fe322a
Update PUBLICATIONS.md 2023-05-24 12:36:12 -04:00
Aleksandar Samardžić
d3e72719b4
Add support for sparse GEMM with row broadcasted bias vector (#951) 2023-05-24 10:25:05 -04:00
Ali Hassani
b4ab501767
Adds CUDA path for x86-64 (#957) 2023-05-24 10:21:25 -04:00
ANIKET SHIVAM
f079619f5e
More updates for 3.1 (#958)
* Updates for 3.1

* Minor change

* doc link fix

* Minor updates
2023-05-24 10:17:16 -04:00
Ali Hassani
13f413493a
Stream-K with broadcast (#892)
* [WIP] GEMM StreamK w/ Fused Epilogue

* Adds a Gemm StreamK with Fused Epilogue kernel-level struct.
  * Mostly based on Gemm with Fused Epilogue,
  * Requires a new epilogue
  * Work in progress

* [WIP] StreamK support for GemmUniversalWithBroadcast

* Just based off of how StreamK is allowed in GemmUniversal
  * Untested and a work in progress

* Minor fixes

* [WIP] It compiles!

It is almost certainly incorrect, but we're past getting the templates
to match, so checkpointing.

* Correction to reference kernel

* Fix typo

* Added MSE measurement

* Switch back to reference kernel + host for loop

Still WIP. Now we're getting an even larger MSE, but it shows up on
both basic Split-K and Stream-K.

* Fix typos

* Fix broadcast vector + requested changes

* Comment typo

* Small int option and more

* Fix incorrect condition on source needed

* Requested changes

* I think I got it?

* Bias vector should be stride 0

* Two source added!

* Typos

* Merge examples

* Bring back vector row offset

Just to ensure consistency with universal gemm with fused epilogue

* Base arguments and params structs for StreamK

* StreamK epilogue with broadcast now inherits the original

* undo params_streamk_base.h

---------

Co-authored-by: Ali Hassani <ahassanijr@gmail.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-05-22 19:05:06 -04:00
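The "stride 0" bullet above is the usual broadcast trick: giving the bias vector a zero row stride makes every row of the output read the same N elements. A hedged sketch of the indexing, with illustrative names (not the actual epilogue code):

```cpp
#include <cstddef>

// D = Acc + bias, where bias is indexed with a row stride ldv.
// Passing ldv == 0 broadcasts one N-element bias vector to every row,
// which is the "stride 0" behavior mentioned above.
void add_bias(float* D, float const* Acc, float const* bias,
              std::size_t M, std::size_t N, std::size_t ldv) {
  for (std::size_t m = 0; m < M; ++m) {
    for (std::size_t n = 0; n < N; ++n) {
      D[m * N + n] = Acc[m * N + n] + bias[m * ldv + n];
    }
  }
}
```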
Haicheng Wu
6fbc0d3380
Update layout.md 2023-05-17 20:12:58 -04:00
Manish Gupta
b97404837e
Adding 128x256 tile for 16b input datatype WGMMA gemm (#950) 2023-05-17 17:13:23 -04:00
Haicheng Wu
e2953d47c5
Update gemm_api.md 2023-05-12 15:37:31 -04:00
wll
19c4a4815e
replace division with multiplication in GELU (#942) 2023-05-12 10:57:18 -04:00
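This is a classic strength reduction: division is much more expensive than multiplication on GPUs, and GELU's x / sqrt(2) can be rewritten as x * (1/sqrt(2)) with a compile-time constant. A sketch of the before/after (illustrative, not the exact CUTLASS functor):

```cpp
#include <cmath>

// GELU(x) = 0.5 * x * (1 + erf(x / sqrt(2)))

// Before: a division on every evaluation.
float gelu_div(float x) {
  return 0.5f * x * (1.0f + std::erf(x / std::sqrt(2.0f)));
}

// After: multiply by the precomputed reciprocal instead.
float gelu_mul(float x) {
  constexpr float kRsqrt2 = 0.70710678118654752440f;  // 1 / sqrt(2)
  return 0.5f * x * (1.0f + std::erf(x * kRsqrt2));
}
```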
Gregory Meyer (gregjm)
fcfbd23e26
Fix host compilation of cute::cast_smem_ptr_to_uint. (#940)
* Remove references to device-only intrinsics when compiling for host.

Currently, we attempt to use the `__device__`-only functions
`__cvta_generic_to_shared` and `__nvvm_get_smem_pointer` when compiling
`cute::cast_smem_ptr_to_uint` for the host on Clang. This results in a
compilation error, as expected. This commit changes the definition of
the `*_ACTIVATED` macros so that they are only true when `__CUDA_ARCH__`
is defined; that is, when compiling for the device.

Additionally, the declaration of `__nvvm_get_smem_pointer`
is currently only visible during the device compilation pass when
compiling with NVCC; this commit makes the declaration visible during
host compilation with the `__device__` annotation.

* Annotate cute::cast_smem_ptr_to_uint as device-only.

The implementation of `cute::cast_smem_ptr_to_uint` is currently an
unchecked failure on host code, and the only host implementation I can
think of -- casting a probably-64-bit pointer to 32 bits somehow --
doesn't make sense to implement. This commit marks this function as
device-only so that it can't be accidentally used on host code.

* small change

---------

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-05-10 00:06:54 -04:00
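The pattern the commit describes, sketched with illustrative macro names (the real guards live in the CuTe headers and differ in detail): gate the intrinsic behind `__CUDA_ARCH__`, which is only defined during the device pass, and mark the caller `__device__`-only since no host implementation makes sense:

```cpp
// __CUDA_ARCH__ is defined only while compiling device code, so the
// device-only intrinsic never appears in the host compilation pass.
#if defined(__CUDA_ARCH__)
#  define SMEM_CAST_ACTIVATED 1
#else
#  define SMEM_CAST_ACTIVATED 0
#endif

// Device-only: casting a generic (64-bit) pointer down to a 32-bit
// shared-memory offset has no sensible host equivalent.
__device__ inline unsigned smem_ptr_to_uint_sketch(void const* ptr) {
#if SMEM_CAST_ACTIVATED
  return static_cast<unsigned>(__cvta_generic_to_shared(ptr));
#else
  return 0;  // keeps the declaration parseable; never taken on device
#endif
}
```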
Gregory Meyer (gregjm)
b250faccd3
Make operator() const-correct and add missing static functions. (#936)
* Make operator() const-correct and add missing static functions.

Currently, `*Converter::operator()` requires a mutable object to invoke,
and there are missing `static result_type convert(source_type const &
source)` overloads for certain partial specializations of `*Converter`
objects. This commit makes `operator()` const-correct and adds missing
function overloads where appropriate.

* minor changes

* format

---------

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-05-09 16:33:01 -04:00
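The shape of the fix, sketched on a made-up converter type (the real ones are partial specializations of `cutlass::NumericConverter`): add the `static convert` overload and make `operator()` `const` so a const converter object can be invoked:

```cpp
// Illustrative converter showing the const-correct interface.
template <typename T, typename S>
struct ConverterSketch {
  using result_type = T;
  using source_type = S;

  // The missing static overload: usable without an instance.
  static result_type convert(source_type const& s) {
    return static_cast<result_type>(s);
  }

  // Now const-qualified, so a `ConverterSketch const` works too.
  result_type operator()(source_type const& s) const {
    return convert(s);
  }
};
```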
Janusz Lisiecki
24c8b7d8a2
Fix CuTe compilation with clang (#939)
- clang 14 complains about a missing function in a host call:
  cutlass/include/cute/arch/util.hpp:106:32: error: no matching function for call to '__cvta_generic_to_shared'
  return static_cast<uint32_t>(__cvta_generic_to_shared(ptr));
- fixes this by defining CUTE_HOST_DEVICE for clang as well

Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
2023-05-09 09:51:45 -04:00
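A sketch of what "defining CUTE_HOST_DEVICE for clang as well" looks like in spirit (the actual definitions are in `cute/config.hpp` and may differ): extend the annotation macro to clang's CUDA mode so CuTe functions compile as `__host__ __device__` rather than host-only:

```cpp
// clang in CUDA mode defines __CUDA__; NVCC defines __CUDACC__.
// Covering both gives CuTe functions device annotations under clang
// instead of leaving them host-only (which triggered the error above).
#if defined(__CUDACC__) || (defined(__clang__) && defined(__CUDA__))
#  define CUTE_HOST_DEVICE __forceinline__ __host__ __device__
#else
#  define CUTE_HOST_DEVICE inline
#endif
```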
ANIKET SHIVAM
7c04f95415
Updates for 3.1 (#932) 2023-04-29 09:34:27 -04:00
Gregory Meyer (gregjm)
6f8596ce3f
Add missing #include directive to get access to cutlass::epilogue::thread::ScaleType. (#925)
Currently, the `LinearCombinationClamp` header file is not standalone,
and must have the definition of `cutlass::epilogue::thread::ScaleType`
already available when it is `#include`d.
2023-04-28 20:02:41 -04:00
Adnan Akhundov
fe2f491dd7
Get SM count with cudaDeviceGetAttribute in KernelHardwareInfo (#927) 2023-04-28 13:23:23 -04:00
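For reference, the attribute query in question: `cudaDeviceGetAttribute` with `cudaDevAttrMultiProcessorCount` returns the SM count without the cost of a full `cudaGetDeviceProperties` call. A minimal usage sketch:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  int device_id = 0;
  int sm_count = 0;
  // Lightweight query of the multiprocessor (SM) count for one device.
  cudaError_t err = cudaDeviceGetAttribute(
      &sm_count, cudaDevAttrMultiProcessorCount, device_id);
  if (err != cudaSuccess) {
    std::fprintf(stderr, "query failed: %s\n", cudaGetErrorString(err));
    return 1;
  }
  std::printf("device %d has %d SMs\n", device_id, sm_count);
  return 0;
}
```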
Adnan Akhundov
df02482f1d
Add missing schedules argument in SM90 fp16 op generation (#920) 2023-04-26 16:44:49 -04:00
Jakub Szuppe
180c5629bf
Add missing checks for NVRTC in CuTe (#921) 2023-04-25 12:52:43 -04:00
Alexander Zinoviev
e36912f961
Fix for dangling references in the MHA example (#918) 2023-04-19 21:35:46 -04:00
Jack Kosaian
9a83bd3381
CUTLASS 3.1 Python interface documentation (#917)
* Add 12.1 Dockerfile

* Add 3.1 docs
2023-04-18 15:11:35 -04:00
Adnan Akhundov
54bebe417d
Fix some typos in CuTe tutorials (#912) 2023-04-17 16:00:51 -04:00
Guray Ozen
43cfbe0086
Allow L2 prefetch for the clang compiler (#914) 2023-04-15 01:23:22 -04:00
Aleksandr Pivovar
4a68cf748e
added support for b2b bmm (#849)
* added support for b2b bmm

* fixed arguments and params structures

* added batch_count argument

* removed SplitKSerial and added a new test case with b2b bmm

* fixed support for kBatched mode and added a new test case with batch stride

* added batch support for bias and scale

* make test

* small changes

---------

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-04-14 23:20:02 -04:00
ANIKET SHIVAM
d572cc1aab
CUTLASS 3.1 (#915)
Co-authored-by: Aniket Shivam <ashivam@nvidia.com>
2023-04-14 23:19:34 -04:00
dan_the_3rd
9b8166e3f0
fMHA: Add backward pass (#844)
* fMHA: Add backward pass

* Better checks for strides/alignments

* Remove fb-internal URL

* torch.Tensor.untyped_storage requires pytorch 2.0+

* minor changes

* make test

---------

Co-authored-by: danthe3rd <danthe3rd>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-04-06 20:44:58 -04:00
Shuai Shao
e2d439ee7e
Add tile_n=32 and tile_k=32 kernels in generator.py (#858) 2023-04-06 10:00:52 -04:00
Adnan Akhundov
0435979f59
Remove const from 3.x GemmUniversalAdapter::operator() (#905) 2023-04-03 20:30:51 -04:00
Adnan Akhundov
2ba1ef10be
Increase max dynamic SMEM size in GemmSoftmax (#903) 2023-04-03 10:01:12 -04:00
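Background: a kernel that wants more than the default 48 KB of dynamic shared memory must opt in per kernel before launch. A hedged sketch of the standard opt-in (the kernel name here is an illustrative placeholder, not the GemmSoftmax symbol):

```cpp
#include <cuda_runtime.h>

__global__ void gemm_softmax_sketch() {
  extern __shared__ char smem[];  // dynamic shared memory
  (void)smem;
}

cudaError_t raise_smem_limit(int smem_bytes) {
  // Allow this kernel to be launched with > 48 KB of dynamic SMEM.
  return cudaFuncSetAttribute(
      gemm_softmax_sketch,
      cudaFuncAttributeMaxDynamicSharedMemorySize,
      smem_bytes);
}
```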
Adnios
0964bdb64c
update gemm and conv2d cmdline --help output (#878) 2023-04-01 11:38:13 -04:00
Gregory Meyer (gregjm)
ecbd24566c
Enable shared memory intrinsics and ldmatrix PTX on Clang. (#754)
* Enable shared memory intrinsics and ldmatrix PTX on Clang.

This commit adds preprocessor checks to enable the shared memory
intrinsics `__cvta_generic_to_shared` and `__nvvm_get_smem_pointer`, as
well as the `ldmatrix` PTX instructions, on Clang. Preventing these
intrinsics from being used causes a significant latency regression on Clang.

* refine the macro

---------

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-03-31 21:42:24 -04:00
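A sketch of the guard's spirit, with an illustrative macro name (the real checks in the CUTLASS headers differ): key the PTX path on the device pass and target architecture rather than on an NVCC-only macro, so clang takes the fast path too:

```cpp
// __CUDA_ARCH__ is set in the device pass by both NVCC and clang, so
// keying on it (plus the SM75 floor that ldmatrix requires) enables the
// PTX path under clang instead of a slow element-wise fallback.
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 750)
#  define LDSM_ACTIVATED 1
#else
#  define LDSM_ACTIVATED 0
#endif

#if LDSM_ACTIVATED
__device__ inline void ldsm_x4(unsigned& d0, unsigned& d1, unsigned& d2,
                               unsigned& d3, unsigned smem_addr) {
  asm volatile("ldmatrix.sync.aligned.x4.m8n8.shared.b16 "
               "{%0, %1, %2, %3}, [%4];\n"
               : "=r"(d0), "=r"(d1), "=r"(d2), "=r"(d3)
               : "r"(smem_addr));
}
#endif
```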
Manish Gupta
660a05f581
fix split_k_mode and add reduction kernel for f16 input/accum/output (#896) 2023-03-30 15:31:08 -04:00
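For context on what the added reduction kernel computes: in split-K mode each slice produces a partial GEMM over its chunk of K, and a follow-up kernel sums the partials. A host-side sketch of those semantics (illustrative; the real kernel runs on device with f16 data):

```cpp
#include <cstddef>

// Split-K: partials[s] holds the product over the s-th K-chunk.
// The reduction sums S partial tiles into the final output D.
void splitk_reduce(float* D, float const* partials,
                   std::size_t S, std::size_t M, std::size_t N) {
  for (std::size_t m = 0; m < M; ++m) {
    for (std::size_t n = 0; n < N; ++n) {
      float acc = 0.0f;
      for (std::size_t s = 0; s < S; ++s) {
        acc += partials[(s * M + m) * N + n];
      }
      D[m * N + n] = acc;
    }
  }
}
```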
Feng Shijie
bc36122c3f
[layout] Fix AffineRank2ColumnMajor::packed() (#879)
* [layout] Fix AffineRank2ColumnMajor::packed()

* correct affine2row::packed

---------

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-03-29 11:59:48 -04:00
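Background for the fix: an affine rank-2 layout keeps an independent stride per mode, so offset(r, c) = r * stride_row + c * stride_col, and a "packed" column-major layout over an M x N extent must have stride_row = 1 and stride_col = M. A generic sketch of that invariant (illustrative names, not the CUTLASS class):

```cpp
#include <cstddef>

// Affine rank-2 layout: offset(r, c) = r * stride_row + c * stride_col.
struct AffineRank2Sketch {
  std::size_t stride_row, stride_col;

  std::size_t operator()(std::size_t r, std::size_t c) const {
    return r * stride_row + c * stride_col;
  }

  // Packed column-major over extent (M, N): rows are contiguous
  // (stride 1) and adjacent columns are M elements apart.
  static AffineRank2Sketch packed_column_major(std::size_t M,
                                               std::size_t /*N*/) {
    return {1, M};
  }
};
```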
Vijay Thakkar
15d9d31f1f
CUTLASS 3.0 Hopper GEMMs are GETTs in disguise (#897) 2023-03-29 10:42:40 -04:00
ptrblck
1eef5c3cf1
add guards for __CUDA_ARCH__ >= 530 (#891)
* add guards for sm>=70

* drop guard to 530
2023-03-28 17:47:10 -04:00
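Context for the guard value: native fp16 arithmetic intrinsics such as `__hadd` exist only from compute capability 5.3, so device code touching them needs an architecture check with a float fallback for older targets. A sketch of the pattern:

```cpp
#include <cuda_fp16.h>

__device__ inline __half add_halves(__half a, __half b) {
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 530)
  // Native half-precision add, available from sm_53 onward.
  return __hadd(a, b);
#else
  // Older architectures: round-trip through float.
  return __float2half(__half2float(a) + __half2float(b));
#endif
}
```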
Yujia Zhai
87070b6d51
add a CUTLASS publication (#893)
* add ByteTransformer

* update arXiv link

* re-order
2023-03-28 17:06:57 -04:00