Commit Graph

432 Commits

Author SHA1 Message Date
Jake Hemstad
112590114d
Add config.yml issue template with Discord link. (#1135) 2023-10-10 12:13:04 -04:00
Manish Gupta
ff02da2667
Fix parallel split-k (#1116) 2023-10-06 12:02:40 -04:00
Krzysztof Lecki
4082fed85a
Add missing int64 and uint64 overloads for conj (#1127)
Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
2023-10-05 20:01:44 -04:00
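For context on the entry above, a minimal sketch of what such overloads typically look like (hypothetical code, not the actual CUTLASS source): for real-valued integer types, conjugation is the identity, mirroring how `std::conj` treats real arguments.

```cpp
#include <cstdint>

// conj of a real (non-complex) value is the identity, so the missing
// 64-bit integer overloads can simply return their argument.
int64_t  conj(int64_t x)  { return x; }
uint64_t conj(uint64_t x) { return x; }
```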
Fabian Schuetze
5f13dcad78
set kIsHeavy member variables (#1012)
* set kIsHeavy member variables

* correct kIsHeavy value for Tanh

* set kIsHeavy=false for HardSwish

---------

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-10-04 12:38:36 -04:00
Kyle Gerard Felker
61a38f83dc
Add #include <limits> to platform.h (#1121)
Closes #1118
2023-10-02 21:41:25 -04:00
masahi
ff61a49dd1
Allow changing epsilon parameter in RMS norm kernel (#1112) 2023-10-02 20:40:28 -04:00
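For context, RMS norm is commonly defined with a small epsilon in the denominator for numerical stability; making it configurable lets callers trade stability against accuracy. A standard formulation:

```latex
\mathrm{RMSNorm}(x)_i = \frac{x_i}{\sqrt{\tfrac{1}{n}\sum_{j=1}^{n} x_j^2 + \epsilon}} \cdot w_i
```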
Lequn Chen
26986bbc60
Fix type typo in rmsnorm (#1119)
Initially the variable `h4` is declared as `half4`, but its last two fields are never used. Based on the semantics and the context, it should be `half2`.
2023-10-02 20:40:04 -04:00
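A minimal sketch of the pattern behind the fix above (hypothetical code, not the actual rmsnorm source): only the first two fields of the value are ever read, so `half2` is the semantically correct type.

```cpp
#include <cuda_fp16.h>

// With the old half4-style declaration, .z and .w were dead fields;
// half2 matches what the code actually uses.
__device__ float sum_of_squares(__half2 h2) {
  float2 f = __half22float2(h2);  // unpack the two halves to float
  return f.x * f.x + f.y * f.y;
}
```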
Manish Gupta
7d8317a63e
Support for Mixed Input TensorOp (#1084)
* Passing warp-level mixed input F16*(S8/U8) tests

* passing device-level mixed input F16*(S8/U8) tests

* add to profiler - I8 (111 TFLOPs), U8 (123 TFLOPs)

* fast numeric conversions (I8 = 132 TFLOPs, U8 = 148 TFLOPs)

* Speedup reference compilation (REVERT THIS COMMIT)

* wider_add.u32_packed_sub.f16x2 (I8 = 132 TFLOP/s, U8 = 170 TFLOP/s)

* Improve s8->f16 cvt and support bf16*u8 at 158 TFLOPs

* BF16 * S8 (142 TFLOPs)

* Handle mixed-input upcast on OperandA (support [S8|U8]*[F16|BF16])

* rename OpMultiplyAddMixedInput to OpMultiplyAddMixedInputUpcast

* Add device-level test and profiler support for upcast on operand A

* Move shfl before the cvt and reduce #shfls by 1/2

* fix smem_usage calculation for mixed_input types

* uncomment the stuff (getting ready for merge)

* profiler changes and mixed-input reference

* mixed input reference are in a new file

* use platform instead of std

* comments and typo only

* Use CreateGemmOperator and delete CreateMixedInputGemmOperator

* copyright for new files

* rebase follow-up
2023-09-27 11:18:30 -04:00
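The core idea the commit above iterates on, in a deliberately simplified form (hypothetical code; the real kernels use packed bit-twiddling conversions for speed): the narrower integer operand is upcast in registers to the floating-point type so both operands can feed the same Tensor Core MMA.

```cpp
#include <cuda_fp16.h>
#include <cstdint>

// Naive per-element upcast of an int8 fragment to f16 before the MMA.
// The optimized path replaces this loop with packed numeric conversions.
__device__ void upcast_s8_to_f16(int8_t const (&in)[4], __half (&out)[4]) {
  for (int i = 0; i < 4; ++i) {
    out[i] = __int2half_rn(static_cast<int>(in[i]));
  }
}
```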
Manish Gupta
5cd735c48e
Fix Parallel Split-K on Gemm Operation Profiler (#1109)
* Debug and fix for parallel split-k in profiler

* restore debug files and remove prints
2023-09-26 17:28:00 -04:00
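Background for the fix above, as a sketch under simplified assumptions (not the profiler code): parallel split-K partitions the GEMM K dimension across slices that run concurrently, each writing a partial product that a separate reduction pass then sums.

```cpp
#include <algorithm>

struct SplitKSlice { int k_begin; int k_end; };

// Half-open K range handled by one slice; the last slice may be short.
SplitKSlice slice_for(int k_total, int split_k, int slice_idx) {
  int per_slice = (k_total + split_k - 1) / split_k;  // ceil division
  int begin = slice_idx * per_slice;
  int end = std::min(begin + per_slice, k_total);
  return {begin, end};
}
```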
xuhaoran
67ae8e0603
Change the position of the minus sign at line 1549 of array.h (#1091)
When I use cutlass::epilogue::thread::LinearCombinationSigmoid, I encounter this error:
cutlass/include/cutlass/array.h(1549): error: no operator "-" matches these operands
Moving operator "-" from line 1549 to line 1548 resolves this error.
2023-09-26 17:26:39 -04:00
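A reduced illustration of the error class described above (hypothetical code, not the actual array.h): the element-wise sigmoid must apply the unary minus to the scalar element, because the array type itself defines no unary `operator-`.

```cpp
#include <cmath>

struct Array4 { float data[4]; };  // stand-in type with no unary operator-

// OK: the minus is applied to the scalar element inside the loop.
// Writing exp(-x) on an Array4 expression would fail to compile with
// "no operator '-' matches these operands".
Array4 sigmoid(Array4 const &x) {
  Array4 y;
  for (int i = 0; i < 4; ++i) {
    y.data[i] = 1.0f / (1.0f + std::exp(-x.data[i]));
  }
  return y;
}
```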
ZCHNO
14f69bddc8
[fix] fix comparison operator for integer_subbyte (#1090) 2023-09-26 17:26:12 -04:00
ANIKET SHIVAM
90d3b0fb18
CUTLASS 3.2.1 (#1113)
* Updates for 3.2.1 release.

* Minor fix in gemm op profiler for raster order.

* Add scheduler mapping for raster order in the kernels.
2023-09-26 17:24:26 -04:00
reed
e0aaa3c3b3
fix GmmaDescriptor print format string error (#1102) 2023-09-19 23:27:58 -04:00
Vadim Markovtsev
8783c41851
Replace 0x1f with 0xffffffff in __shfl_sync (#1097)
This fixes compatibility with H100 and resolves #1094
2023-09-18 19:58:19 -04:00
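The mask semantics behind this fix, shown in a minimal warp-reduction sketch: the first argument of the `__shfl_*_sync` family is a 32-bit lane participation mask, so all-lanes code must pass `0xffffffff`; `0x1f` would name only lanes 0 through 4 and gives undefined results when other lanes call in.

```cpp
// Butterfly sum across a full warp using the correct all-lanes mask.
__device__ float warp_sum(float v) {
  for (int offset = 16; offset > 0; offset >>= 1) {
    v += __shfl_xor_sync(0xffffffffu, v, offset);
  }
  return v;
}
```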
Yujia Zhai
6407bcdf0a
fix matrix B indices (#1089) 2023-09-12 14:04:18 -04:00
tpoisonooo
a77b2c9cb8
style(examples): typo (#1080)
* Update ampere_tensorop_conv2dfprop.cu

Learning CUTLASS; this PR fixes a typo.

* Update ampere_gemm_operand_reduction_fusion.cu
2023-09-11 10:13:22 -04:00
ANIKET SHIVAM
34bbadd3ff
standardize fp8 generator (#1078) 2023-09-07 14:36:33 -04:00
Driss Guessous
88c0d7c726
make visible only on device (#1071) 2023-09-07 13:00:46 -04:00
Vijay Thakkar
e01b9b5029
Shard gemm reference templates into multiple TUs for parallel compilation (#1043)
* Split apart gemm reference templates into multiple TUs for parallel compilation

* remove old files

* better balancing of ref kernels across TUs

* remove 3 newly added refcheck kernels and some unnecessary fp8 library instances to reduce lib size

* remove auto fp8 kernels

* remove some redundant kernels
2023-08-30 16:46:30 -04:00
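The technique above in outline, with hypothetical names (not the actual CUTLASS reference headers): declarations stay in a shared header, while explicit instantiations are split across several translation units so the build system can compile the shards in parallel.

```cpp
// reference_gemm.h -- shared declaration (definition in a companion header)
template <typename T>
void reference_gemm(int m, int n, int k, T const *A, T const *B, T *C);

// shard_0.cu -- explicitly instantiates one subset of the templates:
//   #include "reference_gemm.h"
//   template void reference_gemm<float>(int, int, int,
//                                       float const *, float const *, float *);
//
// shard_1.cu -- a different subset, compiled in parallel with shard_0.cu:
//   #include "reference_gemm.h"
//   template void reference_gemm<double>(int, int, int,
//                                        double const *, double const *, double *);
```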
Aman Gupta Karmani
34fd98056b
fix cinttypes issue with STDC_FORMAT_MACROS (#1068)
* fix cinttypes issue with STDC_FORMAT_MACROS

* Update mma_sm90_desc.hpp

* Update mma_sm90_desc.hpp

---------

Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
2023-08-29 14:59:33 -04:00
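The pitfall this commit works around, in a self-contained sketch: with some toolchains, `<cinttypes>` only exposes the `PRI*` format macros if `__STDC_FORMAT_MACROS` is defined before the header is first included.

```cpp
#define __STDC_FORMAT_MACROS  // must precede the first <cinttypes> include
#include <cinttypes>
#include <cstdio>

void print_u64(uint64_t v) {
  std::printf("%" PRIu64 "\n", v);  // PRIu64 expands to the right specifier
}
```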
Ying Zhang
3a8f57a3c8
Add simple hash and eq methods for gemm_operations. (#1053) 2023-08-27 20:41:57 -04:00
reed
6673df0e48
fix typos (#1059) 2023-08-27 00:49:26 -04:00
Lufang Chen
7618e9bfd8
Fix numeric conversion warning (#1021)
* fix numeric conversion unused var

* update

---------

Co-authored-by: Lufang CHEN 陈橹方 <lufang.chen@nio.com>
2023-08-27 00:42:44 -04:00
ANIKET SHIVAM
a88c41cf8d
Updates for 3.2 release (#1065) 2023-08-25 23:05:46 -04:00
reed
27de343535
Add one publication inspired by CUTLASS (#1022) 2023-08-22 10:00:17 -04:00
Allard Hendriksen
2a9fa23e06
Avoid cute::print compiler warnings with -Wformat-security (#1041)
Fixes issue #1040.
2023-08-18 14:38:27 -04:00
zhu jianjiang
2e56cfabee
fix typo (#1047) 2023-08-18 14:08:26 -04:00
lorenzo chelini
3930f709ce
Fix typo in 0x_gemm_tutorial.md (#1035) 2023-08-17 10:52:20 -04:00
Haibin Lin
7e5ee8b7bf
[doc] fix: fix typos in the comment (#1049) 2023-08-16 11:39:25 -04:00
Sophia Wisdom
2d9a557427
torch.bfloat16 support in cutlass python (#1037)
* torch.bfloat16 support in cutlass python

* Update datatypes.py
2023-08-16 11:38:53 -04:00
ANIKET SHIVAM
4575443d44
CUTLASS 3.2 (#1024)
* CUTLASS 3.2
2023-08-07 20:50:32 -04:00
Xianyao Zhang
a0d787b746
Fix one publication (#1019) 2023-07-28 11:40:17 -04:00
Sophia Wisdom
d20f3a9542
spelling (#1007)
logicial -> logical
2023-07-20 14:41:11 -04:00
Tianqi Zhang (张天启)
8e85580859
fix layout bug (#1006) 2023-07-19 14:26:01 -04:00
dan_the_3rd
146d314057
Update fMHA kernels (#992)
* Update fMHA kernels

Upstream recent changes to fMHA that we did in xFormers.
Previous version in CUTLASS: facebookresearch/xformers@b6be33a
Updating to: facebookresearch/xformers@55a4798

* minor changes

* make var work

---------

Co-authored-by: danthe3rd <danthe3rd>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-07-12 22:30:46 -04:00
masahi
f679663224
Add RMS norm (#979) 2023-07-10 21:31:27 -04:00
ChangyouSiom
e066ced33b
fix epilogue iterator error (#995)
* fix epilogue iterator error

* fix epilogue iterator error

---------

Co-authored-by: maxiao <maxiao@cowarobot.com>
2023-07-10 21:30:31 -04:00
Nathan Wang
9b923dd4c4
fix minor typos (#984) 2023-07-05 09:23:01 -04:00
q.yao
f6d42f2dd0
add library_dirs (#977) 2023-06-14 12:09:12 -04:00
ANIKET SHIVAM
473a67073e
Fix Int8 and TF32 generator (#976) 2023-06-12 12:32:52 -04:00
Jack Kosaian
87349d3496
Add grouped b2b GEMM (#970) 2023-06-05 17:16:57 -04:00
Vijay Thakkar
fde824af21
Update Hopper performance plot for CUTLASS 3.1 + CTK 12.1 (#967) 2023-06-01 14:52:40 -04:00
Jack Kosaian
7dbf423763
Add conversion from ElementBias to ElementCompute (#961) 2023-05-26 23:08:36 -04:00
Haicheng Wu
6f47420213
Update README.md 2023-05-24 12:40:31 -04:00
Haicheng Wu
4638250469
Update CHANGELOG.md 2023-05-24 12:39:42 -04:00
Haicheng Wu
7859fe322a
Update PUBLICATIONS.md 2023-05-24 12:36:12 -04:00
Aleksandar Samardžić
d3e72719b4
Add support for sparse GEMM with row broadcasted bias vector (#951) 2023-05-24 10:25:05 -04:00
Ali Hassani
b4ab501767
Adds CUDA path for x86-64 (#957) 2023-05-24 10:21:25 -04:00
ANIKET SHIVAM
f079619f5e
More updates for 3.1 (#958)
* Updates for 3.1

* Minor change

* doc link fix

* Minor updates
2023-05-24 10:17:16 -04:00
Ali Hassani
13f413493a
Stream-K with broadcast (#892)
* [WIP] GEMM StreamK w/ Fused Epilogue

* Adds Gemm Streamk with Fused Epilogue kernel level struct.
  * Mostly based on Gemm with Fused Epilogue,
  * Requires a new epilogue
  * Work in progress

* [WIP] StreamK support for GemmUniversalWithBroadcast

* Just based off of how StreamK is allowed in GemmUniversal
  * Untested and a work in progress

* Minor fixes

* [WIP] It compiles!

It is almost certainly incorrect, but we're past getting the templates
to match, so checkpointing.

* Correction to reference kernel

* Fix typo

* Added MSE measurement

* Switch back to reference kernel + host for loop

Still WIP. We now get an even larger MSE, but it shows up on both
basic Split-K and Stream-K.

* Fix typos

* Fix broadcast vector + requested changes

* Comment typo

* Small int option and more

* Fix incorrect condition for when the source is needed

* Requested changes

* I think I got it?

* Bias vector should be stride 0

* Two source added!

* Typos

* Merge examples

* Bring back vector row offset

Just to ensure consistency with universal gemm with fused epilogue

* Base arguments and params structs for StreamK

* StreamK epilogue with broadcast now inherits the original

* undo params_streamk_base.h

---------

Co-authored-by: Ali Hassani <ahassanijr@gmail.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-05-22 19:05:06 -04:00
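One concrete point from the list above, "Bias vector should be stride 0", in a simplified sketch (hypothetical code): a row-broadcast bias is modeled as an M x N operand whose row stride is zero, so every row reads the same N-element vector.

```cpp
// Broadcast-by-stride: with row_stride == 0 all rows alias the same data,
// so an N-element bias vector behaves like a full M x N matrix.
float bias_at(float const *bias, int row, int col) {
  int const row_stride = 0;             // broadcast across rows
  return bias[row * row_stride + col];  // always bias[col]
}
```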