reed
eb01d5449d
fix cp.async L2 prefetch typo ( #1187 )
2023-11-28 16:58:04 -05:00
Jack Kosaian
8098336d51
Updates to Python interface for PyPI packaging ( #1209 )
...
* Updates
* Updates to notebooks
2023-11-28 13:52:12 -05:00
Sergey Klevtsov
b5d8a5d9cc
Allow SM90 pingpong kernel to use custom tile schedulers ( #1194 )
...
Co-authored-by: Sergey Klevtsov <sklevtsov@nvidia.com>
2023-11-15 13:45:17 -05:00
reed
6e60b9b17c
enable L2::128B prefetch for cp.async by default ( #1177 )
2023-11-13 13:30:13 -05:00
Changho Hwang
1ab6cc7b68
Fix std::abs overloading for bfloat16_t ( #1179 )
2023-11-13 13:29:45 -05:00
Manish Gupta
5ae8133cfa
Doc only change changelog 3.3 ( #1180 )
2023-11-13 13:29:22 -05:00
reed
39c6a83f23
fix missing return warning ( #1173 )
2023-11-03 22:42:59 -04:00
wang-y-z
1d7f2a207e
Fix several broken links ( #1168 )
...
Co-authored-by: isaacw <isaacw@nvidia.com>
2023-11-03 00:01:25 -04:00
wang-y-z
557be3ab0e
Fix several typos ( #1169 )
...
Co-authored-by: isaacw <isaacw@nvidia.com>
2023-11-02 23:54:46 -04:00
Pradeep Ramani
c008b4aea8
CUTLASS 3.3.0 ( #1167 )
...
* Release 3.3.0
Adds support for mixed precision GEMMs On Hopper and Ampere
Adds support for < 16B aligned GEMMs on Hopper
Enhancements to EVT
Enhancements to Python interface
Enhancements to Sub-byte type handling in CuTe
Several other bug-fixes and performance improvements.
* minor doc update
2023-11-02 11:09:05 -04:00
reed
922fb5108b
clean the format ( #1140 )
2023-10-24 22:59:06 -04:00
cyyever
7a7796afae
Fix is_zero ( #1147 )
...
* Fix is_zero
* Use constexpr
* Add CUTLASS_PRAGMA_UNROLL to loops
* Avoid if branches in is_zero
2023-10-23 12:09:37 -04:00
milesvant
fb10fa5308
Fix broken pipeline link in docs ( #1143 )
2023-10-18 12:55:46 -04:00
Haicheng Wu
5e1a0a5adb
fix alignmentC for h16816_s8xf16 ( #1146 )
...
* fix alignmentC for h16816_s8xf16
* manish's change
2023-10-17 15:15:39 -04:00
Manish Gupta
757275f279
Adding more Threadblock Tiles for Mixed-input TensorOp (BF16 * S8) in cutlass_library ( #1132 )
...
* Adding more tiles in the cutlass_library for mixed-input support.
* fix rebase issue
* more tiles to upcast a
2023-10-13 11:33:15 -04:00
reed
fa8dfe631f
fix missing return warning for repeat and axpby ( #1124 )
2023-10-12 00:05:45 -04:00
Jake Hemstad
112590114d
Add config.yml issue template with Discord link. ( #1135 )
2023-10-10 12:13:04 -04:00
Manish Gupta
ff02da2667
Fix parallel split-k ( #1116 )
2023-10-06 12:02:40 -04:00
Krzysztof Lecki
4082fed85a
Add missing int64 and uint64 overloads for conj ( #1127 )
...
Signed-off-by: Krzysztof Lecki <klecki@nvidia.com>
2023-10-05 20:01:44 -04:00
Fabian Schuetze
5f13dcad78
set kIsHeavy member variables ( #1012 )
...
* set kIsHeavy member variables
* correct kIsHeavy value for Tanh
* set kIsHeavy=false for HardSwish
---------
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-10-04 12:38:36 -04:00
Kyle Gerard Felker
61a38f83dc
Add #include <limits> to platform.h ( #1121 )
...
Closes #1118
2023-10-02 21:41:25 -04:00
masahi
ff61a49dd1
Allow changing epsilon parameter in RMS norm kernel ( #1112 )
2023-10-02 20:40:28 -04:00
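For context on the epsilon parameter mentioned in the entry above, here is a minimal sketch of RMS normalization with a configurable epsilon. It assumes the textbook formulation (divide by the square root of the mean square plus epsilon, then scale by a weight). The function name `rms_norm`, its signature, and the default value of `eps` are hypothetical and are not taken from the CUTLASS kernel.

```python
import math

def rms_norm(x, weight, eps=1e-5):
    """Textbook RMS norm over one vector; eps is a hypothetical default."""
    # Mean of squares over the vector
    mean_sq = sum(v * v for v in x) / len(x)
    # Epsilon keeps the denominator away from zero for all-zero inputs
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    # Normalize each element and apply the per-element weight
    return [v * inv_rms * w for v, w in zip(x, weight)]

# A larger epsilon damps the output for small-magnitude inputs
small = [1e-3, -1e-3]
print(rms_norm(small, [1.0, 1.0], eps=1e-5))
print(rms_norm(small, [1.0, 1.0], eps=1e-8))
```

Exposing epsilon as a parameter lets callers match the value used by the model being run instead of relying on a hard-coded constant.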
Lequn Chen
26986bbc60
Fix type typo in rmsnorm ( #1119 )
...
Initially the variable `h4` is `half4`, but its last two fields are not used. Based on the semantics and the context, I believe it should be `half2`.
2023-10-02 20:40:04 -04:00
Manish Gupta
7d8317a63e
Support for Mixed Input TensorOp ( #1084 )
...
* Passing warp-level mixed input F16*(S8/U8) tests
* passing device-level mixed input F16*(S8/U8) tests
* add to profiler - I8 (111 TFLOPs), U8 (123 TFLOPs)
* fast numeric conversions (I8 = 132 TFLOPs, U8 = 148 TFLOPs)
* Speedup reference compilation (REVERT THIS COMMIT)
* wider_add.u32_packed_sub.f16x2 (I8 = 132TFLOP/s, U8 = 170 TFLOP/s)
* Improve s8->f16 cvt and support bf16*u8 @158 TFLOPs
* BF16 * S8 (142 TFLOPs)
* Handle mixed-input upcast on OperandA (Support [S8|U8]*[F16|BF16])
* rename OpMultiplyAddMixedInput to OpMultiplyAddMixedInputUpcast
* Add device-level test and profiler support for upcast on operand A
* Move shfl before the cvt and reduce #shfls by 1/2
* fix smem_usage calculation for mixed_input types
* uncomment the stuff (getting ready for merge)
* profiler changes and mixed-input reference
* mixed input reference are in a new file
* use platform instead of std
* comments and typo only
* Use CreateGemmOperator and delete CreateMixedInputGemmOperator
* copyright for new files
* rebase follow-up
2023-09-27 11:18:30 -04:00
Manish Gupta
5cd735c48e
Fix Parallel Split-K on Gemm Operation Profiler ( #1109 )
...
* Debug and fix for parallel split-k in profiler
* restore debug files and remove prints
2023-09-26 17:28:00 -04:00
xuhaoran
67ae8e0603
Change the position of the minus sign on line 1549 of array.h ( #1091 )
...
When I use cutlass::epilogue::thread::LinearCombinationSigmoid, I encounter this error:
cutlass/include/cutlass/array.h(1549): error: no operator "-" matches these operands
Moving operator "-" from line 1549 to 1548 can solve this error
2023-09-26 17:26:39 -04:00
ZCHNO
14f69bddc8
[fix] fix comparison operator for integer_subbyte ( #1090 )
2023-09-26 17:26:12 -04:00
ANIKET SHIVAM
90d3b0fb18
CUTLASS 3.2.1 ( #1113 )
...
* Updates for 3.2.1 release.
* Minor fix in gemm op profiler for raster order.
* Add scheduler mapping for raster order in the kernels.
2023-09-26 17:24:26 -04:00
reed
e0aaa3c3b3
fix GmmaDescriptor print format string error ( #1102 )
2023-09-19 23:27:58 -04:00
Vadim Markovtsev
8783c41851
Replace 0x1f with 0xffffffff in __shfl_sync ( #1097 )
...
This fixes compatibility with H100 and resolves #1094
2023-09-18 19:58:19 -04:00
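As background for the entry above: the first argument of `__shfl_sync` is a 32-bit member mask in which bit i selects lane i of the warp. The sketch below only decodes the two mask values named in the commit title. The helper `lanes_in_mask` is hypothetical, and the log does not say why the narrower mask caused problems on H100.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs

def lanes_in_mask(mask):
    """Return the lane indices whose bit is set in a 32-bit member mask."""
    return [lane for lane in range(WARP_SIZE) if (mask >> lane) & 1]

old_lanes = lanes_in_mask(0x1F)        # mask before the change
new_lanes = lanes_in_mask(0xFFFFFFFF)  # mask after the change

print(old_lanes)       # [0, 1, 2, 3, 4]
print(len(new_lanes))  # 32
```

Read this way, 0x1f names only lanes 0 through 4, while 0xffffffff names all 32 lanes of the warp.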
Yujia Zhai
6407bcdf0a
fix matrix B indices ( #1089 )
2023-09-12 14:04:18 -04:00
tpoisonooo
a77b2c9cb8
style(examples): typo ( #1080 )
...
* Update ampere_tensorop_conv2dfprop.cu
learning cutlass, PR a typo.
* Update ampere_gemm_operand_reduction_fusion.cu
2023-09-11 10:13:22 -04:00
ANIKET SHIVAM
34bbadd3ff
standardize fp8 generator ( #1078 )
2023-09-07 14:36:33 -04:00
Driss Guessous
88c0d7c726
make only visible on device ( #1071 )
2023-09-07 13:00:46 -04:00
Vijay Thakkar
e01b9b5029
Shard gemm reference templates into multiple TUs for parallel compilation ( #1043 )
...
* Split apart gemm reference templates into multiple TUs for parallel compilation
* remove old files
* better balancing of ref kernels across TUs
* remove 3 newly added refcheck kernels and some unnecessary fp8 library instances to reduce lib size
* remove auto fp8 kernels
* remove some redundant kernels
2023-08-30 16:46:30 -04:00
Aman Gupta Karmani
34fd98056b
fix cinttypes issue with STDC_FORMAT_MACROS ( #1068 )
...
* fix cinttypes issue with STDC_FORMAT_MACROS
* Update mma_sm90_desc.hpp
* Update mma_sm90_desc.hpp
---------
Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
2023-08-29 14:59:33 -04:00
Ying Zhang
3a8f57a3c8
Add simple hash and eq methods for gemm_operations. ( #1053 )
2023-08-27 20:41:57 -04:00
reed
6673df0e48
fix typos ( #1059 )
2023-08-27 00:49:26 -04:00
Lufang Chen
7618e9bfd8
Fix numeric conversion warning ( #1021 )
...
* fix numeric conversion unused var
* update
---------
Co-authored-by: Lufang CHEN 陈橹方 <lufang.chen@nio.com>
2023-08-27 00:42:44 -04:00
ANIKET SHIVAM
a88c41cf8d
Updates for 3.2 release ( #1065 )
2023-08-25 23:05:46 -04:00
reed
27de343535
Add one Publication which is inspired by cutlass ( #1022 )
2023-08-22 10:00:17 -04:00
Allard Hendriksen
2a9fa23e06
Avoid cute::print compiler warnings with -Wformat-security ( #1041 )
...
Fixes issue #1040 .
2023-08-18 14:38:27 -04:00
zhu jianjiang
2e56cfabee
fix typo ( #1047 )
2023-08-18 14:08:26 -04:00
lorenzo chelini
3930f709ce
Fix typo in 0x_gemm_tutorial.md ( #1035 )
2023-08-17 10:52:20 -04:00
Haibin Lin
7e5ee8b7bf
[doc] fix: fix typos in the comment ( #1049 )
2023-08-16 11:39:25 -04:00
Sophia Wisdom
2d9a557427
torch.bfloat16 support in cutlass python ( #1037 )
...
* torch.bfloat16 support in cutlass python
* Update datatypes.py
2023-08-16 11:38:53 -04:00
ANIKET SHIVAM
4575443d44
CUTLASS 3.2 ( #1024 )
...
* CUTLASS 3.2
2023-08-07 20:50:32 -04:00
Xianyao Zhang
a0d787b746
Fix one publication ( #1019 )
2023-07-28 11:40:17 -04:00
Sophia Wisdom
d20f3a9542
spelling ( #1007 )
...
logicial -> logical
2023-07-20 14:41:11 -04:00
Tianqi Zhang (张天启)
8e85580859
fix layout bug ( #1006 )
2023-07-19 14:26:01 -04:00