Jongsoo Park
1db6971a8d
Remove unused gemm_k_iterations in GemmKernel::Params ( #406 )
...
Otherwise we get gemm_k_iterations is uninitialized warnings.
2022-02-16 09:52:45 -05:00
Bing Xu
d0d941efc7
[hardswish] correct implementation ( #403 )
...
* [hardswish] correct implementation
* seems working
* hardswish fp32/fp16x2 optimization
* [relu] half2 support
* add relu0; add multiply_add_relu0;
* cleanup
Co-authored-by: Bing Xu <bingxu@fb.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2022-02-09 14:28:53 -05:00
Andrew Kerr
8a951b2940
Enable convolution with fused epilogue for Volta Tensor Cores ( #402 )
...
* Enabled convolution with epilogue fusion for Volta Tensor Cores.
* Compilation fixes
* Disabled testing Volta on Ampere architectures.
2022-01-30 23:24:50 -05:00
masahi
c2ee13a0fe
Add epilogue functor for residual block fusion ( #391 )
...
* Add epilogue functor for residual block fusion
* Do not run split-k tests when ActivationOp is not Identity
* explain TestSplitK param
* return early
2021-12-29 22:53:40 -05:00
masahi
dceabd4c5a
Support half precision sigmoid activation ( #378 )
...
* Support half precision sigmoid activation
* introduce a vectorized variant using fast_tanh
* move the math to fast_math.h
* fixed compile
* .raw() -> .to_half()
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2021-12-22 14:45:06 -05:00
Andrew Kerr
288af365db
Added missing synchronization to avoid WAR hazards between tiles. ( #386 )
2021-12-20 08:34:08 -08:00
masahi
0dc3ba60b3
Refactor GELU and Sigmoid epilogue to use a common template (and add SiLu, Hardswish epilogue) ( #379 )
...
* Support half precision sigmoid activation
* introduce a vectorized variant using fast_tanh
* refactored sigmoid using the new interface
* refactored gelu
* add silu activation
* add hardswish
* remove sigmoid for now
* add description to silu and hardswish, and other doc update
* Do not ignore Round
* use constant N
* Set isHeavy = true in sigmoid and silu epilogue
2021-12-18 14:58:15 -05:00
Andrew Kerr
ec4f7e5194
Updates to fused epilogue ( #383 )
...
* Enhancements and fixes to fused GEMM and Convolution epilogue.
* Need to explicitly list cudart as unit test library dependency.
2021-12-17 16:04:43 -05:00
Manish Gupta
808c25337a
CUTLASS 2.8 ( #363 )
...
CUTLASS 2.8
2021-11-19 13:26:35 -08:00
reed-lau
3b28642801
fix wmma shape typo
2021-09-28 19:04:09 +08:00
Manish Gupta
2e07c4cc2f
CUTLASS 2.7 ( #318 )
...
CUTLASS 2.7
* Mainloop fusion for GEMM: summation over A or B
* Strided DGRAD (optimized iterators)
* Half-precision GELU_taylor activation functions
  * Use these when accumulation and epilogue compute types are all cutlass::half_t
* Tuning and bug fixes to fused GEMM + GEMM example
* Support for smaller than 128b aligned Convolutions: see examples
* Caching of results to accelerate Convolution unit tests
  * Can be enabled or disabled by running cmake .. -DCUTLASS_TEST_ENABLE_CACHED_RESULTS=OFF
* Corrections and bug fixes reported by the CUTLASS community
  * Thank you for filing these issues!
authored-by: Haicheng Wu <haichengw@nvidia.com>, Manish Gupta <manigupta@nvidia.com>, Dustyn Blasig <dblasig@nvidia.com>, Andrew Kerr <akerr@nvidia.com>
2021-09-20 11:02:22 -07:00
Haicheng Wu
59e2aa505a
refine the implementation
2021-09-08 13:14:08 +00:00
Haicheng Wu
4e8af93da1
Merge remote-tracking branch 'origin/master' into small_alignment
2021-09-07 20:39:38 +00:00
Manish Gupta
6c2f8f2fb8
CUTLASS 2.6.1 - functional and performance enhancements to strided DGRAD, fixes, and tuning
...
* cutlass 2.6 update
* remove debug prints
* cutlass 2.6.1 (minor update)
* Updated CHANGELOG.
* Minor edit to readme to indicate patch version.
* Minor edit to readme.
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>, Andrew Kerr <akerr@nvidia.com>
2021-09-03 10:26:15 -07:00
Haicheng Wu
598e35401c
Merge remote-tracking branch 'origin/master' into small_alignment
2021-08-16 07:49:08 -07:00
Haicheng Wu
a77c658439
fix epilogue register spill
2021-07-29 14:25:48 -07:00
Manish Gupta
1ac4559d12
Cutlass 2.6 Update 1 ( #301 )
...
* cutlass 2.6 update
* remove debug prints
2021-07-27 17:58:30 -07:00
Manish Gupta
e5d51840e8
CUTLASS 2.6 ( #298 )
...
CUTLASS 2.6
2021-07-23 00:40:53 -04:00
Tian Jin
e3c56b0d6b
Update predicated_tile_iterator.h
2021-07-05 12:11:53 -04:00
Tian Jin
4647c57243
Update predicated_tile_iterator.h
2021-07-05 12:06:41 -04:00
Manikandan Ananth
47ebfccbec
bug fixes
2021-06-02 10:08:25 -07:00
Manikandan Ananth
da2f110906
Fixes for public issue #265
2021-05-19 10:16:52 -07:00
Zheng Zeng
a68d7cd6f1
Adds NoBetaScaling for LinearCombination
2021-05-12 22:23:55 +08:00
Haicheng Wu
f58b843951
Merge pull request #239 from KeDengMS/kedeng/gelu
...
Fixes to Gelu for half and fusion
2021-05-08 12:51:42 -04:00
Haicheng Wu
5fc142296f
Merge pull request #237 from Peter9606/issue_236_typo
...
Typo fix issue#236
2021-05-08 07:51:19 -04:00
Haicheng Wu
233d69aa6d
Merge pull request #235 from Peter9606/issue_233_tranpose_update
...
transpose.h update based on issue#233
2021-05-07 07:14:30 -04:00
mengchi.hmc
f4b0a33633
add unit test for non int4 load
2021-04-23 14:33:46 +08:00
mengchi.hmc
bb35a3ba6f
support setting load granularity for conv2d fprop
2021-04-22 15:20:57 +08:00
mengchi.hmc
7ec3a87f22
support unaligned input for conv2d fprop stage=2. Fix for issue #242
2021-04-21 14:40:05 +08:00
KeDengMS
0b74c8f473
Address CR
2021-04-19 23:36:06 +00:00
KeDengMS
83036ed646
More clean up
2021-04-18 04:29:20 +00:00
KeDengMS
41a31b404b
Fixes to Gelu for half and fusion
2021-04-17 22:10:19 +00:00
Peter Han
7320aee17d
Typo fix issue#236
...
Signed-off-by: Peter Han <fujun.han@iluvatar.ai>
2021-04-15 15:08:35 +08:00
Peter Han
2142a05d9d
transpose.h update based on issue#233
...
1. Add 'pragma once' preprocessor directive
2. Replace prmt PTX with __byte_perm intrinsic
Signed-off-by: Peter Han <fujun.han@iluvatar.ai>
2021-04-14 19:58:00 +08:00
Manikandan Ananth
08993707da
fixing functional bug in fused epilogue
2021-04-09 11:36:03 -07:00
Manikandan Ananth
4839b6cb61
add 2stage fprop 3d into default file
2021-04-07 13:29:32 -07:00
Haicheng Wu
d97214987a
Merge pull request #220 from Peter9606/wrong-stride-array-definition
...
Bugfix: typo, make reduction device cases pass
2021-04-02 08:43:52 -04:00
Peter Han
7074047a54
Bugfix: typo, make reduction device cases pass
...
Signed-off-by: Peter Han <fujun.han@iluvatar.ai>
2021-04-02 09:35:23 +08:00
Manikandan Ananth
75a4737cfe
Fix for public issue #211
...
- Add a slice-K tile size to the profiler
- fix num warps calculations in implicit gemm header
2021-04-01 14:42:00 -07:00
Peter Han
6a6b4028bd
Revert wrong fix of params.update in GemmUniversalBase
...
Signed-off-by: Peter Han <fujun.han@iluvatar.ai>
2021-03-23 23:20:40 +08:00
Peter Han
92393b2676
Bugfix: memsetAsync uses wrong default stream
...
Signed-off-by: Peter Han <fujun.han@iluvatar.ai>
2021-03-23 21:11:42 +08:00
Peter Han
169181f30f
Make Shape public from Mma_HFMA2.
...
Signed-off-by: Peter Han <fujun.han@iluvatar.ai>
2021-03-04 11:05:16 +08:00
Andrew Kerr
746b7b3247
Enabled tensor reduction kernels.
2021-02-26 15:32:19 -05:00
Andrew Kerr
0e13748649
CUTLASS 2.5
2021-02-26 09:58:26 -05:00
Manish Gupta
6615010cd0
CUTLASS 2.4 (Implicit GEMM convolution) ( #147 )
...
CUTLASS 2.4 (Implicit GEMM Convolution)
Co-authored-by: Manish Gupta <manigupta@nvidia.com>, Haicheng Wu <haichengw@nvidia.com>, Dustyn Blasig <dblasig@nvidia.com>, Andrew Kerr <akerr@nvidia.com>
2020-11-19 21:25:25 -08:00
akerr
37a8f9e598
CUTLASS 2.3.0 final.
2020-09-25 10:34:46 -07:00
Andrew Kerr
c53f3339bb
CUTLASS 2.3 initial commit ( #134 )
...
CUTLASS 2.3 adds GEMMs targeting Sparse Tensor Cores on the NVIDIA Ampere Architecture, fast SGEMM, and small matrix classes, bug fixes, and performance enhancements.
2020-09-23 14:00:58 -07:00
hwu36
4dac7490e6
Typos ( #107 )
...
* Update splitk_gemm.cu
* Update gemm_bias_relu.cu
* Update mma_sm75.h
2020-07-13 14:25:52 -07:00
Andrew Kerr
1ab1027954
Updated mma_sm80.h to avoid perf penalty due to reinterpret_cast<>. ( #100 )
...
- Updated mma_sm80.h to avoid perf penalty due to reinterpret_cast<>.
- Enhancement to CUTLASS Utility Library's HostTensorPlanarComplex template to support copy-in and copy-out
- Added test_examples target to build and test all CUTLASS examples
- Minor edits to documentation to point to GTC 2020 webinar
2020-06-15 10:47:01 -07:00
Andrew Kerr
86931fef85
CUTLASS 2.2 ( #96 )
...
Adds support for NVIDIA Ampere Architecture features. CUDA 11 Toolkit recommended.
2020-06-08 16:17:35 -07:00