Commit Graph

203 Commits

Author SHA1 Message Date
Yang Chen
095cbba57c
Example 23 - Passing correct alpha and beta values with --parallel-split-k (#424)
When split-k is enabled, we should set alpha to 1 and beta to 0 for the
split-k gemm kernel.

The fix was from hwu36. I only did fixed some minor typos along with his
fix.
2022-03-22 12:27:34 -04:00
Janusz Lisiecki
8f1fe7a132
Fix separate compilation -dc (#433)
* Fix separate compilation `-dc`

- when cutlass is included in multiple compilation units
  compiled with `-dc` OOB_NAN_F16x8 device constant is
  instantiated multiple times causing
  Multiple definition of '_ZN7cutlass4arch13OOB_NAN_F16x8E' error
  This PR makes this variable a local constant as it is not
  modified during runtime

Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>

* Fix

Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>

* Test GH

Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>

* Revert test GH

Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
2022-03-22 12:21:18 -04:00
Yuanqiang Liu
3ab1eacf09
Fix typo in profiler examples (#437) 2022-03-21 12:00:13 -04:00
Feng Shijie
cd39c75e25
Fix typo in docs, code comments (#429)
* [docs] fix typo in media/docs/layout.md

* [docs] fix comment error

* fix typo in include/cutlass/arch/simd_61.h

* fix stride comment errors in TensorLayout
2022-03-15 21:54:36 -04:00
Haicheng Wu
b2e1e97cb1
Update PUBLICATIONS.md
ACM Trans on Graphics from nv research.
2022-03-01 22:37:18 -05:00
HouQiming
96a11a1ef3
Removed trivial copy constructors on parameter classes to enable devi… (#366)
* Removed trivial copy constructors on parameter classes to enable device-side launch of CUTLASS kernels

* Added SFINAE to the `TensorRef(NonConstTensorRef const&)` constructor to avoid making it a copy-constructor for device code

* std => platform

* fix affine2

* really fix affine2

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2022-02-28 21:34:02 -05:00
Ivan Komarov
e96f00586c
Make cutlass::gemm::device::GemmArray usable (#295)
* Fix the build of cutlass/gemm/device/gemm_array.h and add a demo for GemmArray

* Add a reference to GemmArray to the docs

Co-authored-by: Ivan Komarov <dfyz@yandex-team.ru>
2022-02-17 20:01:05 -05:00
Jongsoo Park
3cfa5db2a2
Actually use float accumulation in gemm_f16t_f16t_f16t_wmma_tensor_op… (#407)
* Actually use float accumulation in gemm_f16t_f16t_f16t_wmma_tensor_op_f32_sm70.cu

As title

* Update gemm_f16t_f16t_f16t_wmma_tensor_op_f32_sm70.cu

change the missing one

Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
2022-02-16 09:53:21 -05:00
Jongsoo Park
1db6971a8d
Remove unused gemm_k_iterations in GemmKernel::Params (#406)
Otherwise we get gemm_k_iterations is uninitialized warnings.
2022-02-16 09:52:45 -05:00
Haicheng Wu
b954127297
Update PUBLICATIONS.md
@jackkosaian
2022-02-14 16:54:32 -05:00
Bing Xu
d0d941efc7
[hardswish] correct implmentation (#403)
* [hardswish] correct implmentation

* seems working

* hardswish fp32/fp16x2 optimization

* [relu] half2 support

* add relu0; add multiply_add_relu0;

* cleanup

Co-authored-by: Bing Xu <bingxu@fb.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2022-02-09 14:28:53 -05:00
Andrew Kerr
8a951b2940
Enable convolution with fused epilogue for Volta Tensor Cores (#402)
* Enabled convolution with epilogue fusion for Volta Tensor Cores.

* Compilation fixes

* Disabled testing Volta on Ampere architectures.
2022-01-30 23:24:50 -05:00
Fujun Han
1e4703cbab
Support parallel split K mode for porfiling (#277)
* Support parallel split K mode for porfiling

Signed-off-by: Peter Han <fujun.han@iluvatar.ai>

* Parallel Split K support

  1. find gemm kernel by preference key
  2. switch m n for redution kernel

Signed-off-by: Peter Han <fujun.han@iluvatar.ai>

* parallel splitk for fp16 gemm

* add one missing file

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2022-01-27 10:37:37 -05:00
Dustyn Blasig
c3353add63
Merge pull request #388 from depaulmillz/fix/headersonly
Fix utils include not being installed in header only
2022-01-26 14:22:51 -06:00
dePaul Miller
ac8825b941 Minor fix to change from LIBRARY_INIT to LIBRARY 2022-01-26 15:17:46 -05:00
Haicheng Wu
8fd94806e5
Update PUBLICATIONS.md
add mlsys 2022 paper.
2022-01-17 00:08:18 -05:00
Masahiro Masuda
d7c9cbf0b9
Fix typo in scripts/library.py (wrong data size for u8) (#393) 2022-01-07 13:29:56 -05:00
masahi
c2ee13a0fe
Add epilogue functor for residual block fusion (#391)
* Add epilogue functor for residual block fusion

* Do not run split-k tests when ActivationOp is not Identity

* explain TestSplitK param

* return early
2021-12-29 22:53:40 -05:00
Haicheng Wu
f78994bb40
add the missing pieces (#392)
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2021-12-25 04:29:54 -08:00
masahi
dceabd4c5a
Support half precision sigmoid activation (#378)
* Support half precision sigmoid activation

* introduce a vectorized variant using fast_tanh

* move the math to fast_math.h

* fixed compile

* .raw() -> .to_half()

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2021-12-22 14:45:06 -05:00
dePaul Miller
86fa1dc30b Fix utils include not being installed in header only 2021-12-21 12:10:26 -05:00
Andrew Kerr
288af365db
Added missing synchronization to avoid WAR hazards between tiles. (#386) 2021-12-20 08:34:08 -08:00
masahi
0dc3ba60b3
Refactor GELU and Sigmoid epilogue to use a common template (and add SiLu, Hardswish epilogue) (#379)
* Support half precision sigmoid activation

* introduce a vectorized variant using fast_tanh

* refactored sigmoid using the new interface

* refactored gelu

* add silu activation

* add hardswish

* remove sigmoid for now

* add description to silu and hardswish, and other doc update

* Do not ignore Round

* use constant N

* Set isHeavy = true in sigmoid and silu epilogue
2021-12-18 14:58:15 -05:00
Andrew Kerr
ec4f7e5194
Updates to fused epilogue (#383)
* Enhancements and fixes to fused GEMM and Convolution epilogue.
* Need to explicitly list cudart as unit test library dependency.
2021-12-17 16:04:43 -05:00
Andrew Kerr
4e666e1dfd
Updated README and added issue templates. (#382) 2021-12-17 09:26:20 -05:00
Haicheng Wu
3799e12f25
Merge pull request #381 from Peter9606/update-makefile-version
Update project version to 2.8.0 in CMakeLists.txt
2021-12-16 21:54:57 -05:00
Peter Han
fc3bc85db8 Update project version to 2.8.0 in CMakeLists.txt
Signed-off-by: Peter Han <fujun.han@iluvatar.ai>
2021-12-17 02:23:31 +00:00
Matthew Nicely
49c0a58d50 Set theme jekyll-theme-minimal 2021-12-15 14:51:24 -05:00
Andrew Kerr
5fe09c2d67
Updated GEMM performance plot with CUTLASS 2.8 compiled with CUDA 11.5 Toolkit (#375)
Updated GEMM performance plot with CUTLASS 2.8 compiled using CUDA 11.5 Toolkit.

GPUs under test:

    NVIDIA A100
    NVIDIA A2
    NVIDIA TitanV
    NVIDIA GeForce 2080 Ti
2021-12-06 14:21:33 -05:00
Andrew Kerr
6b69c79ac3
Fixed contributor formatting. (#365) 2021-11-22 11:30:53 -08:00
Andrew Kerr
62e438f450
Listed Matthew Nicely as the CUTLASS product manager.. (#364) 2021-11-19 17:51:21 -08:00
Manish Gupta
808c25337a
CUTLASS 2.8 (#363)
CUTLASS 2.8
2021-11-19 13:26:35 -08:00
Haicheng Wu
6fc5008803
Update quickstart.md
fix a broken link
2021-11-11 09:53:46 -05:00
Haicheng Wu
a3bcc6981d
Merge pull request #331 from reed-lau/feature/fix-wmma-shape-typo
fix wmma shape typo
2021-09-28 10:20:29 -04:00
reed-lau
3b28642801 fix wmma shape typo 2021-09-28 19:04:09 +08:00
Manish Gupta
538592dea4
example 23 gemm operand reduction fusion (#325) 2021-09-20 13:34:47 -07:00
Manish Gupta
2e07c4cc2f
CUTLASS 2.7 (#318)
CUTLASS 2.7

Mainloop fusion for GEMM: summation over A or B
Strided DGRAD (optimized iterators)
Half-precision GELU_taylor activation functions
Use these when accumulation and epilogue compute types are all cutlass::half_t
Tuning and bug fixes to fused GEMM + GEMM example
Support for smaller than 128b aligned Convolutions: see examples
Caching of results to accelerate Convolution unit tests
Can be enabled or disabled by running cmake .. -DCUTLASS_TEST_ENABLE_CACHED_RESULTS=OFF
Corrections and bug fixes reported by the CUTLASS community
Thank you for filing these issues!

authored-by: Haicheng Wu haichengw@nvidia.com, Manish Gupta manigupta@nvidia.com, Dustyn Blasig dblasig@nvidia.com, Andrew Kerr akerr@nvidia.com
2021-09-20 11:02:22 -07:00
Haicheng Wu
9ac255863f
Merge pull request #246 from mengchihe/master
support unalignment input for conv2d fprop stage=2 Fix for issue #242
2021-09-08 11:40:53 -04:00
Haicheng Wu
59e2aa505a refine the implementation 2021-09-08 13:14:08 +00:00
Haicheng Wu
4e8af93da1 Merge remote-tracking branch 'origin/master' into small_alignment 2021-09-07 20:39:38 +00:00
Manish Gupta
6c2f8f2fb8
CUTLASS 2.6.1 - functional and performance enhancements to strided DGRAD, fixes, and tuning
* cutlass 2.6 update

* remove debug prints

* cutlass 2.6.1 (minor update)

* Updated CHANGELOG.

* Minor edit to readme to indicate patch version.

* Minor edit to readme.

Co-authored-by:  Haicheng Wu <haichengw@nvidia.com>, Andrew Kerr <akerr@nvidia.com>
2021-09-03 10:26:15 -07:00
Haicheng Wu
598e35401c Merge remote-tracking branch 'origin/master' into small_alignment 2021-08-16 07:49:08 -07:00
Manish Gupta
a01feb93d9
Merge pull request #308 from dongxiao92/patch-1
fix typo in doc
2021-08-08 11:54:42 -07:00
dongxiao
d36f331b44
fix typo in doc
fix typo
2021-08-08 16:44:22 +08:00
Haicheng Wu
69abafb85a
Merge pull request #306 from NVIDIA/fix-profiler-cmd-doc
Fix profiler cmd doc
2021-07-30 14:36:54 -04:00
Haicheng Wu
68a078fbbf cleanup 2021-07-30 11:27:21 -07:00
Haicheng Wu
10709dbb64 clean profiler cmd and doc 2021-07-30 11:02:17 -07:00
Manish Gupta
1227351079
Merge pull request #305 from NVIDIA/fix_epilogue_spill
fix epilogue register spill
2021-07-29 14:30:11 -07:00
Haicheng Wu
a77c658439 fix epilogue register spill 2021-07-29 14:25:48 -07:00
Haicheng Wu
4516b833ce
Merge pull request #303 from Peter9606/doc_typo
Doc typo
2021-07-28 20:49:06 -04:00