Commit Graph

322 Commits

Author SHA1 Message Date
Haicheng Wu
96dad61a75
Update CHANGELOG.md 2022-04-28 10:52:10 -04:00
Haicheng Wu
cc2ea4c3fc
Update README.md 2022-04-28 10:50:11 -04:00
Andrew Kerr
a0de301283
Used relative paths for includes (#477) 2022-04-27 12:04:23 -07:00
Haicheng Wu
319a389f42
Update CMakeLists.txt (#473)
* Update CMakeLists.txt

Add 128bit int support if using nvc++ to solve #310 

@jeffhammond, would you please give it a try?

* Update CMakeLists.txt

correct copy paste error
2022-04-27 07:02:26 -07:00
Stepan Tezyunichev
71def2f084
Use platform:: instead of std::abs and std::conditional (#452)
* Fixed template struct/class mismatch

* Use platform implementation instead of std::abs and std::conditional during nvrtc compilation

* Use platform implementation instead of std::abs and std::conditional during nvrtc compilation

* Revert absolute_value() usage
2022-04-25 14:40:22 -04:00
Masahiro Masuda
70f3ba57f5
Fix typo in shared memory layout description (#471) 2022-04-24 18:32:13 -04:00
Fujun Han
dd77fadc70
Remove redundant offset def and init in shared_load_iterator.h (#456)
Signed-off-by: Fujun Han <fujun.han@iluvatar.ai>
2022-04-24 16:31:00 -04:00
Stepan Tezyunichev
be4578d517
Fixed template struct/class mismatch (#453) 2022-04-24 16:30:21 -04:00
Andrei Alexandrescu
d7b499deff
Fix CUDA_PERROR_EXIT and print failing expression (#446)
`CUDA_PERROR_EXIT` can lead to incorrect usage (see e.g. [this description](https://www.cs.technion.ac.il/users/yechiel/c++-faq/macros-with-if.html)) because it contains an incomplete `if` expression. Consider:

```
if (condition)
    CUDA_PERROR_EXIT(cudaFree(x))
else
    free(x);
```

The author of the code forgot to add a semicolon after the macro. In that case, the `else` will bind to the `if` inside the macro definition, leading to code that the author did not intend or expect. If the author does use a semicolon, the code will not compile, which is awkward.

The change adds a `do while` around the `if`, which always requires a semicolon.

This PR also adds the text of the failing expression to the printed error message.
2022-04-24 16:29:43 -04:00
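The `do while` wrapping described in the `CUDA_PERROR_EXIT` commit above can be sketched as follows. This is a minimal illustration, not the actual CUTLASS macro: `Status`, `status_string`, `CHECK_STATUS`, `release`, and `g_failures` are hypothetical names, and the sketch counts failures instead of calling `exit()` so that it runs without a CUDA toolkit.

```cpp
#include <cstdio>

// Hypothetical stand-ins for cudaError_t and cudaGetErrorString().
enum Status { kSuccess = 0, kFailure = 1 };

inline const char* status_string(Status s) {
  return s == kSuccess ? "success" : "failure";
}

// Counts failed checks; a real macro would typically exit instead.
static int g_failures = 0;

// The do { ... } while (0) wrapper turns the macro into a single statement
// that requires a trailing semicolon, so it composes safely with if/else.
// #expr stringifies the failing expression for the error message.
#define CHECK_STATUS(expr)                                        \
  do {                                                            \
    Status status_ = (expr);                                      \
    if (status_ != kSuccess) {                                    \
      std::fprintf(stderr, "%s failed: %s (%s:%d)\n", #expr,      \
                   status_string(status_), __FILE__, __LINE__);   \
      ++g_failures;                                               \
    }                                                             \
  } while (0)

// Hypothetical operation that succeeds or fails on request.
inline Status release(bool ok) { return ok ? kSuccess : kFailure; }

// The else branch binds to the outer if, as the author intends.
inline int run(bool use_checked_path) {
  int fallback_calls = 0;
  if (use_checked_path)
    CHECK_STATUS(release(false));
  else
    ++fallback_calls;
  return fallback_calls;
}
```

With a bare `if` inside the macro, the same `if`/`else` in `run` would either fail to compile or attach the `else` to the macro's internal `if`, depending on whether the semicolon is present.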
Exusial
310ed81ac3
fix description in example 12. (#444)
Co-authored-by: Exusial <Exusial>
2022-04-24 16:29:06 -04:00
Fujun Han
4c0d6e1eb4
[BUGFIX]: Force-unrolling a loop that doesn't have a compile-time constant trip count is dangerous (#441)

Signed-off-by: Peter Han <fujun.han@iluvatar.ai>
2022-04-24 16:28:32 -04:00
Jack Kosaian
167ac54c65
Fix link to Python example (#469) 2022-04-23 15:37:38 -04:00
Andrew Kerr
12f4108ac2
CUTLASS 2.9 (#468) 2022-04-23 15:02:38 -04:00
Feng Shijie
dd571f0edb
[style] fix code indentation (#449)
* [docs] fix typo in media/docs/layout.md

* [docs] fix comment error

* fix typo in include/cutlass/arch/simd_61.h

* fix stride comment errors in TensorLayout

* fix indentation
2022-04-03 21:13:17 -04:00
Jianyu Huang
6d0d265047
Update PUBLICATIONS.md (#447) 2022-04-03 21:03:28 -04:00
Haicheng Wu
f11fa975a5
Update PUBLICATIONS.md
@tsuki
2022-03-23 21:04:43 -04:00
Masahiro Masuda
0e71d9b450
Transposed conv2d and wgrad split k examples (#413)
* add split k wgrad example

* wgrad done

* begin transposed conv2d example

* update transposed conv2d example and add ref check

* update doc for conv2d transpose example

* add license

* add wgrad doc

* more clarification on GEMM output type

* typo fix

* clean up indent

* address comments

* rename example numbers to 34 and 35

* GEMM -> Implicit GEMM

* Revert "rename example numbers to 34 and 35"

This reverts commit 551a808c227216e9e38d4472ba8ff020557b8500.

* transposed_conv2d is 34

* add compiler and device version check to exit gracefully

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2022-03-23 14:52:54 -04:00
Minmin Sun (孙敏敏)
eb0d4c9213
[library] pass pointer of arguments to get_host_workspace_size() in gemm_universal() (#412)
Otherwise GemmUniversalOperation::get_host_workspace_size() will fail with a segmentation fault.
2022-03-22 12:36:34 -04:00
Haojin Yang
bc45e2c023
fixed datatype error of numeric_limits for uint1b_t (#419)
Co-authored-by: Haojin Yang <haojin.yang@.hpi.uni-potsdam.de>
2022-03-22 12:30:30 -04:00
Yang Chen
095cbba57c
Example 23 - Passing correct alpha and beta values with --parallel-split-k (#424)
When split-k is enabled, we should set alpha to 1 and beta to 0 for the
split-k gemm kernel.

The fix was from hwu36. I only fixed some minor typos along with his
fix.
2022-03-22 12:27:34 -04:00
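The reasoning in the split-k commit above can be checked with a small arithmetic sketch for a single output element. It assumes the usual two-stage scheme: each K slice produces a raw partial product (alpha = 1, beta = 0), and a separate reduction step sums the partials and applies the caller's alpha and beta once. The function names `partial_dot`, `split_k_dot`, and `split_k_dot_wrong` are illustrative only and are not CUTLASS APIs.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// One "GEMM" slice for a single output element:
// alpha * sum(a[i] * b[i] for i in [begin, end)) + beta * c.
inline float partial_dot(const std::vector<float>& a,
                         const std::vector<float>& b,
                         std::size_t begin, std::size_t end,
                         float alpha, float beta, float c) {
  float acc = 0.0f;
  for (std::size_t i = begin; i < end; ++i) acc += a[i] * b[i];
  return alpha * acc + beta * c;
}

// Correct scheme: slices use alpha = 1 and beta = 0; the reduction applies
// the real alpha and beta exactly once.
inline float split_k_dot(const std::vector<float>& a,
                         const std::vector<float>& b, float c,
                         float alpha, float beta, std::size_t slices) {
  const std::size_t k = a.size();
  const std::size_t chunk = (k + slices - 1) / slices;
  float sum = 0.0f;
  for (std::size_t s = 0; s < slices; ++s) {
    const std::size_t begin = std::min(s * chunk, k);
    const std::size_t end = std::min(begin + chunk, k);
    sum += partial_dot(a, b, begin, end, 1.0f, 0.0f, c);
  }
  return alpha * sum + beta * c;
}

// Incorrect scheme: passing the real alpha and beta to every slice adds
// beta * c once per slice instead of once overall.
inline float split_k_dot_wrong(const std::vector<float>& a,
                               const std::vector<float>& b, float c,
                               float alpha, float beta, std::size_t slices) {
  const std::size_t k = a.size();
  const std::size_t chunk = (k + slices - 1) / slices;
  float sum = 0.0f;
  for (std::size_t s = 0; s < slices; ++s) {
    const std::size_t begin = std::min(s * chunk, k);
    const std::size_t end = std::min(begin + chunk, k);
    sum += partial_dot(a, b, begin, end, alpha, beta, c);
  }
  return sum;
}
```

For `a = {1, 2, 3, 4}`, `b = {1, 1, 1, 1}`, `c = 10`, `alpha = 2`, and `beta = 0.5`, the unsplit result is `2 * 10 + 0.5 * 10 = 25`. The two-slice correct scheme also gives 25, while the incorrect one gives 30 because `beta * c` is added twice.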
Janusz Lisiecki
8f1fe7a132
Fix separate compilation -dc (#433)
* Fix separate compilation `-dc`

- when cutlass is included in multiple compilation units
  compiled with `-dc` OOB_NAN_F16x8 device constant is
  instantiated multiple times causing
  Multiple definition of '_ZN7cutlass4arch13OOB_NAN_F16x8E' error
  This PR makes this variable a local constant as it is not
  modified during runtime

Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>

* Fix

Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>

* Test GH

Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>

* Revert test GH

Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
2022-03-22 12:21:18 -04:00
Yuanqiang Liu
3ab1eacf09
Fix typo in profiler examples (#437) 2022-03-21 12:00:13 -04:00
Feng Shijie
cd39c75e25
Fix typo in docs, code comments (#429)
* [docs] fix typo in media/docs/layout.md

* [docs] fix comment error

* fix typo in include/cutlass/arch/simd_61.h

* fix stride comment errors in TensorLayout
2022-03-15 21:54:36 -04:00
Haicheng Wu
b2e1e97cb1
Update PUBLICATIONS.md
ACM Trans on Graphics from nv research.
2022-03-01 22:37:18 -05:00
HouQiming
96a11a1ef3
Removed trivial copy constructors on parameter classes to enable devi… (#366)
* Removed trivial copy constructors on parameter classes to enable device-side launch of CUTLASS kernels

* Added SFINAE to the `TensorRef(NonConstTensorRef const&)` constructor to avoid making it a copy-constructor for device code

* std => platform

* fix affine2

* really fix affine2

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2022-02-28 21:34:02 -05:00
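The SFINAE change mentioned in the commit above can be illustrated with a simplified reference type. This is a sketch under assumptions, not the actual `TensorRef` code: `Ref` and `NonConstRef` are hypothetical names. The idea is that a non-template constructor taking `NonConstRef const&` becomes a user-provided copy constructor whenever `T` is already non-const, which makes the type non-trivially copyable. Turning it into a constrained constructor template avoids that, because a constructor template is never a copy constructor.

```cpp
#include <type_traits>

template <typename T>
struct Ref {
  // Same type as Ref<T> when T is already non-const.
  using NonConstRef = Ref<typename std::remove_const<T>::type>;

  T* ptr;

  explicit Ref(T* p) : ptr(p) {}

  // Converting constructor Ref<T> -> Ref<const T>. It only participates in
  // overload resolution when T is const, and as a template it never replaces
  // the implicitly generated (trivial) copy constructor.
  template <typename U = T,
            typename = typename std::enable_if<std::is_const<U>::value>::type>
  Ref(NonConstRef const& other) : ptr(other.ptr) {}
};

// Both instantiations keep their trivial copy operations.
static_assert(std::is_trivially_copyable<Ref<int>>::value,
              "Ref<int> should stay trivially copyable");
static_assert(std::is_trivially_copyable<Ref<const int>>::value,
              "Ref<const int> should stay trivially copyable");

// Dropping constness is not allowed.
static_assert(!std::is_constructible<Ref<int>, Ref<const int>>::value,
              "const-to-non-const conversion should be rejected");
```

A `Ref<const int>` can still be built from a `Ref<int>`, while plain copies of either type go through the compiler-generated copy constructor.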
Ivan Komarov
e96f00586c
Make cutlass::gemm::device::GemmArray usable (#295)
* Fix the build of cutlass/gemm/device/gemm_array.h and add a demo for GemmArray

* Add a reference to GemmArray to the docs

Co-authored-by: Ivan Komarov <dfyz@yandex-team.ru>
2022-02-17 20:01:05 -05:00
Jongsoo Park
3cfa5db2a2
Actually use float accumulation in gemm_f16t_f16t_f16t_wmma_tensor_op… (#407)
* Actually use float accumulation in gemm_f16t_f16t_f16t_wmma_tensor_op_f32_sm70.cu

As title

* Update gemm_f16t_f16t_f16t_wmma_tensor_op_f32_sm70.cu

change the missing one

Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
2022-02-16 09:53:21 -05:00
Jongsoo Park
1db6971a8d
Remove unused gemm_k_iterations in GemmKernel::Params (#406)
Otherwise we get gemm_k_iterations is uninitialized warnings.
2022-02-16 09:52:45 -05:00
Haicheng Wu
b954127297
Update PUBLICATIONS.md
@jackkosaian
2022-02-14 16:54:32 -05:00
Bing Xu
d0d941efc7
[hardswish] correct implementation (#403)
* [hardswish] correct implementation

* seems working

* hardswish fp32/fp16x2 optimization

* [relu] half2 support

* add relu0; add multiply_add_relu0;

* cleanup

Co-authored-by: Bing Xu <bingxu@fb.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2022-02-09 14:28:53 -05:00
Andrew Kerr
8a951b2940
Enable convolution with fused epilogue for Volta Tensor Cores (#402)
* Enabled convolution with epilogue fusion for Volta Tensor Cores.

* Compilation fixes

* Disabled testing Volta on Ampere architectures.
2022-01-30 23:24:50 -05:00
Fujun Han
1e4703cbab
Support parallel split K mode for profiling (#277)
* Support parallel split K mode for profiling

Signed-off-by: Peter Han <fujun.han@iluvatar.ai>

* Parallel Split K support

  1. find gemm kernel by preference key
  2. switch m n for reduction kernel

Signed-off-by: Peter Han <fujun.han@iluvatar.ai>

* parallel splitk for fp16 gemm

* add one missing file

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2022-01-27 10:37:37 -05:00
Dustyn Blasig
c3353add63
Merge pull request #388 from depaulmillz/fix/headersonly
Fix utils include not being installed in header only
2022-01-26 14:22:51 -06:00
dePaul Miller
ac8825b941 Minor fix to change from LIBRARY_INIT to LIBRARY 2022-01-26 15:17:46 -05:00
Haicheng Wu
8fd94806e5
Update PUBLICATIONS.md
add mlsys 2022 paper.
2022-01-17 00:08:18 -05:00
Masahiro Masuda
d7c9cbf0b9
Fix typo in scripts/library.py (wrong data size for u8) (#393) 2022-01-07 13:29:56 -05:00
masahi
c2ee13a0fe
Add epilogue functor for residual block fusion (#391)
* Add epilogue functor for residual block fusion

* Do not run split-k tests when ActivationOp is not Identity

* explain TestSplitK param

* return early
2021-12-29 22:53:40 -05:00
Haicheng Wu
f78994bb40
add the missing pieces (#392)
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2021-12-25 04:29:54 -08:00
masahi
dceabd4c5a
Support half precision sigmoid activation (#378)
* Support half precision sigmoid activation

* introduce a vectorized variant using fast_tanh

* move the math to fast_math.h

* fixed compile

* .raw() -> .to_half()

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2021-12-22 14:45:06 -05:00
dePaul Miller
86fa1dc30b Fix utils include not being installed in header only 2021-12-21 12:10:26 -05:00
Andrew Kerr
288af365db
Added missing synchronization to avoid WAR hazards between tiles. (#386) 2021-12-20 08:34:08 -08:00
masahi
0dc3ba60b3
Refactor GELU and Sigmoid epilogue to use a common template (and add SiLu, Hardswish epilogue) (#379)
* Support half precision sigmoid activation

* introduce a vectorized variant using fast_tanh

* refactored sigmoid using the new interface

* refactored gelu

* add silu activation

* add hardswish

* remove sigmoid for now

* add description to silu and hardswish, and other doc update

* Do not ignore Round

* use constant N

* Set isHeavy = true in sigmoid and silu epilogue
2021-12-18 14:58:15 -05:00
Andrew Kerr
ec4f7e5194
Updates to fused epilogue (#383)
* Enhancements and fixes to fused GEMM and Convolution epilogue.
* Need to explicitly list cudart as unit test library dependency.
2021-12-17 16:04:43 -05:00
Andrew Kerr
4e666e1dfd
Updated README and added issue templates. (#382) 2021-12-17 09:26:20 -05:00
Haicheng Wu
3799e12f25
Merge pull request #381 from Peter9606/update-makefile-version
Update project version to 2.8.0 in CMakeLists.txt
2021-12-16 21:54:57 -05:00
Peter Han
fc3bc85db8 Update project version to 2.8.0 in CMakeLists.txt
Signed-off-by: Peter Han <fujun.han@iluvatar.ai>
2021-12-17 02:23:31 +00:00
Matthew Nicely
49c0a58d50 Set theme jekyll-theme-minimal 2021-12-15 14:51:24 -05:00
Andrew Kerr
5fe09c2d67
Updated GEMM performance plot with CUTLASS 2.8 compiled with CUDA 11.5 Toolkit (#375)
Updated GEMM performance plot with CUTLASS 2.8 compiled using CUDA 11.5 Toolkit.

GPUs under test:

    NVIDIA A100
    NVIDIA A2
    NVIDIA TitanV
    NVIDIA GeForce 2080 Ti
2021-12-06 14:21:33 -05:00
Andrew Kerr
6b69c79ac3
Fixed contributor formatting. (#365) 2021-11-22 11:30:53 -08:00
Andrew Kerr
62e438f450
Listed Matthew Nicely as the CUTLASS product manager. (#364) 2021-11-19 17:51:21 -08:00