- adds missing commas
- fixes mismatched usage of CUTLASS_DEVICE between the
template declaration and its specializations
Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
* Update CMakeLists.txt
Add 128-bit int support when using nvc++ to solve #310
@jeffhammond, would you please give it a try?
* Update CMakeLists.txt
correct a copy-paste error
* Fixed template struct/class mismatch
* Use platform implementation instead of std::abs and std::conditional during nvrtc compilation
* Revert absolute_value() usage
`CUDA_PERROR_EXIT` can lead to incorrect usage (see e.g. [this description](https://www.cs.technion.ac.il/users/yechiel/c++-faq/macros-with-if.html)) because it expands to an incomplete `if` statement. Consider:
```
if (condition)
CUDA_PERROR_EXIT(cudaFree(x))
else
free(x);
```
The author of the code forgot to add a semicolon after the macro. In that case, the `else` binds to the `if` inside the macro definition, producing code the author did not intend or expect. If the author does use a semicolon, the code will not compile, which is awkward.
The change wraps the `if` in a `do ... while (0)`, which always requires a trailing semicolon.
This PR also adds the text of the failing expression to the printed error message.
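As a hedged sketch of the fix (illustrative names and message format, not the exact CUTLASS macro), the `do { ... } while (0)` pattern looks like this:
```
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Illustrative sketch only; CUDA_PERROR_EXIT_SKETCH is a hypothetical name.
#define CUDA_PERROR_EXIT_SKETCH(expr)                               \
  do {                                                              \
    cudaError_t status_ = (expr);                                   \
    if (status_ != cudaSuccess) {                                   \
      /* #expr stringizes the failing expression for the message */ \
      fprintf(stderr, "'%s' failed: %s\n", #expr,                   \
              cudaGetErrorString(status_));                         \
      exit(EXIT_FAILURE);                                           \
    }                                                               \
  } while (0)
```
Because the macro now expands to a single statement that demands a trailing semicolon, the `if`/`else` example above parses the way its author expects.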
* add split k wgrad example
* wgrad done
* begin transposed conv2d example
* update transposed conv2d example and add ref check
* update doc for conv2d transpose example
* add license
* add wgrad doc
* more clarification on GEMM output type
* typo fix
* clean up indent
* address comments
* rename example numbers to 34 and 35
* GEMM -> Implicit GEMM
* Revert "rename example numbers to 34 and 35"
This reverts commit 551a808c227216e9e38d4472ba8ff020557b8500.
* transposed_conv2d is 34
* add compiler and device version check to exit gracefully
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
When split-k is enabled, we should set alpha to 1 and beta to 0 for the
split-k gemm kernel.
The fix was from hwu36. I only fixed some minor typos along with his
fix.
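A hedged sketch of the scaling contract this fix enforces (slice and workspace names are illustrative):
```
// Each split-K slice computes an unscaled partial product into workspace;
// the user's alpha/beta are applied only in the reduction that combines
// the slices:
//
//   slice i   : W_i = 1.0f * (A_i * B_i) + 0.0f * C
//   reduction : D   = alpha * (W_0 + W_1 + ...) + beta * C
```
Applying the user's alpha/beta in the per-slice GEMM would double-scale the result after reduction and add beta * C once per slice.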
* Fix separate compilation `-dc`
- when CUTLASS is included in multiple compilation units
compiled with `-dc`, the OOB_NAN_F16x8 device constant is
instantiated multiple times, causing a
"Multiple definition of '_ZN7cutlass4arch13OOB_NAN_F16x8E'" error.
This PR makes this variable a local constant, as it is not
modified at runtime.
Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
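A hedged before/after sketch of this fix (the bit pattern and function wrapper are illustrative, not the exact CUTLASS code):
```
#include <cuda_runtime.h>

// Before: a namespace-scope __device__ object defined in a header is
// emitted once per translation unit, so linking several -dc objects fails:
//   __device__ uint4 OOB_NAN_F16x8 = ...;   // multiple definition at link
//
// After: a function-local constant emits no device symbol, so separate
// compilation links cleanly. 0x7eff is an fp16 NaN bit pattern.
__device__ uint4 oob_nan_f16x8() {
  uint4 const OOB_NAN_F16x8 =
      make_uint4(0x7eff7eff, 0x7eff7eff, 0x7eff7eff, 0x7eff7eff);
  return OOB_NAN_F16x8;
}
```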
* Fix
Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
* Test GH
Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
* Revert test GH
Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
* Removed trivial copy constructors on parameter classes to enable device-side launch of CUTLASS kernels
* Added SFINAE to the `TensorRef(NonConstTensorRef const&)` constructor to avoid making it a copy-constructor for device code
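A hedged, standard-C++ sketch of the SFINAE trick (simplified types, `std::` in place of `platform::`, illustrative names):
```
#include <type_traits>

template <typename Element>
struct Ref {
  Element* ptr = nullptr;

  using NonConstRef = Ref<typename std::remove_const<Element>::type>;

  Ref() = default;

  // Enabled only when Element is const, i.e. when NonConstRef is a distinct
  // type. For non-const Element this constructor vanishes, so the implicit
  // trivial copy constructor survives and the type stays trivially
  // copyable -- a requirement for device-side kernel launch parameters.
  template <typename R = NonConstRef,
            typename std::enable_if<!std::is_same<R, Ref>::value, int>::type = 0>
  Ref(NonConstRef const& other) : ptr(other.ptr) {}
};
```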
* std => platform
* fix affine2
* really fix affine2
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
* Fix the build of cutlass/gemm/device/gemm_array.h and add a demo for GemmArray
* Add a reference to GemmArray to the docs
Co-authored-by: Ivan Komarov <dfyz@yandex-team.ru>
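A hedged usage sketch of `GemmArray` (the `Arguments` order is an assumption modeled on the batched GEMM API; the new demo is authoritative):
```
#include "cutlass/gemm/device/gemm_array.h"

// Hypothetical wrapper: one GEMM per batch entry, with per-batch pointer
// arrays resident on the device and shared leading dimensions.
cutlass::Status batched_sgemm(
    int m, int n, int k,
    float const* const* ptr_A, int lda,
    float const* const* ptr_B, int ldb,
    float* const* ptr_C, int ldc,
    float alpha, float beta, int batch_count) {

  using GemmArray = cutlass::gemm::device::GemmArray<
      float, cutlass::layout::ColumnMajor,   // A
      float, cutlass::layout::ColumnMajor,   // B
      float, cutlass::layout::ColumnMajor>;  // C / D

  GemmArray gemm_op;
  return gemm_op({{m, n, k},
                  ptr_A, lda,
                  ptr_B, ldb,
                  ptr_C, ldc,
                  ptr_C, ldc,   // D aliases C in this sketch
                  {alpha, beta},
                  batch_count});
}
```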
* Actually use float accumulation in gemm_f16t_f16t_f16t_wmma_tensor_op_f32_sm70.cu
As title
* Update gemm_f16t_f16t_f16t_wmma_tensor_op_f32_sm70.cu
change the one that was missed
Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
* Support parallel split-K mode for profiling
Signed-off-by: Peter Han <fujun.han@iluvatar.ai>
* Parallel Split K support
1. find GEMM kernel by preference key
2. switch M and N for the reduction kernel
Signed-off-by: Peter Han <fujun.han@iluvatar.ai>
* parallel splitk for fp16 gemm
* add one missing file
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
* Add epilogue functor for residual block fusion
* Do not run split-k tests when ActivationOp is not Identity
* explain TestSplitK param
* return early
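A hedged scalar sketch of what a residual-block-fusion epilogue computes (the real functor is templated and vectorized; ReLU stands in for the ActivationOp):
```
// Per output element, roughly:
//   D = ActivationOp(alpha * accum + beta * C) + residual
// fusing the activation and the residual add into the GEMM epilogue
// instead of separate elementwise kernels.
float residual_epilogue(float accum, float c, float residual,
                        float alpha, float beta) {
  float scaled = alpha * accum + beta * c;
  float activated = scaled > 0.0f ? scaled : 0.0f;  // ReLU as ActivationOp
  return activated + residual;
}
```
This is presumably also why the split-k tests above are skipped for non-Identity activations: a non-linear activation applied to partial sums does not commute with the split-K reduction.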
* Support half precision sigmoid activation
* introduce a vectorized variant using fast_tanh
* move the math to fast_math.h
* fixed compile
* .raw() -> .to_half()
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
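The vectorized variant leans on the identity sigmoid(x) = (1 + tanh(x/2)) / 2; a hedged scalar CUDA sketch, with `tanhf` standing in for CUTLASS's `fast_tanh`:
```
#include <cuda_runtime.h>

// sigmoid(x) = 1 / (1 + expf(-x)) == 0.5f * (1.0f + tanhf(0.5f * x)),
// so a fast tanh approximation replaces the exp-and-divide.
__device__ float sigmoid_via_tanh(float x) {
  return 0.5f * (1.0f + tanhf(0.5f * x));
}
```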