# Quickstart

## Prerequisites

CUTLASS requires:

- NVIDIA CUDA Toolkit (9.2 or later required, [11.1](https://developer.nvidia.com/cuda-toolkit) recommended)
- CMake 3.12+
- host compiler supporting C++11 or greater (g++ 7.3.0 or Microsoft Visual Studio 2015 recommended)
- Python 3.6+

CUTLASS may be optionally compiled and linked with

- cuBLAS
- cuDNN v7.6 or later

## Initial build steps

Construct a build directory and run CMake.

```bash
$ export CUDACXX=${CUDA_INSTALL_PATH}/bin/nvcc
$ mkdir build && cd build
$ cmake .. -DCUTLASS_NVCC_ARCHS=80     # compiles for NVIDIA Ampere GPU architecture
```
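
CMake generates Makefiles by default, so once configuration completes, the enabled targets can be compiled from the same directory. A typical invocation (parallelism chosen to taste, matching the `-j12` used later in this document) might be:

```bash
$ make -j12
```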

If your goal is strictly to build only the CUTLASS Profiler and to minimize compilation time, we suggest
executing the following CMake command in an empty `build/` directory.

```bash
$ cmake .. -DCUTLASS_NVCC_ARCHS=80 -DCUTLASS_ENABLE_TESTS=OFF -DCUTLASS_UNITY_BUILD_ENABLED=ON
```

This reduces overall compilation time by excluding unit tests and enabling the unity build.

You may reduce build times by compiling only certain operations by setting the `CUTLASS_LIBRARY_OPERATIONS` flag as shown below,
executed from an empty `build/` directory. This only compiles 2-D convolution kernels.

```bash
$ cmake .. -DCUTLASS_NVCC_ARCHS=80 -DCUTLASS_LIBRARY_OPERATIONS=conv2d
```
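
By the same mechanism, the build can be limited to GEMM kernels only (a sketch, assuming `gemm` is likewise an accepted value of this flag, by analogy with `conv2d`):

```bash
$ cmake .. -DCUTLASS_NVCC_ARCHS=80 -DCUTLASS_LIBRARY_OPERATIONS=gemm
```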

You may also filter kernels by name by supplying a filter string with the flag `CUTLASS_LIBRARY_KERNELS`.

```bash
$ cmake .. -DCUTLASS_NVCC_ARCHS=80 -DCUTLASS_LIBRARY_KERNELS=s16816gemm,s16816fprop*128x128
```

See more examples of selectively compiling CUTLASS GEMM and convolution kernels [here](#example-cmake-commands).

You may explicitly exclude cuBLAS and cuDNN as dependencies with the following CMake flags, which may be combined as shown below.

- `-DCUTLASS_ENABLE_CUBLAS=OFF`
- `-DCUTLASS_ENABLE_CUDNN=OFF`
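
For example, a build configured for NVIDIA Ampere without either optional dependency might combine the flags above as follows (a sketch):

```bash
$ cmake .. -DCUTLASS_NVCC_ARCHS=80 -DCUTLASS_ENABLE_CUBLAS=OFF -DCUTLASS_ENABLE_CUDNN=OFF
```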

## Build and run the CUTLASS Profiler

From the `build/` directory created above, compile the CUTLASS Profiler.

```bash
$ make cutlass_profiler -j12
```

To execute the CUTLASS Profiler for GEMM, run the following command.

```bash
$ ./tools/profiler/cutlass_profiler --kernels=sgemm --m=4352 --n=4096 --k=4096

=============================
  Problem ID: 1

        Provider: CUTLASS
       Operation: cutlass_simt_sgemm_128x128_nn

     Disposition: Passed
          Status: Success

       Arguments: --m=4352 --n=4096 --k=4096 --A=f32:column --B=f32:column --C=f32:column --alpha=1 --beta=0  \
                  --split_k_slices=1 --batch_count=1 --op_class=simt --accum=f32 --cta_m=128 --cta_n=128 --cta_k=8  \
                  --stages=2 --warps_m=2 --warps_n=2 --warps_k=1 --inst_m=1 --inst_n=1 --inst_k=1 --min_cc=50  \
                  --max_cc=1024

           Bytes: 52428800  bytes
           FLOPs: 146064539648  flops

         Runtime: 10.5424  ms
          Memory: 4.63158 GiB/s

            Math: 13854.9 GFLOP/s
```
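
To run every GEMM operator in the library against this problem size, rather than a single named kernel, the kernel filter can be replaced by an operation filter (mirroring the `--operation=conv2d` usage shown below):

```bash
$ ./tools/profiler/cutlass_profiler --operation=gemm --m=4352 --n=4096 --k=4096
```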

To execute the CUTLASS Profiler for convolution, run the following example.

```bash
$ ./tools/profiler/cutlass_profiler --kernels=s1688fprop --n=8 --h=224 --w=224 --c=128 --k=128 --r=3 --s=3 --pad_h=1 --pad_w=1
```

To execute all CUTLASS 2-D convolution operators, execute the following.

```bash
$ ./tools/profiler/cutlass_profiler --operation=conv2d --n=8 --h=224 --w=224 --c=128 --k=128 --r=3 --s=3

=============================
  Problem ID: 1

        Provider: CUTLASS
   OperationKind: conv2d
       Operation: cutlass_simt_sfprop_optimized_128x128_8x2_nhwc

          Status: Success
    Verification: ON
     Disposition: Passed

reference_device: Passed

       Arguments: --conv_kind=fprop --n=8 --h=224 --w=224 --c=128 --k=128 --r=3 --s=3 --p=224 --q=224 --pad_h=1 --pad_w=1  \
                  --stride_h=1 --stride_w=1 --dilation_h=1 --dilation_w=1 --Activation=f32:nhwc --Filter=f32:nhwc --Output=f32:nhwc  \
                  --conv_mode=cross --iterator_algorithm=optimized --alpha=1 --beta=0 --split_k_mode=serial --split_k_slices=1  \
                  --eq_gemm_provider=none --op_class=simt --accum=f32 --cta_m=128 --cta_n=128 --cta_k=8 --stages=2 --warps_m=4  \
                  --warps_n=2 --warps_k=1 --inst_m=1 --inst_n=1 --inst_k=1 --min_cc=50 --max_cc=1024

           Bytes: 2055798784  bytes
           FLOPs: 118482796544  flops

         Runtime: 8.13237  ms
          Memory: 235.431 GiB/s

            Math: 14569.3 GFLOP/s
```
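
The profiler's full set of command-line options can be listed with the standard help flag:

```bash
$ ./tools/profiler/cutlass_profiler --help
```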

See [documentation for the CUTLASS Profiler](profiler.md) for more details.

## Build and run CUTLASS Unit Tests

From the `build/` directory created above, simply build the target `test_unit` to compile and run
all unit tests.

```bash
$ make test_unit -j
...
...
...
[----------] Global test environment tear-down
[==========] 946 tests from 57 test cases ran. (10812 ms total)
[  PASSED  ] 946 tests.
$
```

The exact number of tests run is subject to change as we add more functionality.

No tests should fail. Unit tests automatically construct the appropriate runtime filters
to avoid executing on architectures that do not support all features under test.

The unit tests are arranged hierarchically, mirroring the CUTLASS Template Library. This enables
parallelism in building and running tests and reduces compilation time when a specific
set of tests is desired.

For example, the following executes strictly the warp-level GEMM tests.

```bash
$ make test_unit_gemm_warp -j
...
...
[----------] 3 tests from SM75_warp_gemm_tensor_op_congruous_f16
[ RUN      ] SM75_warp_gemm_tensor_op_congruous_f16.128x128x8_32x128x8_16x8x8
[       OK ] SM75_warp_gemm_tensor_op_congruous_f16.128x128x8_32x128x8_16x8x8 (0 ms)
[ RUN      ] SM75_warp_gemm_tensor_op_congruous_f16.128x128x32_64x64x32_16x8x8
[       OK ] SM75_warp_gemm_tensor_op_congruous_f16.128x128x32_64x64x32_16x8x8 (2 ms)
[ RUN      ] SM75_warp_gemm_tensor_op_congruous_f16.128x128x32_32x32x32_16x8x8
[       OK ] SM75_warp_gemm_tensor_op_congruous_f16.128x128x32_32x32x32_16x8x8 (1 ms)
[----------] 3 tests from SM75_warp_gemm_tensor_op_congruous_f16 (3 ms total)
...
...
[----------] Global test environment tear-down
[==========] 104 tests from 32 test cases ran. (294 ms total)
[  PASSED  ] 104 tests.
[100%] Built target test_unit_gemm_warp
```

## Building for Multiple Architectures

To minimize compilation time, specific GPU architectures can be enabled via the CMake command,
selected by [CUDA Compute Capability](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capabilities).

**NVIDIA Ampere Architecture.**
```bash
$ cmake .. -DCUTLASS_NVCC_ARCHS=80       # compiles for NVIDIA Ampere GPU architecture
```

**NVIDIA Turing Architecture.**
```bash
$ cmake .. -DCUTLASS_NVCC_ARCHS=75       # compiles for NVIDIA Turing GPU architecture
```

**NVIDIA Volta Architecture.**
```bash
$ cmake .. -DCUTLASS_NVCC_ARCHS=70       # compiles for NVIDIA Volta GPU architecture
```

**NVIDIA Pascal Architecture.**
```bash
$ cmake .. -DCUTLASS_NVCC_ARCHS="60;61"  # compiles for NVIDIA Pascal GPU architectures
```

**NVIDIA Maxwell Architecture.**
```bash
$ cmake .. -DCUTLASS_NVCC_ARCHS="50;53"  # compiles for NVIDIA Maxwell GPU architectures
```
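
Multiple architectures may be enabled in a single build by passing a semicolon-delimited list of compute capabilities, as used elsewhere in this document:

```bash
$ cmake .. -DCUTLASS_NVCC_ARCHS="70;75;80"  # compiles for NVIDIA Volta, Turing, and Ampere GPU architectures
```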

## Clang

For experimental purposes, CUTLASS may be compiled with
[clang 8.0](https://github.com/llvm/llvm-project/releases/download/llvmorg-8.0.1/clang+llvm-8.0.1-amd64-unknown-freebsd11.tar.xz) using the
[CUDA 10.0 Toolkit](https://developer.nvidia.com/cuda-10.0-download-archive).
At this time, compiling with clang enables the CUTLASS SIMT GEMM kernels (sgemm, dgemm, hgemm, igemm)
but does not enable Tensor Cores.

```bash
$ mkdir build && cd build

$ cmake -DCUDA_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ ..

$ make test_unit -j
```

## Using CUTLASS within other applications

Applications should list [`/include`](/include) within their include paths. They must be
compiled as C++11 or greater.
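
For instance, a translation unit that uses CUTLASS might be compiled as follows (a sketch; the checkout path `/path/to/cutlass` and the target architecture are placeholders to adapt):

```bash
$ nvcc -std=c++11 -arch=sm_75 -I /path/to/cutlass/include my_app.cu -o my_app
```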

**Example:** print the contents of a variable storing half-precision data.

```c++
#include <iostream>

#include <cutlass/cutlass.h>
#include <cutlass/numeric_types.h>
#include <cutlass/core_io.h>

int main() {

  cutlass::half_t x = 2.25_hf;

  std::cout << x << std::endl;

  return 0;
}
```

## Launching a GEMM kernel in CUDA

**Example:** launch a mixed-precision GEMM targeting Turing Tensor Cores.

_Note, this example uses CUTLASS Utilities. Be sure `tools/util/include` is listed as an include path._

```c++
#include <cutlass/numeric_types.h>
#include <cutlass/gemm/device/gemm.h>

#include <cutlass/util/host_tensor.h>

int main() {

  // Define the GEMM operation
  using Gemm = cutlass::gemm::device::Gemm<
    cutlass::half_t,                // ElementA
    cutlass::layout::ColumnMajor,   // LayoutA
    cutlass::half_t,                // ElementB
    cutlass::layout::ColumnMajor,   // LayoutB
    cutlass::half_t,                // ElementOutput
    cutlass::layout::ColumnMajor,   // LayoutOutput
    float,                          // ElementAccumulator
    cutlass::arch::OpClassTensorOp, // tag indicating Tensor Cores
    cutlass::arch::Sm75             // tag indicating target GPU compute architecture
  >;

  Gemm gemm_op;
  cutlass::Status status;

  //
  // Define the problem size
  //
  int M = 512;
  int N = 256;
  int K = 128;

  float alpha = 1.25f;
  float beta = -1.25f;

  //
  // Allocate device memory
  //

  cutlass::HostTensor<cutlass::half_t, cutlass::layout::ColumnMajor> A({M, K});
  cutlass::HostTensor<cutlass::half_t, cutlass::layout::ColumnMajor> B({K, N});
  cutlass::HostTensor<cutlass::half_t, cutlass::layout::ColumnMajor> C({M, N});

  cutlass::half_t const *ptrA = A.device_data();
  cutlass::half_t const *ptrB = B.device_data();
  cutlass::half_t const *ptrC = C.device_data();
  cutlass::half_t       *ptrD = C.device_data();

  int lda = A.device_ref().stride(0);
  int ldb = B.device_ref().stride(0);
  int ldc = C.device_ref().stride(0);
  int ldd = C.device_ref().stride(0);

  //
  // Launch GEMM on the device
  //

  status = gemm_op({
    {M, N, K},
    {ptrA, lda},     // TensorRef to A device tensor
    {ptrB, ldb},     // TensorRef to B device tensor
    {ptrC, ldc},     // TensorRef to C device tensor
    {ptrD, ldd},     // TensorRef to D device tensor - may be the same as C
    {alpha, beta}    // epilogue operation arguments
  });

  if (status != cutlass::Status::kSuccess) {
    return -1;
  }

  return 0;
}
```

Note, the above could be simplified as follows using helper methods defined in `HostTensor`.
```c++
  cutlass::HostTensor<cutlass::half_t, cutlass::layout::ColumnMajor> A({M, K});
  cutlass::HostTensor<cutlass::half_t, cutlass::layout::ColumnMajor> B({K, N});
  cutlass::HostTensor<cutlass::half_t, cutlass::layout::ColumnMajor> C({M, N});

  //
  // Use the TensorRef returned by HostTensor::device_ref().
  //

  status = gemm_op({
    {M, N, K},
    A.device_ref(),  // TensorRef to A device tensor
    B.device_ref(),  // TensorRef to B device tensor
    C.device_ref(),  // TensorRef to C device tensor
    C.device_ref(),  // TensorRef to D device tensor - may be the same as C
    {alpha, beta}    // epilogue operation arguments
  });
```
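
Either variant can be compiled with a command along these lines (a sketch; the paths are placeholders, and `sm_75` matches the `cutlass::arch::Sm75` tag used above):

```bash
$ nvcc -std=c++11 -arch=sm_75 \
    -I /path/to/cutlass/include \
    -I /path/to/cutlass/tools/util/include \
    turing_gemm.cu -o turing_gemm
```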

# CUTLASS Library

The [CUTLASS Library](/tools/library) defines an API for managing and executing collections of compiled
kernel instances and launching them from host code without template instantiations in client code.

The host-side launch API is designed to be analogous to BLAS implementations for convenience, though its
kernel selection procedure is intended only to be functionally sufficient. It may not launch the
optimal tile size for a given problem. It chooses the first available kernel whose data types,
layouts, and alignment constraints satisfy the given problem. Kernel instances and a data structure
describing them are completely available to client applications which may choose to implement their
own selection logic.

[cuBLAS](https://developer.nvidia.com/cublas) offers the best performance and functional coverage
for dense matrix computations on NVIDIA GPUs.

The CUTLASS Library is used by the CUTLASS Profiler to manage kernel instances, and it is also used
by several SDK examples.

* [10_planar_complex](/examples/10_planar_complex/planar_complex.cu)
* [11_planar_complex_array](/examples/11_planar_complex_array/planar_complex_array.cu)

The CUTLASS Library defines enumerated types describing numeric data types, matrix and tensor
layouts, math operation classes, complex transformations, and more.

Client applications should specify [`tools/library/include`](/tools/library/include) in their
include paths and link against `libcutlass.so`.

The CUTLASS SDK example [10_planar_complex](/examples/10_planar_complex/CMakeLists.txt) specifies
its dependency on the CUTLASS Library with the following CMake command.

```cmake
target_link_libraries(
  10_planar_complex
  PRIVATE
  cutlass_lib
  cutlass_tools_util_includes
)
```

A sample kernel launch from host-side C++ is shown as follows.

```c++
#include "cutlass/library/library.h"
#include "cutlass/library/handle.h"

#include "cutlass/util/host_tensor.h"

int main() {

  //
  // Define the problem size
  //
  int M = 512;
  int N = 256;
  int K = 128;

  float alpha = 1.25f;
  float beta = -1.25f;

  //
  // Allocate device memory
  //

  cutlass::HostTensor<float, cutlass::layout::ColumnMajor> A({M, K});
  cutlass::HostTensor<float, cutlass::layout::ColumnMajor> B({K, N});
  cutlass::HostTensor<float, cutlass::layout::ColumnMajor> C({M, N});

  float const *ptrA = A.device_data();
  float const *ptrB = B.device_data();
  float const *ptrC = C.device_data();
  float       *ptrD = C.device_data();

  int lda = A.device_ref().stride(0);
  int ldb = B.device_ref().stride(0);
  int ldc = C.device_ref().stride(0);
  int ldd = C.device_ref().stride(0);  // D aliases C, so it shares C's layout

  //
  // CUTLASS Library call to execute device GEMM
  //

  cutlass::library::Handle handle;

  //
  // Launch GEMM on CUDA device.
  //

  cutlass::Status status = handle.gemm(
    M,
    N,
    K,

    cutlass::library::NumericTypeID::kF32,          // data type of internal accumulation
    cutlass::library::NumericTypeID::kF32,          // data type of alpha/beta scalars

    &alpha,                                         // pointer to alpha scalar

    cutlass::library::NumericTypeID::kF32,          // data type of A matrix
    cutlass::library::LayoutTypeID::kColumnMajor,   // layout of A matrix
    ptrA,                                           // pointer to A matrix in device memory
    lda,                                            // leading dimension of A matrix

    cutlass::library::NumericTypeID::kF32,          // data type of B matrix
    cutlass::library::LayoutTypeID::kColumnMajor,   // layout of B matrix
    ptrB,                                           // pointer to B matrix in device memory
    ldb,                                            // leading dimension of B matrix

    &beta,                                          // pointer to beta scalar

    cutlass::library::NumericTypeID::kF32,          // data type of C and D matrix

    ptrC,                                           // pointer to C matrix in device memory
    ldc,                                            // leading dimension of C matrix

    ptrD,                                           // pointer to D matrix in device memory
    ldd                                             // leading dimension of D matrix
  );

  if (status != cutlass::Status::kSuccess) {
    return -1;
  }

  return 0;
}
```
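
Outside of CMake, a manual compile-and-link line might look like this (a hypothetical sketch: it assumes the library was built in `build/tools/library` and is named `libcutlass.so` as noted above; adjust paths to your build tree):

```bash
$ nvcc -std=c++11 my_gemm.cu -o my_gemm \
    -I /path/to/cutlass/include \
    -I /path/to/cutlass/tools/library/include \
    -I /path/to/cutlass/tools/util/include \
    -L /path/to/cutlass/build/tools/library -lcutlass
```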

# Example CMake Commands

To instantiate all operations supporting all tile sizes, data types, and alignment constraints, specify
`-DCUTLASS_LIBRARY_KERNELS=all` when running `cmake`.
```bash
$ cmake .. -DCUTLASS_NVCC_ARCHS='70;75;80' -DCUTLASS_LIBRARY_KERNELS=all
```
The above command line generates about seven thousand kernels targeting NVIDIA Ampere, Turing, and Volta architectures.
Compiling thousands of kernels for three different architectures is time consuming. Additionally, it results in a
large binary size and, on some platforms, can cause the linker to fail when building the library.

Enabling the "unity build" instantiates multiple kernel instances in each compilation unit, thereby reducing binary size
and avoiding linker limitations on some platforms.
```bash
$ cmake .. -DCUTLASS_NVCC_ARCHS="70;75;80" -DCUTLASS_LIBRARY_KERNELS=all -DCUTLASS_UNITY_BUILD_ENABLED=ON
```

It is advised to only compile CUTLASS kernels for the NVIDIA architectures one plans on running. Furthermore, kernels
can be selectively included in the CUTLASS Library by specifying filter strings and wildcard characters when executing CMake.

Several examples are defined below for convenience. They may be combined as a comma-delimited list, as in the sketch below.
Compiling only the kernels desired reduces compilation time.
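
For instance, the Ampere Tensor Core GEMM filter and the planar complex filter from the examples below can be requested together in one configuration:

```bash
$ cmake .. -DCUTLASS_NVCC_ARCHS=80 -DCUTLASS_LIBRARY_KERNELS=tensorop*gemm,planar_complex
```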

## GEMM CMake Examples

**Example.** All GEMM kernels targeting NVIDIA Ampere Tensor Cores.
```bash
$ cmake .. -DCUTLASS_NVCC_ARCHS=80 -DCUTLASS_LIBRARY_KERNELS=tensorop*gemm
```

**Example.** All GEMM kernels targeting NVIDIA Turing Tensor Cores.
```bash
$ cmake .. -DCUTLASS_NVCC_ARCHS=75 -DCUTLASS_LIBRARY_KERNELS=tensorop*gemm
```

**Example.** All GEMM kernels with FP32 accumulation targeting NVIDIA Ampere, Turing, and Volta architectures.
```bash
$ cmake .. -DCUTLASS_NVCC_ARCHS="70;75;80" -DCUTLASS_LIBRARY_KERNELS=s*gemm
```

**Example.** All kernels which expect A and B to be column-major or row-major targeting NVIDIA Ampere, Turing, and Volta architectures.
```bash
$ cmake .. -DCUTLASS_NVCC_ARCHS="70;75;80" -DCUTLASS_LIBRARY_KERNELS=gemm*nn,gemm*tt
```

**Example.** All planar complex GEMM variants targeting NVIDIA Ampere, Turing, and Volta architectures.
```bash
$ cmake .. -DCUTLASS_NVCC_ARCHS="70;75;80" -DCUTLASS_LIBRARY_KERNELS=planar_complex
```

## Convolution CMake Examples

**Example.** All convolution kernels targeting NVIDIA Ampere's 16816 Tensor Core operation.
```bash
$ cmake .. -DCUTLASS_NVCC_ARCHS='80' -DCUTLASS_LIBRARY_KERNELS=s16816fprop,s16816dgrad,s16816wgrad
```

**Example.** All forward propagation (fprop) convolution kernels targeting CUDA Cores for multiple NVIDIA architectures.
```bash
$ cmake .. -DCUTLASS_NVCC_ARCHS='50;60;61;70;75;80' -DCUTLASS_LIBRARY_KERNELS=sfprop
```

**Example.** All forward propagation (fprop) convolution kernels with FP32 accumulation and FP16 input targeting NVIDIA Ampere's 16816 Tensor Core operation.
```bash
$ cmake .. -DCUTLASS_NVCC_ARCHS='80' -DCUTLASS_LIBRARY_KERNELS=s16816fprop_*_f16
```

**Example.** All backward weight gradient (wgrad) convolution kernels with FP32 accumulation, FP16 input, and optimized global memory iterator
targeting NVIDIA Ampere, Turing, and Volta Tensor Core operations.
```bash
$ cmake .. -DCUTLASS_NVCC_ARCHS='70;75;80' -DCUTLASS_LIBRARY_KERNELS=tensorop*s*wgrad_optimized_f16
```

# Copyright

Copyright (c) 2017-2021, NVIDIA CORPORATION. All rights reserved.

```
Redistribution and use in source and binary forms, with or without modification, are permitted
provided that the following conditions are met:
    * Redistributions of source code must retain the above copyright notice, this list of
      conditions and the following disclaimer.
    * Redistributions in binary form must reproduce the above copyright notice, this list of
      conditions and the following disclaimer in the documentation and/or other materials
      provided with the distribution.
    * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
      to endorse or promote products derived from this software without specific prior written
      permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
```