cutlass 2.4 documentation only update

This commit is contained in:
Manish Gupta 2020-11-22 18:11:37 -08:00 committed by Dustyn Blasig
parent e6bcdc60cf
commit ccb697bac7
6 changed files with 279 additions and 104 deletions


@ -8,7 +8,7 @@
* Spatial dimensions: 1-D, 2-D, and 3-D
* Layout: NHWC, NCxHWx
* Implicit GEMM convolution components:
* Global memory iterators supporting Fprop, Dgrad, and Wgrad
* `MmaMultistage` for implicit GEMM convolution for NVIDIA Ampere architecture
* `MmaPipeline` for implicit GEMM convolution for NVIDIA Volta and Turing architectures
* [Documentation](/media/docs/implicit_gemm_convolution.md) describing Implicit GEMM Convolution algorithm and implementation

README.md

@ -288,6 +288,7 @@ It can be built as follows:
```bash
$ make cutlass_profiler -j16
```
## Building all GEMM and Convolution kernels (_long_ build times)
By default, only one tile size is instantiated for each data type, math instruction, and layout.
To instantiate all, set the following environment variable when running CMake from an empty `build/` directory.
@ -298,17 +299,71 @@ $ cmake .. -DCUTLASS_NVCC_ARCHS=75 -DCUTLASS_LIBRARY_KERNELS=all
$ make cutlass_profiler -j16
```
## Building a subset of GEMM and Convolution kernels (_reduced_ build times)
To compile strictly one kernel or a small set of kernels, a comma-delimited list of kernel names with
wildcard characters may be used to reduce the set of kernels. The following examples show building exactly one
or a subset of kernels for NVIDIA Ampere and Turing architectures:
### Building a subset of Tensor Core GEMM kernels
To compile a subset of Tensor Core GEMM kernels with FP32 accumulation and FP16 input targeting the NVIDIA Ampere and Turing architectures,
use the below cmake command line:
```bash
$ cmake .. -DCUTLASS_NVCC_ARCHS='75;80' -DCUTLASS_LIBRARY_KERNELS=cutlass_tensorop_s*gemm_f16_*_nt_align8
...
$ make cutlass_profiler -j16
```
Example command line for profiling a subset of Tensor Core GEMM kernels is as follows:
```bash
./tools/profiler/cutlass_profiler --kernels=cutlass_tensorop_s*gemm_f16_*_nt_align8 --m=3456 --n=4096 --k=4096
...
=============================
Problem ID: 1
Provider: CUTLASS
OperationKind: gemm
Operation: cutlass_tensorop_s1688gemm_f16_256x128_32x2_nt_align8
Status: Success
Verification: ON
Disposition: Passed
reference_device: Passed
cuBLAS: Passed
Arguments: --gemm_kind=universal --m=3456 --n=4096 --k=4096 --A=f16:column --B=f16:row --C=f32:column --alpha=1 \
--beta=0 --split_k_slices=1 --batch_count=1 --op_class=tensorop --accum=f32 --cta_m=256 --cta_n=128 \
--cta_k=32 --stages=2 --warps_m=4 --warps_n=2 --warps_k=1 --inst_m=16 --inst_n=8 --inst_k=8 --min_cc=75 \
--max_cc=1024
Bytes: 118489088 bytes
FLOPs: 115992428544 flops
Runtime: 1.55948 ms
Memory: 70.7616 GiB/s
Math: 74378.8 GFLOP/s
=============================
...
```
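As a sanity check, the reported counters above are internally consistent. The following Python sketch reproduces them under an inferred accounting (an assumption, not stated in the CUTLASS docs): FLOPs are counted as `2*m*n*k` for the mainloop plus `2*m*n` for the epilogue, and Bytes as f16 A plus f16 B plus f32 D, since `beta=0` means C is never read.

```python
# Reproduce the profiler's reported counters for the run above.
# Assumption (inferred from the numbers, not from CUTLASS docs):
#   FLOPs = 2*m*n*k (mainloop) + 2*m*n (epilogue)
#   Bytes = f16 A + f16 B + f32 D (beta=0, so C is never read)
m, n, k = 3456, 4096, 4096
runtime_s = 1.55948e-3                       # reported Runtime: 1.55948 ms

flops = 2 * m * n * k + 2 * m * n            # matches FLOPs: 115992428544
nbytes = m * k * 2 + k * n * 2 + m * n * 4   # matches Bytes: 118489088

print(flops, nbytes)
print(flops / runtime_s / 1e9)               # ~74378.8 GFLOP/s (Math)
print(nbytes / runtime_s / 2**30)            # ~70.76 GiB/s (Memory)
```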
### Building one CUDA Core GEMM kernel
To compile one SGEMM kernel targeting the NVIDIA Ampere and Turing architectures, use the below cmake command line:
```bash
$ cmake .. -DCUTLASS_NVCC_ARCHS='75;80' -DCUTLASS_LIBRARY_KERNELS=cutlass_simt_sgemm_128x128_8x2_nn_align1
...
$ make cutlass_profiler -j16
```
Example command line for profiling a single SGEMM CUDA kernel is as follows:
```bash
$ ./tools/profiler/cutlass_profiler --kernels=sgemm --m=3456 --n=4096 --k=4096
=============================
@ -335,24 +390,69 @@ $ ./tools/profiler/cutlass_profiler --kernels=sgemm --m=3456 --n=4096 --k=4096
Memory: 24.934 GiB/s
Math: 17218.4 GFLOP/s
=============================
```
### Building a subset of Tensor Core Convolution kernels
To compile a subset of Tensor Core convolution kernels implementing forward propagation (fprop) with FP32 accumulation
and FP16 input targeting the NVIDIA Ampere and Turing architectures, use the below cmake command line:
```bash
$ cmake .. -DCUTLASS_NVCC_ARCHS='75;80' -DCUTLASS_LIBRARY_KERNELS=cutlass_tensorop_s*fprop_optimized_f16
...
$ make cutlass_profiler -j16
```
Example command line for profiling a subset of Tensor Core convolution kernels is as follows:
```bash
$ ./tools/profiler/cutlass_profiler --kernels=cutlass_tensorop_s*fprop_optimized_f16 --n=8 --h=224 --w=224 --c=128 --k=128 --r=3 --s=3
...
=============================
Problem ID: 1
Provider: CUTLASS
OperationKind: conv2d
Operation: cutlass_tensorop_s16816fprop_optimized_f16_128x128_32x5_nhwc
Status: Success
Verification: ON
Disposition: Passed
reference_device: Passed
Arguments: --conv_kind=fprop --n=8 --h=224 --w=224 --c=128 --k=128 --r=3 --s=3 --p=224 --q=224 --pad_h=1 --pad_w=1 \
--stride_h=1 --stride_w=1 --dilation_h=1 --dilation_w=1 --Activation=f16:nhwc --Filter=f16:nhwc --Output=f32:nhwc \
--conv_mode=cross --iterator_algorithm=optimized --alpha=1 --beta=0 --split_k_mode=serial --split_k_slices=1 \
--eq_gemm_provider=none --op_class=tensorop --accum=f32 --cta_m=128 --cta_n=128 --cta_k=32 --stages=5 \
--warps_m=2 --warps_n=2 --warps_k=1 --inst_m=16 --inst_n=8 --inst_k=16 --min_cc=80 --max_cc=1024
Bytes: 1130659840 bytes
FLOPs: 118482796544 flops
Runtime: 0.711496 ms
Memory: 1479.99 GiB/s
Math: 166526 GFLOP/s
=============================
...
```
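The conv2d counters can be cross-checked the same way. This Python sketch reproduces the reported FLOPs and Bytes under an inferred accounting (an assumption, not stated in the CUTLASS docs): `2*N*P*Q*K*(C*R*S + 1)` FLOPs, and one f16 activation read charged per multiply-accumulate column of the implicit GEMM, plus the f16 filter and f32 output.

```python
# Reproduce the profiler's reported counters for the conv2d run above.
# Assumptions (inferred from the reported numbers, not from CUTLASS docs):
#   FLOPs = 2*N*P*Q*K*C*R*S (mainloop) + 2*N*P*Q*K (epilogue)
#   Bytes = gathered f16 activations + f16 filter + f32 output
n, c, k, r, s = 8, 128, 128, 3, 3
p, q = 224, 224   # output extents (h=w=224, pad=1, stride=1)

flops = 2 * n * p * q * k * (c * r * s + 1)
nbytes = (n * p * q * c * r * s) * 2 + (k * c * r * s) * 2 + (n * p * q * k) * 4

print(flops)   # 118482796544, matching the reported FLOPs
print(nbytes)  # 1130659840, matching the reported Bytes
```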
### Building one Convolution CUDA kernel
To compile and run one CUDA Core convolution kernel implementing forward propagation (fprop) with FP32 accumulation
and FP32 input targeting the NVIDIA Ampere and Turing architectures, use the below cmake command line:
```bash
$ cmake .. -DCUTLASS_NVCC_ARCHS='75;80' -DCUTLASS_LIBRARY_KERNELS=cutlass_simt_sfprop_optimized_128x128_8x2_nhwc
...
$ make cutlass_profiler -j16
```
Example command line for profiling one CUDA Core convolution kernel:
```bash
$ ./tools/profiler/cutlass_profiler --kernels=cutlass_simt_sfprop_optimized_128x128_8x2_nhwc --n=8 --h=224 --w=224 --c=128 --k=128 --r=3 --s=3
@ -380,14 +480,21 @@ reference_device: Passed
Bytes: 2055798784 bytes
FLOPs: 118482796544 flops
Runtime: 7.34266 ms
Memory: 260.752 GiB/s
Math: 16136.2 GFLOP/s
=============================
```
## More Details on Compiling CUTLASS Kernels and CUTLASS Profiler
- Please follow the links for more CMake examples on selectively compiling CUTLASS kernels:
- [GEMM CMake Examples](media/docs/quickstart.md#gemm-cmake-examples)
- [Implicit GEMM convolution CMake Examples](media/docs/quickstart.md#convolution-cmake-examples)
- [Further details about the CUTLASS Profiler are described here.](media/docs/profiler.md)
# About


@ -56,14 +56,15 @@ One can find and/or create equivalent dgrad and wgrad convolutional operators.
| **Simt** | 50,60,61,70,75 | 9.2+ | `cf32 * cf32 + cf32 => cf32` | NHWC | [example](/test/unit/conv/device/conv2d_fprop_implicit_gemm_cf32nhwc_cf32nhwc_cf32nhwc_simt_f32_sm50.cu) |
| **TensorOp** | 70 | 10.1+ | `f16 * f16 + f32 => {f16, f32}`| NHWC | [example](/test/unit/conv/device/conv2d_fprop_implicit_gemm_f16nhwc_f16nhwc_f32nhwc_tensor_op_f32_sm70.cu) |
| **TensorOp** | 75 | 10.2+ | `f16 * f16 + f32 => {f16, f32}`| NHWC | [example](/test/unit/conv/device/conv2d_fprop_implicit_gemm_f16nhwc_f16nhwc_f32nhwc_tensor_op_f32_sm75.cu) |
| **TensorOp** | 75 | 10.2+ | `s8 * s8 + s32 => {s32, s8}` | NHWC, NCxHWx | [example](/test/unit/conv/device/conv2d_fprop_implicit_gemm_s8nhwc_s8nhwc_s32nhwc_tensor_op_s32_sm75.cu), [ncxhwx](/test/unit/conv/device/conv2d_fprop_implicit_gemm_s8ncxhwx_s8cxrskx_s8ncxhwx_tensor_op_s32_sm75.cu) |
| **TensorOp** | 75 | 10.2+ | `s4 * s4 + s32 => {s32, s4}` | NHWC, NCxHWx | [example](/test/unit/conv/device/conv2d_fprop_implicit_gemm_s4nhwc_s4nhwc_s32nhwc_tensor_op_s32_sm75.cu), [ncxhwx](/test/unit/conv/device/conv2d_fprop_implicit_gemm_s4ncxhwx_s4cxrskx_s4ncxhwx_tensor_op_s32_sm75.cu) |
| **Simt** | 80 | 11.0+ | `f32 * f32 + f32 => f32` | NHWC | [example](/test/unit/conv/device/conv2d_fprop_implicit_gemm_f32nhwc_f32nhwc_f32nhwc_simt_f32_sm80.cu) |
| **Simt** | 80 | 11.0+ | `cf32 * cf32 + cf32 => cf32` | NHWC | [example](/test/unit/conv/device/conv2d_fprop_implicit_gemm_cf32nhwc_cf32nhwc_cf32nhwc_simt_f32_sm80.cu) |
| **TensorOp** | 80 | 11.0+ | `f16 * f16 + f32 => {f16, f32}`| NHWC | [example](/test/unit/conv/device/conv2d_fprop_implicit_gemm_f16nhwc_f16nhwc_f32nhwc_tensor_op_f32_sm80.cu) |
| **TensorOp** | 80 | 11.0+ | `f16 * f16 + f16 => f16` | NHWC | [example](/test/unit/conv/device/conv2d_fprop_implicit_gemm_f16nhwc_f16nhwc_f32nhwc_tensor_op_f32_sm80.cu) |
| **TensorOp** | 80 | 11.0+ | `tf32 * tf32 + f32 => f32` | NHWC | [example](/test/unit/conv/device/conv2d_fprop_implicit_gemm_tf32nhwc_tf32nhwc_f32nhwc_tensor_op_f32_sm80.cu) |
| **TensorOp** | 80 | 11.0+ | `s8 * s8 + s32 => {s32, s8}` | NHWC, NCxHWx | [example](/test/unit/conv/device/conv2d_fprop_implicit_gemm_s8nhwc_s8nhwc_s32nhwc_tensor_op_s32_sm80.cu), [ncxhwx](/test/unit/conv/device/conv2d_fprop_implicit_gemm_s8ncxhwx_s8cxrskx_s8ncxhwx_tensor_op_s32_sm80.cu) |
| **TensorOp** | 80 | 11.0+ | `s4 * s4 + s32 => {s32, s4}` | NHWC, NCxHWx | [example](/test/unit/conv/device/conv2d_fprop_implicit_gemm_s4nhwc_s4nhwc_s32nhwc_tensor_op_s32_sm80.cu), [ncxhwx](/test/unit/conv/device/conv2d_fprop_implicit_gemm_s4ncxhwx_s4cxrskx_s4ncxhwx_tensor_op_s32_sm80.cu) |


@ -51,7 +51,7 @@ f(p, r) = p * stride_h + R - r - 1 + pad_h
g(q, s) = q * stride_w + S - s - 1 + pad_w
```
A [host](/tools/util/include/cutlass/util/reference/host/convolution.h) and [device](/tools/util/include/cutlass/util/reference/device/convolution.h)
reference implementation are provided in the CUTLASS Utilities.
This computation may be mapped to the elements of a matrix product as follows.
@ -347,7 +347,7 @@ creating GEMM-B tile in shared memory.
The improvements covered by optimized iterators are:
- (a) Precomputing kernel-invariant pointer deltas on the host
- (b) Computing cta-invariant mask predicates on device-side iterator ctors
- (c) Use of [fast divmod](/include/cutlass/fast_math.h) to map GEMM dimensions to convolution tensors.
For example, the _optimized_ activation iterator uses fast divmod to map GEMM _M_ to NPQ coordinates of the activation tensor.
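As an illustrative sketch (the function name below is hypothetical; the actual implementation lives in the fast divmod header and the conv2d iterator headers), the M-to-NPQ mapping amounts to two divmod operations, which FastDivmod performs on device with precomputed multiply-and-shift constants instead of hardware division:

```python
# Hypothetical sketch of the GEMM-M -> (n, p, q) mapping performed by the
# optimized fprop activation iterator. On device, CUTLASS replaces '//' and
# '%' with FastDivmod's precomputed multiply-and-shift sequences.
def gemm_m_to_npq(m_idx, P, Q):
    n, residual = divmod(m_idx, P * Q)  # which image in the batch
    p, q = divmod(residual, Q)          # row/column within that image's output
    return n, p, q

print(gemm_m_to_npq(123456, 224, 224))  # (2, 103, 32)
```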


@ -5,15 +5,17 @@
# CUTLASS Profiler
The CUTLASS Profiler is a command-line driven test and profiling environment for CUTLASS computations
defined in the CUTLASS Instance Library. The CUTLASS Profiler is capable of executing each GEMM, Sparse GEMM,
Conv2d, and Conv3d kernel.
The CUTLASS Profiler may be compiled with:
```bash
$ make cutlass_profiler -j
```
To limit compilation time, only one tile size (typically 128x128) is instantiated for each data type,
math instruction, and layout. To instantiate all sizes, set the following environment variable when running CMake from an
empty `build/` directory.
```bash
$ cmake .. -DCUTLASS_NVCC_ARCHS="70;75;80" -DCUTLASS_LIBRARY_KERNELS=all -DCUTLASS_UNITY_BUILD_ENABLED=ON
...
@ -32,82 +34,121 @@ The CUTLASS Profiler usage statement may be obtained by executing `cutlass_profi
```bash
CUTLASS Performance Tool
usage:
cutlass_profiler [options]
--help
--mode=<string> Cutlass profiler execution mode.
--mode=profile regular verification and profiling (default)
--mode=dry_run no kernels are launched or workspaces allocated
--mode=enumerate lists all operation kind and operations
--mode=trace executes a single device-side computation with
no other kernel launches
--device-info Prints information on all GPUs present in the system
--operation=<operation_kind> CUTLASS operation to profile.
--kernels=<string_list> Filter operations by kernel names. For example, call all kernels with
("s1688" and "nt") or ("s844" and "tn" and "align8") in their
operation name using --kernels="s1688*nt, s884*tn*align8"
--ignore-kernels=<string_list> Excludes kernels whose names match anything in this list.
Device:
--device=<int> CUDA Device ID
--compute-capability=<int> Override the compute capability.
--llc-capacity=<capacity in KiB> Capacity of last-level cache in kilobytes. If this is non-zero,
profiling phases cycle through different input tensors to induce
capacity misses in the L2.
Initialization:
--initialization=<bool> Enables initialization (default: true). If false, device memory is
not initialized after allocation.
--initialization-provider=<provider> Selects initialization provider {host, device*}. (default: '*')
--dist=<distribution> Data distribution of input tensors {uniform*, gaussian, identity, sequential}
--dist=uniform,min:<double>,max:<double>,scale:<integer>
--dist=gaussian,mean:<double>,stddev:<double>,scale:<integer>
--dist=sequential,start:<double>,delta:<double>,scale:<integer>
--dist=identity
--seed=<int> Random number generator seed. Used to enforce deterministic
initialization.
Library:
--library-algo-mode=<mode> Indicates algorithm mode used to call libraries such as cuBLAS and cuDNN.
mode={default*,matching,best}
--library-algos=<range-list> If --algorithm-mode=best, permits specifying a selection of algorithms.
Profiling:
--workspace-count=<workspace count> Number of discrete workspaces maintained to avoid cache-resident
If zero (default), the amount is chosen for each workload based on
capacity of the last-level cache.
--warmup-iterations=<iterations> Number of iterations to execute each kernel prior to profiling.
--profiling-iterations=<iterations> Number of iterations to profile each kernel. If zero, kernels
are launched up to the profiling duration.
--sleep-duration=<duration> Number of ms to sleep between profiling periods (ms).
--profiling-enabled=<bool> If true, profiling is actually conducted.
--providers=<providers> List of providers to be profiled for performance. (default: '*')
Gemm providers {cutlass*, cublas*}
Conv2d providers {cutlass*, cudnn*}
Verification:
--verification-enabled=<bool> Whether to perform verification checks.
--epsilon=<error> Error threshold. Setting to zero (default) requires
bit-level equivalence.
--nonzero-floor=<floor> Results whose absolute value is less than this quantity
are treated as zero for comparisons.
--save-workspace=<string> Specifies when to save the GEMM inputs and results to the filesystem.
--save-workspace=never never save workspace (default)
--save-workspace=incorrect save workspace for incorrect results
--save-workspace=always always save workspace
--verification-providers=<providers> List of providers used to verify result. (default: '*')
Gemm verification-providers {cublas*}
Conv2d verification-providers {cudnn*, device*, host}
Report:
--append=<bool> If true, result is appended to possibly existing file. Otherwise,
any existing file is overwritten.
--output=<path> Path to output file for machine readable results. Operation kind and '.csv' is appended.
--junit-output=<path> Path to junit output file for result reporting. Operation kind and '.junit.xml' is appended.
--report-not-run=<bool> If true, reports the status of all kernels including those that
do not satisfy the given arguments.
--tags=<column:tag,...> Inserts leading columns in output table and uniform values for each
column. Useful for generating pivot tables.
--verbose=<bool> If true (default), prints human-readable text to stdout.
About:
--version CUTLASS 2.4.0 built on Nov 19 2020 at 11:59:00
Operations:
--operation=<operation_name> Specifies a particular operation to run or print the usage statement.
gemm General matrix-matrix product. D = alpha * A*B + beta * C
spgemm Structured sparse GEMM. D = alpha * A*B + beta * C
@ -115,7 +156,7 @@ Operations:
conv3d Conv3d operation. Output(Tensor5D) = alpha * Input(Tensor5D) * Filter(Tensor5D) + beta * Input(Tensor5D)
For details about a particular function, specify the function name with --help.
Example:
@ -125,12 +166,15 @@ Example:
$ cutlass_profiler --operation=Conv2d --help
$ cutlass_profiler --operation=SparseGemm --help
```
# GEMM
The CUTLASS Profiler is capable of executing GEMM and Sparse GEMM problems.
The CUTLASS Profiler can be built with cuBLAS enabled to use as a reference implementation. If CMake detects
the cuBLAS library available in the system, it is included as a dependency. This may be explicitly overridden
with the CMake flag `CUTLASS_ENABLE_CUBLAS`.
## GEMM Arguments
@ -202,7 +246,7 @@ Test your changes to gemm kernels with a quick functional test and save results
--providers=cutlass --output=functional-test.csv
```
## Example CUDA Core GEMM Operation
Example command line for profiling SGEMM kernels is as follows:
```bash
@ -239,10 +283,9 @@ $ ./tools/profiler/cutlass_profiler --kernels=sgemm --m=3456 --n=4096 --k=4096
Note, the arguments which appear in the output may be used as command line parameters for subsequent invocations.
## Example Tensor Core GEMM Operations
To execute kernels targeting Tensor Core operations, supply the flag `--op_class=tensorop` in the command line.
```bash
$ ./tools/profiler/cutlass_profiler --op_class=tensorop --m=3456 --n=4096 --k=8192
@ -382,12 +425,11 @@ Profile a particular convolution (specify all the convolution parameters):
```
## Example CUDA Core Convolution Operation
Example command line for profiling forward propagation convolution kernels on CUDA cores is as follows:
```bash
$ ./tools/profiler/cutlass_profiler --kernels=simt_sfprop --verification-providers=device --n=8 --h=224 --w=224 --c=128 --k=128 --r=3 --s=3
=============================
@ -419,12 +461,11 @@ reference_device: Passed
```
## Example Tensor Core Convolution Operation
Example command line for profiling forward propagation convolution kernels running on Tensor Cores is as follows:
```bash
$ ./tools/profiler/cutlass_profiler --kernels=tensorop*fprop --verification-providers=device --n=8 --h=224 --w=224 --c=128 --k=128 --r=3 --s=3


@ -47,6 +47,7 @@ You may also filter kernels by name by supplying a filter string with flag `CUTL
```bash
$ cmake .. -DCUTLASS_NVCC_ARCHS=80 -DCUTLASS_LIBRARY_KERNELS=s16816gemm,s16816fprop*128x128
```
See more examples on selectively compiling CUTLASS GEMM and convolution kernels [here](media/docs/quickstart.md#example-cmake-commands).
You may explicitly exclude cuBLAS and cuDNN as dependencies with the following CMake flags.
- `-DCUTLASS_ENABLE_CUBLAS=OFF`
@ -87,14 +88,14 @@ $ ./tools/profiler/cutlass_profiler --kernels=sgemm --m=4352 --n=4096 --k=4096
Math: 13854.9 GFLOP/s
```
To execute the CUTLASS Profiler for convolution, run the following example.
```bash
$ ./tools/profiler/cutlass_profiler --kernels=s1688fprop --n=8 --h=224 --w=224 --c=128 --k=128 --r=3 --s=3 --pad_h=1 --pad_w=1
```
To execute all CUTLASS 2-D convolution operators, execute the following.
```bash
$ ./tools/profiler/cutlass_profiler --operation=conv2d --n=8 --h=224 --w=224 --c=128 --k=128 --r=3 --s=3
=============================
@ -462,52 +463,77 @@ int main() {
}
```
# Example CMake Commands
To instantiate all operations supporting all tile sizes, data types, and alignment constraints, specify
`-DCUTLASS_LIBRARY_KERNELS=all` when running `cmake`.
```bash
$ cmake .. -DCUTLASS_NVCC_ARCHS='70;75;80' -DCUTLASS_LIBRARY_KERNELS=all
```
The above command line generates about seven thousand kernels targeting the NVIDIA Ampere, Turing, and Volta architectures.
Compiling thousands of kernels for three different architectures is time consuming. Additionally, it produces a large
binary and, on some platforms, causes the linker to fail when building the library.
Enabling the "unity build" instantiates multiple kernel instances in each compilation unit, thereby reducing binary size
and avoiding linker limitations on some platforms.
```bash
$ cmake .. -DCUTLASS_NVCC_ARCHS="70;75;80" -DCUTLASS_LIBRARY_KERNELS=all -DCUTLASS_UNITY_BUILD_ENABLED=ON
```
It is advised to only compile CUTLASS kernels for NVIDIA architectures one plans on running. Furthermore, kernels
can be selectively included in the CUTLASS Library by specifying filter strings and wildcard characters when executing CMake.
Several examples are defined below for convenience. They may be combined as a comma-delimited list.
Compiling only the desired kernels reduces compilation time.
## GEMM CMake Examples
**Example.** All GEMM kernels targeting NVIDIA Ampere Tensor Cores.
```bash
$ cmake .. -DCUTLASS_NVCC_ARCHS=80 -DCUTLASS_LIBRARY_KERNELS=tensorop*gemm
```
**Example.** All GEMM kernels targeting NVIDIA Turing Tensor Cores.
```bash
$ cmake .. -DCUTLASS_NVCC_ARCHS=75 -DCUTLASS_LIBRARY_KERNELS=tensorop*gemm
```
**Example.** All GEMM kernels with FP32 accumulation targeting NVIDIA Ampere, Turing, and Volta architectures.
```bash
$ cmake .. -DCUTLASS_NVCC_ARCHS="70;75;80" -DCUTLASS_LIBRARY_KERNELS=s*gemm
```
**Example.** All kernels which expect A and B to be column-major or row-major targeting NVIDIA Ampere, Turing, and Volta architectures.
```bash
$ cmake .. -DCUTLASS_NVCC_ARCHS="70;75;80" -DCUTLASS_LIBRARY_KERNELS=gemm*nn,gemm*tt
```
**Example.** All planar complex GEMM variants targeting NVIDIA Ampere, Turing, and Volta architectures.
```bash
$ cmake .. -DCUTLASS_NVCC_ARCHS="70;75;80" -DCUTLASS_LIBRARY_KERNELS=planar_complex
```
## Convolution CMake Examples
**Example.** All convolution kernels targeting NVIDIA Ampere's 16816 Tensor Core operation
```bash
$ cmake .. -DCUTLASS_NVCC_ARCHS='80' -DCUTLASS_LIBRARY_KERNELS=s16816fprop,s16816dgrad,s16816wgrad
```
**Example.** All forward propagation (fprop) convolution kernels targeting CUDA Cores for multiple NVIDIA architectures
```bash
$ cmake .. -DCUTLASS_NVCC_ARCHS='50;60;61;70;75;80' -DCUTLASS_LIBRARY_KERNELS=sfprop
```
**Example.** All forward propagation (fprop) convolution kernels with FP32 accumulation and FP16 input targeting NVIDIA Ampere's 16816 Tensor Core operation
```bash
$ cmake .. -DCUTLASS_NVCC_ARCHS='80' -DCUTLASS_LIBRARY_KERNELS=s16816fprop_*_f16
```
**Example.** All backward weight gradient (wgrad) convolution kernels with FP32 accumulation, FP16 input, and optimized global memory iterator
targeting NVIDIA Ampere, Turing, and Volta Tensor Core operations
```bash
$ cmake .. -DCUTLASS_NVCC_ARCHS='70;75;80' -DCUTLASS_LIBRARY_KERNELS=tensorop*s*wgrad_optimized_f16
```
# Copyright