cutlass 2.4 documentation only update

commit ccb697bac7 (parent e6bcdc60cf)
@@ -8,7 +8,7 @@

  * Spatial dimensions: 1-D, 2-D, and 3-D
  * Layout: NHWC, NCxHWx
  * Implicit GEMM convolution components:
    * Global memory iterators supporting Fprop, Dgrad, and Wgrad
    * `MmaMultistage` for implicit GEMM convolution for NVIDIA Ampere architecture
    * `MmaPipeline` for implicit GEMM convolution for NVIDIA Volta and Turing architectures
    * [Documentation](/media/docs/implicit_gemm_convolution.md) describing Implicit GEMM Convolution algorithm and implementation
README.md
@@ -288,6 +288,7 @@ It can be built as follows:
```bash
$ make cutlass_profiler -j16
```

## Building all GEMM and Convolution kernels (_long_ build times)

By default, only one tile size is instantiated for each data type, math instruction, and layout.
To instantiate all, set the following CMake variable when running CMake from an empty `build/` directory.
@@ -298,17 +299,71 @@ $ cmake .. -DCUTLASS_NVCC_ARCHS=75 -DCUTLASS_LIBRARY_KERNELS=all
$ make cutlass_profiler -j16
```

## Building a subset of GEMM and Convolution kernels (_reduced_ build times)

To compile strictly one kernel or a small set of kernels, a comma-delimited list of kernel names with
wildcard characters may be used to reduce the set of kernels. The following examples show building exactly one
or a subset of kernels for NVIDIA Ampere and Turing architectures:

### Building a subset of Tensor Core GEMM kernels

To compile a subset of Tensor Core GEMM kernels with FP32 accumulation and FP16 input targeting NVIDIA Ampere
and Turing architectures, use the following CMake command line:
```bash
$ cmake .. -DCUTLASS_NVCC_ARCHS='75;80' -DCUTLASS_LIBRARY_KERNELS=cutlass_tensorop_s*gemm_f16_*_nt_align8
...
$ make cutlass_profiler -j16
```

Example command line for profiling a subset of Tensor Core GEMM kernels is as follows:
```bash
$ ./tools/profiler/cutlass_profiler --kernels=cutlass_tensorop_s*gemm_f16_*_nt_align8 --m=3456 --n=4096 --k=4096
...
=============================
Problem ID: 1

Provider: CUTLASS
OperationKind: gemm
Operation: cutlass_tensorop_s1688gemm_f16_256x128_32x2_nt_align8

Status: Success
Verification: ON
Disposition: Passed

reference_device: Passed
cuBLAS: Passed

Arguments: --gemm_kind=universal --m=3456 --n=4096 --k=4096 --A=f16:column --B=f16:row --C=f32:column --alpha=1 \
           --beta=0 --split_k_slices=1 --batch_count=1 --op_class=tensorop --accum=f32 --cta_m=256 --cta_n=128 \
           --cta_k=32 --stages=2 --warps_m=4 --warps_n=2 --warps_k=1 --inst_m=16 --inst_n=8 --inst_k=8 --min_cc=75 \
           --max_cc=1024

Bytes: 118489088 bytes
FLOPs: 115992428544 flops

Runtime: 1.55948 ms
Memory: 70.7616 GiB/s

Math: 74378.8 GFLOP/s

=============================
...
```

### Building one CUDA Core GEMM kernel

To compile one SGEMM kernel targeting NVIDIA Ampere and Turing architectures, use the following CMake command line:
```bash
$ cmake .. -DCUTLASS_NVCC_ARCHS='75;80' -DCUTLASS_LIBRARY_KERNELS=cutlass_simt_sgemm_128x128_8x2_nn_align1
...
$ make cutlass_profiler -j16
```

Example command line for profiling a single SGEMM CUDA kernel is as follows:
```bash
$ ./tools/profiler/cutlass_profiler --kernels=sgemm --m=3456 --n=4096 --k=4096

=============================
@@ -335,24 +390,69 @@ $ ./tools/profiler/cutlass_profiler --kernels=sgemm --m=3456 --n=4096 --k=4096

Memory: 24.934 GiB/s

Math: 17218.4 GFLOP/s

=============================
```

### Building a subset of Tensor Core Convolution kernels

To compile a subset of Tensor Core convolution kernels implementing forward propagation (fprop) with FP32 accumulation
and FP16 input targeting NVIDIA Ampere and Turing architectures, use the following CMake command line:
```bash
$ cmake .. -DCUTLASS_NVCC_ARCHS='75;80' -DCUTLASS_LIBRARY_KERNELS=cutlass_tensorop_s*fprop_optimized_f16
...
$ make cutlass_profiler -j16
```

Example command line for profiling a subset of Tensor Core convolution kernels is as follows:
```bash
$ ./tools/profiler/cutlass_profiler --kernels=cutlass_tensorop_s*fprop_optimized_f16 --n=8 --h=224 --w=224 --c=128 --k=128 --r=3 --s=3
...
=============================
Problem ID: 1

Provider: CUTLASS
OperationKind: conv2d
Operation: cutlass_tensorop_s16816fprop_optimized_f16_128x128_32x5_nhwc

Status: Success
Verification: ON
Disposition: Passed

reference_device: Passed

Arguments: --conv_kind=fprop --n=8 --h=224 --w=224 --c=128 --k=128 --r=3 --s=3 --p=224 --q=224 --pad_h=1 --pad_w=1 \
           --stride_h=1 --stride_w=1 --dilation_h=1 --dilation_w=1 --Activation=f16:nhwc --Filter=f16:nhwc --Output=f32:nhwc \
           --conv_mode=cross --iterator_algorithm=optimized --alpha=1 --beta=0 --split_k_mode=serial --split_k_slices=1 \
           --eq_gemm_provider=none --op_class=tensorop --accum=f32 --cta_m=128 --cta_n=128 --cta_k=32 --stages=5 \
           --warps_m=2 --warps_n=2 --warps_k=1 --inst_m=16 --inst_n=8 --inst_k=16 --min_cc=80 --max_cc=1024

Bytes: 1130659840 bytes
FLOPs: 118482796544 flops

Runtime: 0.711496 ms
Memory: 1479.99 GiB/s

Math: 166526 GFLOP/s

=============================
...
```

### Building one Convolution CUDA kernel

To compile and run one CUDA Core convolution kernel implementing forward propagation (fprop) with FP32 accumulation
and FP32 input targeting NVIDIA Ampere and Turing architectures, use the following CMake command line:
```bash
$ cmake .. -DCUTLASS_NVCC_ARCHS='75;80' -DCUTLASS_LIBRARY_KERNELS=cutlass_simt_sfprop_optimized_128x128_8x2_nhwc
...
$ make cutlass_profiler -j16
```

Example command line for profiling one CUDA Core convolution kernel:
```bash
$ ./tools/profiler/cutlass_profiler --kernels=cutlass_simt_sfprop_optimized_128x128_8x2_nhwc --n=8 --h=224 --w=224 --c=128 --k=128 --r=3 --s=3
@@ -380,14 +480,21 @@ reference_device: Passed

Bytes: 2055798784 bytes
FLOPs: 118482796544 flops

Runtime: 7.34266 ms
Memory: 260.752 GiB/s

Math: 16136.2 GFLOP/s

=============================
```

## More Details on Compiling CUTLASS Kernels and CUTLASS Profiler

- Please follow the links for more CMake examples on selectively compiling CUTLASS kernels:
  - [GEMM CMake Examples](media/docs/quickstart.md#gemm-cmake-examples)
  - [Implicit GEMM convolution CMake Examples](media/docs/quickstart.md#convolution-cmake-examples)
- [Further details about the CUTLASS Profiler are described here.](media/docs/profiler.md)

# About
@@ -56,14 +56,15 @@ One can find and/or create equivalent dgrad and wgrad convolutional operators.

| **Simt** | 50,60,61,70,75 | 9.2+ | `cf32 * cf32 + cf32 => cf32` | NHWC | [example](/test/unit/conv/device/conv2d_fprop_implicit_gemm_cf32nhwc_cf32nhwc_cf32nhwc_simt_f32_sm50.cu) |
| **TensorOp** | 70 | 10.1+ | `f16 * f16 + f32 => {f16, f32}`| NHWC | [example](/test/unit/conv/device/conv2d_fprop_implicit_gemm_f16nhwc_f16nhwc_f32nhwc_tensor_op_f32_sm70.cu) |
| **TensorOp** | 75 | 10.2+ | `f16 * f16 + f32 => {f16, f32}`| NHWC | [example](/test/unit/conv/device/conv2d_fprop_implicit_gemm_f16nhwc_f16nhwc_f32nhwc_tensor_op_f32_sm75.cu) |
| **TensorOp** | 75 | 10.2+ | `s8 * s8 + s32 => {s32, s8}` | NHWC, NCxHWx | [example](/test/unit/conv/device/conv2d_fprop_implicit_gemm_s8nhwc_s8nhwc_s32nhwc_tensor_op_s32_sm75.cu), [ncxhwx](/test/unit/conv/device/conv2d_fprop_implicit_gemm_s8ncxhwx_s8cxrskx_s8ncxhwx_tensor_op_s32_sm75.cu) |
| **TensorOp** | 75 | 10.2+ | `s4 * s4 + s32 => {s32, s4}` | NHWC, NCxHWx | [example](/test/unit/conv/device/conv2d_fprop_implicit_gemm_s4nhwc_s4nhwc_s32nhwc_tensor_op_s32_sm75.cu), [ncxhwx](/test/unit/conv/device/conv2d_fprop_implicit_gemm_s4ncxhwx_s4cxrskx_s4ncxhwx_tensor_op_s32_sm75.cu) |
| **Simt** | 80 | 11.0+ | `f32 * f32 + f32 => f32` | NHWC | [example](/test/unit/conv/device/conv2d_fprop_implicit_gemm_f32nhwc_f32nhwc_f32nhwc_simt_f32_sm80.cu) |
| **Simt** | 80 | 11.0+ | `cf32 * cf32 + cf32 => cf32` | NHWC | [example](/test/unit/conv/device/conv2d_fprop_implicit_gemm_cf32nhwc_cf32nhwc_cf32nhwc_simt_f32_sm80.cu) |
| **TensorOp** | 80 | 11.0+ | `f16 * f16 + f32 => {f16, f32}`| NHWC | [example](/test/unit/conv/device/conv2d_fprop_implicit_gemm_f16nhwc_f16nhwc_f32nhwc_tensor_op_f32_sm80.cu) |
| **TensorOp** | 80 | 11.0+ | `f16 * f16 + f16 => f16` | NHWC | [example](/test/unit/conv/device/conv2d_fprop_implicit_gemm_f16nhwc_f16nhwc_f32nhwc_tensor_op_f32_sm80.cu) |
| **TensorOp** | 80 | 11.0+ | `tf32 * tf32 + f32 => f32` | NHWC | [example](/test/unit/conv/device/conv2d_fprop_implicit_gemm_tf32nhwc_tf32nhwc_f32nhwc_tensor_op_f32_sm80.cu) |
| **TensorOp** | 80 | 11.0+ | `s8 * s8 + s32 => {s32, s8}` | NHWC, NCxHWx | [example](/test/unit/conv/device/conv2d_fprop_implicit_gemm_s8nhwc_s8nhwc_s32nhwc_tensor_op_s32_sm80.cu), [ncxhwx](/test/unit/conv/device/conv2d_fprop_implicit_gemm_s8ncxhwx_s8cxrskx_s8ncxhwx_tensor_op_s32_sm80.cu) |
| **TensorOp** | 80 | 11.0+ | `s4 * s4 + s32 => {s32, s4}` | NHWC, NCxHWx | [example](/test/unit/conv/device/conv2d_fprop_implicit_gemm_s4nhwc_s4nhwc_s32nhwc_tensor_op_s32_sm80.cu), [ncxhwx](/test/unit/conv/device/conv2d_fprop_implicit_gemm_s4ncxhwx_s4cxrskx_s4ncxhwx_tensor_op_s32_sm80.cu) |
@@ -51,7 +51,7 @@ f(p, r) = p * stride_h + R - r - 1 + pad_h
g(q, s) = q * stride_w + S - s - 1 + pad_w
```

[Host](/tools/util/include/cutlass/util/reference/host/convolution.h) and [device](/tools/util/include/cutlass/util/reference/device/convolution.h)
reference implementations are provided in the CUTLASS Utilities.

This computation may be mapped to the elements of a matrix product as follows.
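The exact mapping is given in the surrounding document (this hunk elides it). As a rough, deliberately naive illustration of the idea rather than the CUTLASS formulation, fprop can be read as a GEMM with M = N·P·Q, N = K, and K = R·S·C, where the A and B operands are never materialized but are formed on the fly by indexing the activation and filter tensors. The function below is a hypothetical sketch: its name, layouts (NHWC activation, KRSC filter, NPQK output), and unit-stride/unit-dilation assumption are choices made for this illustration only.

```cpp
// Hypothetical, deliberately naive reference for implicit GEMM forward propagation.
// Assumes unit stride and dilation, cross-correlation mode; out-of-bounds activation
// reads contribute zero (implicit zero padding). P and Q are the output spatial extents.
#include <vector>

void implicit_gemm_fprop_reference(
    int N, int H, int W, int C,            // activation extents (NHWC)
    int K, int R, int S,                   // filter extents (KRSC)
    int P, int Q, int pad_h, int pad_w,    // output extents and padding
    std::vector<float> const &activation,  // N*H*W*C elements
    std::vector<float> const &filter,      // K*R*S*C elements
    std::vector<float> &output) {          // N*P*Q*K elements (NPQK)

  int GemmM = N * P * Q;  // one GEMM row per output pixel
  int GemmN = K;          // one GEMM column per output channel
  int GemmK = R * S * C;  // reduction over the filter footprint

  for (int gemm_m = 0; gemm_m < GemmM; ++gemm_m) {
    // Decompose the GEMM M coordinate into (n, p, q).
    int n = gemm_m / (P * Q);
    int residual = gemm_m % (P * Q);
    int p = residual / Q;
    int q = residual % Q;

    for (int gemm_n = 0; gemm_n < GemmN; ++gemm_n) {
      float accum = 0.f;

      for (int gemm_k = 0; gemm_k < GemmK; ++gemm_k) {
        // Decompose the GEMM K coordinate into (r, s, c).
        int r = gemm_k / (S * C);
        int s = (gemm_k / C) % S;
        int c = gemm_k % C;

        int h = p + r - pad_h;
        int w = q + s - pad_w;

        float a = 0.f;  // implicit zero padding outside the activation tensor
        if (h >= 0 && h < H && w >= 0 && w < W) {
          a = activation[((n * H + h) * W + w) * C + c];
        }
        float b = filter[((gemm_n * R + r) * S + s) * C + c];
        accum += a * b;
      }

      output[((n * P + p) * Q + q) * K + gemm_n] = accum;
    }
  }
}
```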
@@ -347,7 +347,7 @@ creating GEMM-B tile in shared memory.

The improvements covered by optimized iterators are:

- (a) Precomputing kernel-invariant pointer deltas on the host
- (b) Computing cta-invariant mask predicates on device-side iterator ctors
- (c) Use of [fast divmod](/include/cutlass/fast_math.h) to map GEMM dimensions to convolution tensors.
  For example, the _optimized_ activation iterator uses fast divmod to map GEMM _M_ to NPQ (a sketch follows this list).
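As an illustrative sketch of points (a) and (c) above, and not the CUTLASS implementation: the divisors used to map GEMM _M_ to (N, P, Q) are fixed for a given problem size, so they can be folded into a parameters object once on the host, and the device-side iterator then decomposes each _M_ coordinate with them. CUTLASS's [fast divmod](/include/cutlass/fast_math.h) replaces the plain `/` and `%` shown here with multiply/shift sequences built from precomputed values; the struct and function names below are hypothetical.

```cpp
// Hypothetical sketch: host-precomputed parameters plus the M -> (n, p, q) mapping
// performed by an fprop activation iterator. The real iterators use fast divmod
// instead of integer '/' and '%', which are expensive on the GPU.
struct ActivationIteratorParams {
  int PQ;  // P * Q, computed once on the host per problem size
  int Q;
  // A real implementation would also store the fast-divmod state for PQ and Q
  // and the kernel-invariant pointer deltas mentioned in (a).
};

struct NpqCoord { int n, p, q; };

inline NpqCoord map_gemm_m_to_npq(int gemm_m, ActivationIteratorParams const &params) {
  NpqCoord coord;
  coord.n = gemm_m / params.PQ;    // image index
  int residual = gemm_m % params.PQ;
  coord.p = residual / params.Q;   // output row
  coord.q = residual % params.Q;   // output column
  return coord;
}
```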
@@ -5,15 +5,17 @@

# CUTLASS Profiler

The CUTLASS Profiler is a command-line driven test and profiling environment for CUTLASS computations
defined in the CUTLASS Instance Library. The CUTLASS Profiler is capable of executing each GEMM, Sparse GEMM,
Conv2d, and Conv3d kernel.

The CUTLASS Profiler may be compiled with:
```bash
$ make cutlass_profiler -j
```

To limit compilation time, only one tile size (typically 128x128) is instantiated for each data type,
math instruction, and layout. To instantiate all sizes, set the following CMake variable when running CMake from an
empty `build/` directory.
```bash
$ cmake .. -DCUTLASS_NVCC_ARCHS="70;75;80" -DCUTLASS_LIBRARY_KERNELS=all -DCUTLASS_UNITY_BUILD_ENABLED=ON
...
@@ -32,82 +34,121 @@ The CUTLASS Profiler usage statement may be obtained by executing `cutlass_profi

```bash
CUTLASS Performance Tool
usage:

    cutlass_profiler [options]

  --help

  --mode=<string>                      Cutlass profiler execution mode.
                                        --mode=profile    regular verification and profiling (default)
                                        --mode=dry_run    no kernels are launched or workspaces allocated
                                        --mode=enumerate  lists all operation kind and operations
                                        --mode=trace      executes a single device-side computation with
                                                          no other kernel launches

  --device-info                        Prints information on all GPUs present in the system

  --operation=<operation_kind>         CUTLASS operation to profile.

  --kernels=<string_list>              Filter operations by kernel names. For example, call all kernels with
                                       ("s1688" and "nt") or ("s844" and "tn" and "align8") in their
                                       operation name using --kernels="s1688*nt, s884*tn*align8"

  --ignore-kernels=<string_list>       Excludes kernels whose names match anything in this list.

Device:
  --device=<int>                       CUDA Device ID

  --compute-capability=<int>           Override the compute capability.

  --llc-capacity=<capacity in KiB>     Capacity of last-level cache in kilobytes. If this is non-zero,
                                       profiling phases cycle through different input tensors to induce
                                       capacity misses in the L2.

Initialization:
  --initialization=<bool>              Enables initialization (default: true). If false, device memory is
                                       not initialized after allocation.

  --initialization-provider=<provider> Selects initialization provider {host, device*}. (default: '*')

  --dist=<distribution>                Data distribution of input tensors {uniform*, gaussian, identity, sequential}
                                        --dist=uniform,min:<double>,max:<double>,scale:<integer>
                                        --dist=gaussian,mean:<double>,stddev:<double>,scale:<integer>
                                        --dist=sequential,start:<double>,delta:<double>,scale:<integer>
                                        --dist=identity

  --seed=<int>                         Random number generator seed. Used to enforce deterministic
                                       initialization.

Library:
  --library-algo-mode=<mode>           Indicates algorithm mode used to call libraries such as cuBLAS and cuDNN.
                                       mode={default*,matching,best}

  --library-algos=<range-list>         If --algorithm-mode=best, permits specifying a selection of algorithms.

Profiling:
  --workspace-count=<workspace count>  Number of discrete workspaces maintained to avoid cache-resident
                                       If zero (default), the amount is chosen for each workload based on
                                       capacity of the last-level cache.

  --profiling-iterations=<iterations>  Number of iterations to profile each kernel. If zero, kernels
                                       are launched up to the profiling duration.

  --warmup-iterations=<iterations>     Number of iterations to execute each kernel prior to profiling.

  --sleep-duration=<duration>          Number of ms to sleep between profiling periods (ms).

  --profiling-enabled=<bool>           If true, profiling is actually conducted.

  --providers=<providers>              List of providers to be profiled for performance. (default: '*')
                                       Gemm providers {cutlass*, cublas*}
                                       Conv2d providers {cutlass*, cudnn*}

Verification:
  --verification-enabled=<bool>        Whether to perform verification checks.

  --epsilon=<error>                    Error threshold. Setting to zero (default) requires
                                       bit-level equivalence.

  --nonzero-floor=<floor>              Results whose absolute value is less than this quantity
                                       are treated as zero for comparisons.

  --save-workspace=<string>            Specifies when to save the GEMM inputs and results to the filesystem.
                                        --save-workspace=never      never save workspace (default)
                                        --save-workspace=incorrect  save workspace for incorrect results
                                        --save-workspace=always     always save workspace

  --verification-providers=<providers> List of providers used to verify result. (default: '*')
                                       Gemm verification-providers {cublas*}
                                       Conv2d verification-providers {cudnn*, device*, host}

Report:
  --append=<bool>                      If true, result is appended to possibly existing file. Otherwise,
                                       any existing file is overwritten.

  --output=<path>                      Path to output file for machine readable results. Operation kind and '.csv' is appended.

  --junit-output=<path>                Path to junit output file for result reporting. Operation kind and '.junit.xml' is appended.

  --report-not-run=<bool>              If true, reports the status of all kernels including those that
                                       do not satisfy the given arguments.

  --tags=<column:tag,...>              Inserts leading columns in output table and uniform values for each
                                       column. Useful for generating pivot tables.

  --verbose=<bool>                     If true (default), prints human-readable text to stdout.

About:
  --version                            CUTLASS 2.4.0 built on Nov 19 2020 at 11:59:00

Operations:

  --operation=<operation_name>         Specifies a particular operation to run or print the usage statement.

      gemm                             General matrix-matrix product. D = alpha * A*B + beta * C
      spgemm                           Structured sparse GEMM. D = alpha * A*B + beta * C
@@ -115,7 +156,7 @@ Operations:

      conv3d                           Conv3d operation. Output(Tensor5D) = alpha * Input(Tensor5D) * Filter(Tensor5D) + beta * Input(Tensor5D)

For details about a particular function, specify the function name with --help.

Example:
@@ -125,12 +166,15 @@ Example:

  $ cutlass_profiler --operation=Conv2d --help

  $ cutlass_profiler --operation=SparseGemm --help
```

# GEMM

The CUTLASS Profiler is capable of executing GEMM and Sparse GEMM problems.

The CUTLASS Profiler can be built with cuBLAS enabled for use as a reference implementation. If CMake detects
the cuBLAS library available in the system, it is included as a dependency. This may be explicitly overridden
with the CMake flag `CUTLASS_ENABLE_CUBLAS`.

## GEMM Arguments
@@ -202,7 +246,7 @@ Test your changes to gemm kernels with a quick functional test and save results
    --providers=cutlass --output=functional-test.csv
```

## Example CUDA Core GEMM Operation

Example command line for profiling SGEMM kernels is as follows:
```bash
@@ -239,10 +283,9 @@ $ ./tools/profiler/cutlass_profiler --kernels=sgemm --m=3456 --n=4096 --k=4096

Note that the arguments which appear in the output may be used as command line parameters for subsequent invocations.

## Example Tensor Core GEMM Operations

To execute kernels targeting Tensor Core operations, supply the flag `--op_class=tensorop` in the command line.

```bash
$ ./tools/profiler/cutlass_profiler --op_class=tensorop --m=3456 --n=4096 --k=8192
@@ -382,12 +425,11 @@ Profile a particular convolution (specify all the convolution parameters):

```

## Example CUDA Core Convolution Operation

Example command line for profiling forward propagation convolution kernels on CUDA Cores is as follows:
```bash
$ ./tools/profiler/cutlass_profiler --kernels=simt_sfprop --verification-providers=device --n=8 --h=224 --w=224 --c=128 --k=128 --r=3 --s=3

=============================
@@ -419,12 +461,11 @@ reference_device: Passed

```

## Example Tensor Core Convolution Operation

Example command line for profiling forward propagation convolution kernels running on Tensor Cores is as follows:
```bash
$ ./tools/profiler/cutlass_profiler --kernels=tensorop*fprop --verification-providers=device --n=8 --h=224 --w=224 --c=128 --k=128 --r=3 --s=3
@@ -47,6 +47,7 @@ You may also filter kernels by name by supplying a filter string with flag `CUTL

```bash
$ cmake .. -DCUTLASS_NVCC_ARCHS=80 -DCUTLASS_LIBRARY_KERNELS=s16816gemm,s16816fprop*128x128
```
See more examples on selectively compiling CUTLASS GEMM and convolution kernels [here](media/docs/quickstart.md#example-cmake-commands).

You may explicitly exclude cuBLAS and cuDNN as dependencies with the following CMake flags.
- `-DCUTLASS_ENABLE_CUBLAS=OFF`

@@ -87,14 +88,14 @@ $ ./tools/profiler/cutlass_profiler --kernels=sgemm --m=4352 --n=4096 --k=4096

Math: 13854.9 GFLOP/s
```

To execute the CUTLASS Profiler for convolution, run the following example.
```bash
$ ./tools/profiler/cutlass_profiler --kernels=s1688fprop --n=8 --h=224 --w=224 --c=128 --k=128 --r=3 --s=3 --pad_h=1 --pad_w=1
```

To execute all CUTLASS 2-D convolution operators, execute the following.
```bash
$ ./tools/profiler/cutlass_profiler --operation=conv2d --n=8 --h=224 --w=224 --c=128 --k=128 --r=3 --s=3

=============================
@@ -462,52 +463,77 @@ int main() {

}
```

# Example CMake Commands

To instantiate all operations supporting all tile sizes, data types, and alignment constraints, specify
`-DCUTLASS_LIBRARY_KERNELS=all` when running `cmake`.
```bash
$ cmake .. -DCUTLASS_NVCC_ARCHS='70;75;80' -DCUTLASS_LIBRARY_KERNELS=all
```
The above command line generates about seven thousand kernels targeting NVIDIA Ampere, Turing, and Volta architectures.
Compiling thousands of kernels for three different architectures is time consuming. Additionally, it results
in a large binary size and can cause the linker to fail on some platforms when building the library.

Enabling the "unity build" instantiates multiple kernel instances in each compilation unit, thereby reducing binary size
and avoiding linker limitations on some platforms.
```bash
$ cmake .. -DCUTLASS_NVCC_ARCHS="70;75;80" -DCUTLASS_LIBRARY_KERNELS=all -DCUTLASS_UNITY_BUILD_ENABLED=ON
```

It is advised to only compile CUTLASS kernels for the NVIDIA architectures one plans on running. Furthermore, kernels
can be selectively included in the CUTLASS Library by specifying filter strings and wildcard characters when executing CMake.
Compiling only the kernels desired reduces compilation time.

Several recipes are defined below for convenience. They may be combined as a comma-delimited list.

## GEMM CMake Examples

**Example.** All GEMM kernels targeting NVIDIA Ampere Tensor Cores.
```bash
$ cmake .. -DCUTLASS_NVCC_ARCHS=80 -DCUTLASS_LIBRARY_KERNELS=tensorop*gemm
```

**Example.** All GEMM kernels targeting NVIDIA Turing Tensor Cores.
```bash
$ cmake .. -DCUTLASS_NVCC_ARCHS=75 -DCUTLASS_LIBRARY_KERNELS=tensorop*gemm
```

**Example.** All GEMM kernels with FP32 accumulation targeting NVIDIA Ampere, Turing, and Volta architectures.
```bash
$ cmake .. -DCUTLASS_NVCC_ARCHS="70;75;80" -DCUTLASS_LIBRARY_KERNELS=s*gemm
```

**Example.** All kernels which expect A and B to be column-major or row-major targeting NVIDIA Ampere, Turing, and Volta architectures.
```bash
$ cmake .. -DCUTLASS_NVCC_ARCHS="70;75;80" -DCUTLASS_LIBRARY_KERNELS=gemm*nn,gemm*tt
```

**Example.** All planar complex GEMM variants targeting NVIDIA Ampere, Turing, and Volta architectures.
```bash
$ cmake .. -DCUTLASS_NVCC_ARCHS="70;75;80" -DCUTLASS_LIBRARY_KERNELS=planar_complex
```

## Convolution CMake Examples

**Example.** All convolution kernels targeting NVIDIA Ampere's 16816 Tensor Core operation.
```bash
$ cmake .. -DCUTLASS_NVCC_ARCHS='80' -DCUTLASS_LIBRARY_KERNELS=s16816fprop,s16816dgrad,s16816wgrad
```

**Example.** All forward propagation (fprop) convolution kernels targeting CUDA Cores for multiple NVIDIA architectures.
```bash
$ cmake .. -DCUTLASS_NVCC_ARCHS='50;60;61;70;75;80' -DCUTLASS_LIBRARY_KERNELS=sfprop
```

**Example.** All forward propagation (fprop) convolution kernels with FP32 accumulation and FP16 input targeting NVIDIA Ampere's 16816 Tensor Core operation.
```bash
$ cmake .. -DCUTLASS_NVCC_ARCHS='80' -DCUTLASS_LIBRARY_KERNELS=s16816fprop_*_f16
```

**Example.** All backward weight gradient (wgrad) convolution kernels with FP32 accumulation, FP16 input, and optimized global memory iterator
targeting NVIDIA Ampere, Turing, and Volta Tensor Core operations.
```bash
$ cmake .. -DCUTLASS_NVCC_ARCHS='70;75;80' -DCUTLASS_LIBRARY_KERNELS=tensorop*s*wgrad_optimized_f16
```

# Copyright