# CUTLASS Profiler

The CUTLASS Profiler is a command-line driven test and profiling environment for CUTLASS computations
defined in the CUTLASS Instance Library. The CUTLASS Profiler is capable of executing each GEMM, Sparse GEMM,
Conv2d, and Conv3d kernel.

The CUTLASS Profiler may be compiled with:
```bash
$ make cutlass_profiler -j
```

To limit compilation time, only one tile size (typically 128x128) is instantiated for each data type,
math instruction, and layout. To instantiate all sizes, set the following CMake options when running CMake from an
empty `build/` directory.
```bash
$ cmake .. -DCUTLASS_NVCC_ARCHS="70;75;80" -DCUTLASS_LIBRARY_KERNELS=all -DCUTLASS_UNITY_BUILD_ENABLED=ON
...
$ make cutlass_profiler -j
```
Enabling the unity build places multiple kernel instances in one compilation unit, thereby reducing the size of the compiled
binary and avoiding linker limitations on some platforms.

The CUTLASS Profiler sources are stored in
```bash
tools/
  profiler/
```

The CUTLASS Profiler usage statement may be obtained by executing `cutlass_profiler --help` and appears as follows.
```bash
CUTLASS Performance Tool
usage:

    cutlass_profiler [options]

  --help

  --mode=<string>                                  Cutlass profiler execution mode.
    --mode=profile                                 regular verification and profiling (default)
    --mode=dry_run                                 no kernels are launched or workspaces allocated
    --mode=enumerate                               lists all operation kind and operations
    --mode=trace                                   executes a single device-side computation with
                                                   no other kernel launches

  --device-info                                    Prints information on all GPUs present in the system

  --operation=<operation_kind>                     CUTLASS operation to profile.

  --kernels=<string_list>                          Filter operations by kernel names. For example, call all kernels with
                                                   ("s1688" and "nt") or ("s844" and "tn" and "align8") in their
                                                   operation name using --kernels="s1688*nt, s884*tn*align8"

  --ignore-kernels=<string_list>                   Excludes kernels whose names match anything in this list.

Device:
  --device=<int>                                   CUDA Device ID

  --compute-capability=<int>                       Override the compute capability.

  --llc-capacity=<capacity in KiB>                 Capacity of last-level cache in kilobytes. If this is non-zero,
                                                   profiling phases cycle through different input tensors to induce
                                                   capacity misses in the L2.

Initialization:
  --initialization=<bool>                          Enables initialization (default: true). If false, device memory is
                                                   not initialized after allocation.

  --initialization-provider=<provider>             Selects initialization provider {host, device*}. (default: '*')

  --dist=<distribution>                            Data distribution of input tensors {uniform*, gaussian, identity, sequential}
                                                   --dist=uniform,min:<double>,max:<double>,scale:<integer>
                                                   --dist=gaussian,mean:<double>,stddev:<double>,scale:<integer>
                                                   --dist=sequential,start:<double>,delta:<double>,scale:<integer>
                                                   --dist=identity

  --seed=<int>                                     Random number generator seed. Used to enforce deterministic
                                                   initialization.

Library:
  --library-algo-mode=<mode>                       Indicates algorithm mode used to call libraries such as cuBLAS and cuDNN.
                                                   mode={default*,matching,best}

  --library-algos=<range-list>                     If --algorithm-mode=best, permits specifying a selection of algorithms.

Profiling:
  --workspace-count=<workspace count>              Number of discrete workspaces maintained to avoid cache-resident
                                                   If zero (default), the amount is chosen for each workload based on
                                                   capacity of the last-level cache.

  --profiling-iterations=<iterations>              Number of iterations to profile each kernel. If zero, kernels
                                                   are launched up to the profiling duration.

  --warmup-iterations=<iterations>                 Number of iterations to execute each kernel prior to profiling.

  --sleep-duration=<duration>                      Number of ms to sleep between profiling periods (ms).

  --profiling-enabled=<bool>                       If true, profiling is actually conducted.

Verification:
  --verification-enabled=<bool>                    Whether to perform verification checks.

  --epsilon=<error>                                Error threshold. Setting to zero (default) requires
                                                   bit-level equivalence.

  --nonzero-floor=<floor>                          Results whose absolute value is less than this quantity
                                                   are treated as zero for comparisons.

  --save-workspace=<string>                        Specifies when to save the GEMM inputs and results to the filesystem.
    --save-workspace=never                         never save workspace (default)
    --save-workspace=incorrect                     save workspace for incorrect results
    --save-workspace=always                        always save workspace

  --verification-providers=<providers>             List of providers used to verify result. (default: '*')
                                                   Gemm verification-providers {cublas*}
                                                   Conv2d verification-providers {cudnn*, device*, host}

Report:
  --append=<bool>                                  If true, result is appended to possibly existing file. Otherwise,
                                                   any existing file is overwritten.

  --output=<path>                                  Path to output file for machine readable results. Operation kind and '.csv' is appended.

  --junit-output=<path>                            Path to junit output file for result reporting. Operation kind and '.junit.xml' is appended.

  --report-not-run=<bool>                          If true, reports the status of all kernels including those that
                                                   do not satisfy the given arguments.

  --tags=<column:tag,...>                          Inserts leading columns in output table and uniform values for each
                                                   column. Useful for generating pivot tables.

  --verbose=<bool>                                 Prints human-readable text to stdout. If false, nothing is written to stdout.

About:
  --version                                        CUTLASS 2.4.0 built on Nov 19 2020 at 11:59:00

Operations:

  gemm                                             General matrix-matrix product. D = alpha * A*B + beta * C
  spgemm                                           Structured sparse GEMM. D = alpha * A*B + beta * C
  conv2d                                           Conv2d operation. Output(Tensor4D) = alpha * Input(Tensor4D) * Filter(Tensor4D) + beta * Input(Tensor4D)
  conv3d                                           Conv3d operation. Output(Tensor5D) = alpha * Input(Tensor5D) * Filter(Tensor5D) + beta * Input(Tensor5D)

For details about a particular function, specify the function name with --help.

Example:

  $ cutlass_profiler --operation=Gemm --help

  $ cutlass_profiler --operation=Conv3d --help

  $ cutlass_profiler --operation=Conv2d --help

```

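The `--kernels` filter in the usage statement can be pictured with ordinary shell globbing. The sketch below is not the profiler's implementation. It assumes that each comma-separated pattern may match anywhere in the operation name and that `*` matches any run of characters; the helper `matches_kernels` is hypothetical.

```bash
# Hypothetical helper: succeeds if NAME matches any pattern in the comma-separated LIST.
# Assumption: a pattern may match anywhere in the name, and '*' is a wildcard.
matches_kernels() {
  name=$1
  list=$2
  old_ifs=$IFS
  IFS=','
  set -f                                  # keep '*' from expanding to file names
  for pattern in $list; do
    pattern=$(printf '%s' "$pattern" | tr -d ' ')   # tolerate "a, b" spacing
    case $name in
      *$pattern*) IFS=$old_ifs; set +f; return 0 ;;
    esac
  done
  IFS=$old_ifs
  set +f
  return 1
}

# Operation names taken from the example reports in this document.
matches_kernels cutlass_simt_sgemm_128x128_8x2_nn_align1 "sgemm" && echo "selected"
matches_kernels cutlass_simt_sgemm_128x128_8x2_nn_align1 "s1688*nt, s884*tn*align8" || echo "filtered out"
```

Under these assumptions the first call prints `selected` and the second prints `filtered out`, which mirrors how a pattern list narrows the set of kernels to run.
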
# GEMM

The CUTLASS Profiler is capable of executing GEMM and Sparse GEMM problems.

The CUTLASS Profiler can be built with cuBLAS enabled to use as a reference implementation. If CMake detects
the cuBLAS library available in the system, it is included as a dependency. This may be explicitly overridden
with CMake flag `CUTLASS_ENABLE_CUBLAS`.

## GEMM Arguments

The complete set of arguments available to each operation may be viewed by specifying the operation name
in addition to `--help`. The argument flags and their aliases usable for GEMM appear as follows.

```bash
$ ./tools/profiler/cutlass_profiler --operation=gemm --help

GEMM

  [enum]    --Gemm_kind                             Variant of GEMM (e.g. gemm, batched, ...)
  [int]     --m,--problem-size::m                   M dimension of the GEMM problem space
  [int]     --n,--problem-size::n                   N dimension of the GEMM problem space
  [int]     --k,--problem-size::k                   K dimension of the GEMM problem space
  [tensor]  --A                                     Tensor storing the A operand
  [tensor]  --B                                     Tensor storing the B operand
  [tensor]  --C                                     Tensor storing the C operand
  [scalar]  --alpha,--epilogue::alpha               Epilogue scalar alpha
  [scalar]  --beta,--epilogue::beta                 Epilogue scalar beta
  [int]     --split_k_slices                        Number of partitions of K dimension
  [int]     --batch_count                           Number of GEMMs computed in one batch
  [enum]    --op_class,--opcode-class               Class of math instruction (SIMT or TensorOp).
  [enum]    --accum,--accumulator-type              Math instruction accumulator data type.
  [int]     --cta_m,--threadblock-shape::m          Threadblock shape in the M dimension.
  [int]     --cta_n,--threadblock-shape::n          Threadblock shape in the N dimension.
  [int]     --cta_k,--threadblock-shape::k          Threadblock shape in the K dimension.
  [int]     --stages,--threadblock-stages           Number of stages of threadblock-scoped matrix multiply.
  [int]     --warps_m,--warp-count::m               Number of warps within threadblock along the M dimension.
  [int]     --warps_n,--warp-count::n               Number of warps within threadblock along the N dimension.
  [int]     --warps_k,--warp-count::k               Number of warps within threadblock along the K dimension.
  [int]     --inst_m,--instruction-shape::m         Math instruction shape in the M dimension.
  [int]     --inst_n,--instruction-shape::n         Math instruction shape in the N dimension.
  [int]     --inst_k,--instruction-shape::k         Math instruction shape in the K dimension.
  [int]     --min_cc,--minimum-compute-capability   Minimum device compute capability.
  [int]     --max_cc,--maximum-compute-capability   Maximum device compute capability.

Examples:

Profile a particular problem size:
  $ ./tools/profiler/cutlass_profiler --operation=Gemm --m=1024 --n=1024 --k=128

Schmoo over problem size and beta:
  $ ./tools/profiler/cutlass_profiler --operation=Gemm --m=1024:4096:256 --n=1024:4096:256 --k=128:8192:128 --beta=0,1,2

Schmoo over accumulator types:
  $ ./tools/profiler/cutlass_profiler --operation=Gemm --accumulator-type=f16,f32

Run when A is f16 with column-major and B is any datatype with row-major
(For column major, use column, col, or n. For row major use, row or t):

  $ ./tools/profiler/cutlass_profiler --operation=Gemm --A=f16:column --B=*:row

Using various input value distribution:
  $ ./tools/profiler/cutlass_profiler --operation=Gemm --dist=uniform,min:0,max:3
  $ ./tools/profiler/cutlass_profiler --operation=Gemm --dist=gaussian,mean:0,stddev:3
  $ ./tools/profiler/cutlass_profiler --operation=Gemm --dist=sequential,start:0,delta:1

Run a kernel with cta tile size of 256x128x32 and save workspace if results are incorrect
(note that --cta-tile::k=32 is default cta-tile size):
  $ ./tools/profiler/cutlass_profiler --operation=Gemm --cta_m=256 --cta_n=128 --cta_k=32 --save-workspace=incorrect

Test your changes to gemm kernels with a quick functional test and save results in functional-test.csv:
  $ ./tools/profiler/cutlass_profiler --operation=Gemm \
    --m=8,56,120,136,256,264,512,520,1024,1032,4096,8192,16384 \
    --n=8,56,120,136,256,264,512,520,1024,1032,4096,8192,16384 \
    --k=8,16,32,64,128,256,288,384,504,512,520 \
    --beta=0,1,2 --profiling-iterations=1 \
    --output=functional-test.csv
```

## Example CUDA Core GEMM Operation

Example command line for profiling SGEMM kernels is as follows:
```bash
$ ./tools/profiler/cutlass_profiler --kernels=sgemm --m=3456 --n=4096 --k=4096

=============================
  Problem ID: 1

        Provider: CUTLASS
   OperationKind: gemm
       Operation: cutlass_simt_sgemm_128x128_8x2_nn_align1

          Status: Success
    Verification: ON
     Disposition: Passed

          cuBLAS: Passed

       Arguments: --m=3456 --n=4096 --k=4096 --A=f32:column --B=f32:column --C=f32:column --alpha=1 --beta=0 --split_k_slices=1 \
                  --batch_count=1 --op_class=simt --accum=f32 --cta_m=128 --cta_n=128 --cta_k=8 --stages=2 --warps_m=4 \
                  --warps_n=2 --warps_k=1 --inst_m=1 --inst_n=1 --inst_k=1 --min_cc=50 --max_cc=1024

           Bytes: 180355072 bytes
           FLOPs: 115992428544 flops

         Runtime: 6.73655 ms
          Memory: 24.934 GiB/s

            Math: 17218.4 GFLOP/s
```

Note that the arguments appearing in the output may be used as command-line parameters for subsequent invocations.

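Because the `Arguments:` field can be fed back to the profiler, it is convenient to pull it out of a saved report as a single line. The following is a minimal sketch, assuming the report was captured as plain text in the layout shown above; the helper name `extract_arguments` is hypothetical and not part of CUTLASS.

```bash
# Hypothetical helper: read a profiler report on stdin and print its
# "Arguments:" field as one line, joining backslash-continued lines.
extract_arguments() {
  awk '
    /^[ \t]*Arguments:/ { collecting = 1; sub(/^[ \t]*Arguments:[ \t]*/, "") }
    collecting {
      more = sub(/[ \t]*\\$/, "")        # a trailing backslash means the field continues
      gsub(/^[ \t]+/, "")                # drop the alignment indent
      line = (line == "" ? $0 : line " " $0)
      if (!more) { print line; exit }
    }
  '
}

# Report excerpt copied from the SGEMM example above.
args=$(extract_arguments <<'EOF'
       Arguments: --m=3456 --n=4096 --k=4096 --A=f32:column --B=f32:column --C=f32:column --alpha=1 --beta=0 --split_k_slices=1 \
                  --batch_count=1 --op_class=simt --accum=f32 --cta_m=128 --cta_n=128 --cta_k=8 --stages=2 --warps_m=4 \
                  --warps_n=2 --warps_k=1 --inst_m=1 --inst_n=1 --inst_k=1 --min_cc=50 --max_cc=1024
EOF
)

# Print the command that would repeat the run (it is only echoed here, not executed).
echo "./tools/profiler/cutlass_profiler $args"
```
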
## Example Tensor Core GEMM Operations

To execute kernels targeting Tensor Core operations, supply the flag `--op_class=tensorop` in the command line.
```bash
$ ./tools/profiler/cutlass_profiler --op_class=tensorop --m=3456 --n=4096 --k=8192

=============================
  Problem ID: 1

        Provider: CUTLASS
   OperationKind: gemm
       Operation: cutlass_tensorop_s16816gemm_f16_256x128_32x3_nn_align8

          Status: Success
    Verification: ON
     Disposition: Passed

          cuBLAS: Passed

       Arguments: --m=3456 --n=4096 --k=8192 --A=f16:column --B=f16:column --C=f32:column --alpha=1 --beta=0 --split_k_slices=1 \
                  --batch_count=1 --op_class=tensorop --accum=f32 --cta_m=256 --cta_n=128 --cta_k=32 --stages=3 --warps_m=4 \
                  --warps_n=2 --warps_k=1 --inst_m=16 --inst_n=8 --inst_k=16 --min_cc=80 --max_cc=1024

           Bytes: 180355072 bytes
           FLOPs: 231956545536 flops

         Runtime: 0.98647 ms
          Memory: 170.272 GiB/s

            Math: 235138 GFLOP/s
```

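The `Math` and `Memory` fields of a report follow from the `FLOPs`, `Bytes`, and `Runtime` fields, which is handy when post-processing saved results. The sketch below recomputes them for the Tensor Core report in this section; the helpers `gflops` and `gibps` are hypothetical, and the FLOP count formula is an observation that fits this report rather than a documented definition.

```bash
# Hypothetical helpers that recompute the derived report fields from the raw ones.
gflops() {   # gflops FLOPS RUNTIME_MS
  awk -v f="$1" -v t="$2" 'BEGIN { printf "%.0f\n", f / (t * 1e6) }'
}
gibps() {    # gibps BYTES RUNTIME_MS
  awk -v b="$1" -v t="$2" 'BEGIN { printf "%.1f\n", b / 1073741824 / (t / 1000) }'
}

# Values copied from the Tensor Core GEMM report in this section.
gflops 231956545536 0.98647     # prints 235138, matching the reported Math field
gibps 180355072 0.98647         # prints 170.3, consistent with the reported 170.272 GiB/s

# Observation from this report, not a documented formula: FLOPs = 2*m*n*(k+1).
echo $(( 2 * 3456 * 4096 * (8192 + 1) ))    # prints 231956545536
```

The last digits can differ slightly from the report because the printed runtime is itself rounded.
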
## Covering the problem space

All arguments may have single values or a comma-delimited set of values. Integers may also be specified
as an inclusive range with the following syntax `start:end:increment` or simply `start:end`.

For example, the following sweeps over the range of the GEMM K dimension from 8 to 4096 in increments
of 8 elements.

```bash
$ ./tools/profiler/cutlass_profiler --kernels=cutlass_simt_sgemm_128x128_nn --m=4352 --n=4096 --k=8:4096:8
```

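Before launching a sweep it can help to know how many problem sizes a range expands to. The following sketch expands the inclusive `start:end:increment` syntax outside the profiler; `expand_range` is a hypothetical helper, and it assumes the increment defaults to 1 when omitted.

```bash
# Hypothetical helper: expand "start:end:increment" (or "start:end") into its values.
# Assumption: the increment defaults to 1 when it is omitted.
expand_range() {
  printf '%s\n' "$1" | awk -F: '{
    first = $1 + 0
    last  = (NF >= 2 ? $2 : $1) + 0
    step  = (NF >= 3 ? $3 : 1) + 0
    for (v = first; v <= last; v += step) print v   # inclusive of the end value
  }'
}

expand_range 1024:4096:1024 | tr '\n' ' '   # prints: 1024 2048 3072 4096
echo
expand_range 8:4096:8 | wc -l               # 512 values of K in the sweep above
```

When several arguments are swept at once, as in the earlier `--m`, `--n`, `--k`, and `--beta` example, multiplying the per-argument counts gives the number of problems in the sweep.
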
## Output

By default, runtime and computed GFLOP/s are reported for each operation and problem size. Additionally,
a table of comma-separated values is reported at the end of the execution. This may be output to a file
with the `--output=<filename.csv>` command line option as shown:

```bash
$ ./tools/profiler/cutlass_profiler --kernels=cutlass_simt_sgemm_128x128_nn \
  --m=3456 --n=4096 --k=8:4096:8 --output=report.csv
```

To facilitate generation of pivot tables and charts, additional columns may be prepended with the
`--tags=<column>:<value>` option. One or more tags may be specified using a comma-delimited list.

```bash
$ ./tools/profiler/cutlass_profiler --kernels=cutlass_simt_sgemm_128x128_nn \
  --m=3456 --n=4096 --k=8:4096:8 --output=report.csv \
  --tags=cutlass:2.2,date:2020-06-08
```

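Once a CSV report exists, a few lines of `awk` are enough to find the best-performing row. The sketch below is illustrative only: `best_row` is a hypothetical helper, the column names and numbers in the sample file are placeholders rather than the profiler's real header or measurements, and fields are assumed not to contain quoted commas.

```bash
# Hypothetical helper: print the row with the largest value in a named CSV column.
# Usage: best_row FILE COLUMN_NAME
best_row() {
  awk -F, -v col="$2" '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == col) idx = i; next }   # locate the column
    idx && ($idx + 0 > best || !seen) { best = $idx + 0; row = $0; seen = 1 }
    END { if (seen) print row }
  ' "$1"
}

# Placeholder data standing in for a real report (made-up header and values).
report=$(mktemp)
cat > "$report" <<'EOF'
Operation,k,Runtime,GFLOPs
kernel_a,8,0.50,100
kernel_b,8,0.25,200
kernel_a,16,0.75,150
EOF

best_row "$report" GFLOPs      # prints: kernel_b,8,0.25,200
rm -f "$report"
```

Passing the column name as an argument keeps the helper independent of the exact header the profiler writes; check the first line of the generated file for the real names.
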
# Convolution

The CUTLASS Profiler is capable of executing 2-D and 3-D convolution problems for forwards and backwards
operator variants.

The CUTLASS Profiler can be built with cuDNN enabled to use as a reference implementation. If CMake detects
the cuDNN library available in the system, it is included as a dependency. This may be explicitly overridden
with CMake flag `CUTLASS_ENABLE_CUDNN`.

```bash
$ cmake .. -DCUTLASS_LIBRARY_OPERATIONS=conv2d -DCUTLASS_ENABLE_CUDNN=OFF
...
$ make -j16 cutlass_profiler
```

## Convolution Arguments

```bash
$ ./tools/profiler/cutlass_profiler --help --operation=Conv2d

Conv2d

  [enum]    --conv_kind                             Convolutional operator (fprop, dgrad, wgrad)
  [int]     --n,--input_n                           Input N dimension of the Conv2d problem space
  [int]     --h,--input_h                           Input H dimension of the Conv2d problem space
  [int]     --w,--input_w                           Input W dimension of the Conv2d problem space
  [int]     --c,--input_c                           Input C dimension of the Conv2d problem space
  [int]     --k,--filter_k                          Filter K dimension of the Conv2d problem space
  [int]     --r,--filter_r                          Filter R dimension of the Conv2d problem space
  [int]     --s,--filter_s                          Filter S dimension of the Conv2d problem space
  [int]     --p,--output_p                          Output P dimension of the Conv2d problem space
  [int]     --q,--output_q                          Output Q dimension of the Conv2d problem space
  [int]     --pad_h                                 Padding in H direction
  [int]     --pad_w                                 Padding in W direction
  [int]     --stride_h                              Stride in H direction
  [int]     --stride_w                              Stride in W direction
  [int]     --dilation_h                            Dilation in H direction
  [int]     --dilation_w                            Dilation in W direction
  [tensor]  --Activation                            Tensor storing the Activation operand
  [tensor]  --Filter                                Tensor storing the Filter operand
  [tensor]  --Output                                Tensor storing the Output operand
  [enum]    --conv_mode                             Convolution filter mode (conv, cross)
  [enum]    --iterator_algorithm,--iterator_algo    Convolution iterator algorithm (analytic, optimized)
  [scalar]  --alpha,--epilogue::alpha               Epilogue scalar alpha
  [scalar]  --beta,--epilogue::beta                 Epilogue scalar beta
  [enum]    --split_k_mode,--split-k-mode           SplitK mode for serial or parallel reduction (serial, parallel)
  [int]     --split_k_slices,--split-k-slices       Number of partitions of K dimension
  [enum]    --eq_gemm_provider,--eq-gemm-provider   Enable profiling equivalent gemm by the following providers (cutlass)
  [enum]    --op_class,--opcode-class               Class of math instruction (simt, tensorop, wmmatensorop, wmma)
  [enum]    --accum,--accumulator-type              Math instruction accumulator data type
  [int]     --cta_m,--threadblock-shape::m          Threadblock shape in the M dimension
  [int]     --cta_n,--threadblock-shape::n          Threadblock shape in the N dimension
  [int]     --cta_k,--threadblock-shape::k          Threadblock shape in the K dimension
  [int]     --stages,--threadblock-stages           Number of stages of threadblock-scoped matrix multiply
  [int]     --warps_m,--warp-count::m               Number of warps within threadblock along the M dimension
  [int]     --warps_n,--warp-count::n               Number of warps within threadblock along the N dimension
  [int]     --warps_k,--warp-count::k               Number of warps within threadblock along the K dimension
  [int]     --inst_m,--instruction-shape::m         Math instruction shape in the M dimension
  [int]     --inst_n,--instruction-shape::n         Math instruction shape in the N dimension
  [int]     --inst_k,--instruction-shape::k         Math instruction shape in the K dimension
  [int]     --min_cc,--minimum-compute-capability   Minimum device compute capability
  [int]     --max_cc,--maximum-compute-capability   Maximum device compute capability

Examples:

Profile a particular convolution (specify all the convolution parameters):

  $ cutlass_profiler --operation=Conv2d --Activation=f16:nhwc \
    --Filter=f16:nhwc --Output=f16 --accumulator-type=f32 \
    --n=32 --h=14 --w=14 --c=8 --k=64 --r=3 --s=3 \
    --pad_h=1 --pad_w=1 \
    --stride::h=1 --stride::w=1 --dilation::h=1 --dilation::w=1

```

## Example CUDA Core Convolution Operation

Example command line for profiling forward propagation convolution kernels on CUDA cores is as follows:
```bash
$ ./tools/profiler/cutlass_profiler --kernels=simt_sfprop --verification-providers=device --n=8 --h=224 --w=224 --c=128 --k=128 --r=3 --s=3

=============================
  Problem ID: 1

        Provider: CUTLASS
   OperationKind: conv2d
       Operation: cutlass_simt_sfprop_optimized_128x128_8x2_nhwc

          Status: Success
    Verification: ON
     Disposition: Passed

reference_device: Passed

       Arguments: --conv_kind=fprop --n=8 --h=224 --w=224 --c=128 --k=128 --r=3 --s=3 --p=224 --q=224 --pad_h=1 --pad_w=1 \
                  --stride_h=1 --stride_w=1 --dilation_h=1 --dilation_w=1 --Activation=f32:nhwc --Filter=f32:nhwc --Output=f32:nhwc \
                  --conv_mode=cross --iterator_algorithm=optimized --alpha=1 --beta=0 --split_k_mode=serial --split_k_slices=1 \
                  --eq_gemm_provider=none --op_class=simt --accum=f32 --cta_m=128 --cta_n=128 --cta_k=8 --stages=2 --warps_m=4 \
                  --warps_n=2 --warps_k=1 --inst_m=1 --inst_n=1 --inst_k=1 --min_cc=50 --max_cc=1024

           Bytes: 2055798784 bytes
           FLOPs: 118482796544 flops

         Runtime: 8.13237 ms
          Memory: 235.431 GiB/s

            Math: 14569.3 GFLOP/s

```

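The report above fills in `--p=224 --q=224` even though only the input and filter extents were given on the command line. The sketch below reproduces those values with standard convolution arithmetic and then views the forward-propagation problem as a GEMM. The helper `conv_out` is hypothetical, and the FLOP count formula is an observation that matches this report, not a documented definition.

```bash
# Output extent of a convolution along one dimension (standard convolution arithmetic).
conv_out() {  # conv_out INPUT PAD FILTER STRIDE DILATION
  echo $(( ($1 + 2 * $2 - $5 * ($3 - 1) - 1) / $4 + 1 ))
}

# Problem taken from the report above: pad 1, stride 1, dilation 1.
n=8; h=224; w=224; c=128; k=128; r=3; s=3
p=$(conv_out "$h" 1 "$r" 1 1)
q=$(conv_out "$w" 1 "$s" 1 1)
echo "p=$p q=$q"                                   # prints: p=224 q=224

# Forward propagation viewed as a GEMM: M = N*P*Q, N = K, K = C*R*S.
gemm_m=$(( n * p * q )); gemm_n=$k; gemm_k=$(( c * r * s ))
echo "gemm: ${gemm_m}x${gemm_n}x${gemm_k}"         # prints: gemm: 401408x128x1152

# Observation: 2*M*N*(K+1) reproduces the FLOPs value in the report above.
echo "flops=$(( 2 * gemm_m * gemm_n * (gemm_k + 1) ))"   # prints: flops=118482796544
```
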
## Example Tensor Core Convolution Operation

Example command line for profiling forward propagation convolution kernels running on Tensor Cores is as follows:
```bash
$ ./tools/profiler/cutlass_profiler --kernels=tensorop*fprop --verification-providers=device --n=8 --h=224 --w=224 --c=128 --k=128 --r=3 --s=3

=============================
  Problem ID: 1

        Provider: CUTLASS
   OperationKind: conv2d
       Operation: cutlass_tensorop_s16816fprop_optimized_f16_128x128_64x4_nhwc

          Status: Success
    Verification: ON
     Disposition: Passed

reference_device: Passed

       Arguments: --conv_kind=fprop --n=8 --h=224 --w=224 --c=128 --k=128 --r=3 --s=3 --p=224 --q=224 --pad_h=1 --pad_w=1 \
                  --stride_h=1 --stride_w=1 --dilation_h=1 --dilation_w=1 --Activation=f16:nhwc --Filter=f16:nhwc --Output=f32:nhwc \
                  --conv_mode=cross --iterator_algorithm=optimized --alpha=1 --beta=0 --split_k_mode=serial --split_k_slices=1 \
                  --eq_gemm_provider=none --op_class=tensorop --accum=f32 --cta_m=128 --cta_n=128 --cta_k=64 --stages=4 \
                  --warps_m=2 --warps_n=2 --warps_k=1 --inst_m=16 --inst_n=8 --inst_k=16 --min_cc=80 --max_cc=1024

           Bytes: 1130659840 bytes
           FLOPs: 118482796544 flops

         Runtime: 0.945071 ms
          Memory: 1114.21 GiB/s

            Math: 125369 GFLOP/s

```

# Copyright

Copyright (c) 2017-2021, NVIDIA CORPORATION. All rights reserved.

```
  Redistribution and use in source and binary forms, with or without modification, are permitted
  provided that the following conditions are met:
      * Redistributions of source code must retain the above copyright notice, this list of
        conditions and the following disclaimer.
      * Redistributions in binary form must reproduce the above copyright notice, this list of
        conditions and the following disclaimer in the documentation and/or other materials
        provided with the distribution.
      * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
        to endorse or promote products derived from this software without specific prior written
        permission.

  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
  IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
  FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
  FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
  BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
  OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
  STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
```