574 lines
		
	
	
		
			29 KiB
		
	
	
	
		
			Markdown
		
	
	
	
	
	
			
		
		
	
	
			574 lines
		
	
	
		
			29 KiB
		
	
	
	
		
			Markdown
		
	
	
	
	
	
| 
 | |
| 
 | |
| [README](/README.md#documentation) > **CUTLASS Profiler**
 | |
| 
 | |
| # CUTLASS Profiler
 | |
| 
 | |
| The CUTLASS Profiler is a command-line driven test and profiling environment for CUTLASS computations
 | |
| defined in the CUTLASS Instance Library. The CUTLASS Profiler is capable of executing each GEMM, Sparse Gemm, 
 | |
| Conv2d, and Conv3d kernel.
 | |
| 
 | |
| The CUTLASS Profiler may be compiled with:
 | |
| ```bash
 | |
| $ make cutlass_profiler -j
 | |
| ```
 | |
| 
 | |
| To limit compilation time, only one tile size (typically 128x128) and threadblock cluster size (typically 2x1x1) is instantiated for each data type, 
 | |
| math instruction, and layout. To instantiate all sizes, set the following environment variable when running CMake from an 
 | |
| empty `build/` directory.
 | |
| ```bash
 | |
| $ cmake .. -DCUTLASS_NVCC_ARCHS="70;75;80" -DCUTLASS_LIBRARY_KERNELS=all  -DCUTLASS_UNITY_BUILD_ENABLED=ON
 | |
| ...
 | |
| $ make cutlass_profiler -j
 | |
| ```
 | |
| Enabling the unity build places multiple kernel instances in one compilation unit, thereby reducing size of the compiled
 | |
| binary and avoiding linker limitations on some platforms.
 | |
| 
 | |
| The CUTLASS Profiler sources are stored in 
 | |
| ```bash
 | |
| tools/
 | |
|   profiler/
 | |
| ```
 | |
| 
 | |
| The CUTLASS Profiler usage statement may be obtained by executing `cutlass_profiler --help` and appears as follows.
 | |
| ```bash
 | |
| CUTLASS Performance Tool
 | |
| usage:
 | |
| 
 | |
|     cutlass_profiler [options]
 | |
| 
 | |
|   --help
 | |
| 
 | |
|   --mode=<string>                                  Cutlass profiler execution mode.
 | |
|                                                     --mode=profile    regular verification and profiling (default)
 | |
|                                                     --mode=dry_run    no kernels are launched or workspaces allocated
 | |
|                                                     --mode=enumerate  lists all operation kind and operations
 | |
|                                                     --mode=trace      executes a single device-side computation with
 | |
|                                                                        no other kernel launches
 | |
| 
 | |
|   --device-info                                    Prints information on all GPUs present in the system
 | |
| 
 | |
|   --operation=<operation_kind>                     CUTLASS operation to profile.
 | |
| 
 | |
|   --kernels=<string_list>                          Filter operations by kernel names. For example, call all kernels with
 | |
|                                                    ("s1688" and "nt") or ("s844" and "tn" and "align8") in their
 | |
|                                                    operation name using --kernels="s1688*nt, s884*tn*align8"
 | |
| 
 | |
|   --ignore-kernels=<string_list>                   Excludes kernels whose names match anything in this list.
 | |
| 
 | |
| Device:
 | |
|   --device=<int>                                   CUDA Device ID
 | |
| 
 | |
|   --compute-capability=<int>                       Override the compute capability.
 | |
| 
 | |
|   --llc-capacity=<capacity in KiB>                 Capacity of last-level cache in kilobytes. If this is non-zero,
 | |
|                                                    profiling phases cycle through different input tensors to induce
 | |
|                                                    capacity misses in the L2.
 | |
| 
 | |
| 
 | |
| Initialization:
 | |
|   --initialization=<bool>                          Enables initialization (default: true). If false, device memory is
 | |
|                                                    not initialized after allocation.
 | |
| 
 | |
|   --initialization-provider=<provider>             Selects initialization provider {host, device*}. (default: '*')
 | |
| 
 | |
|   --dist=<distribution>                            Data distribution of input tensors {uniform*, gaussian, identity, sequential}
 | |
|                                                     --dist=uniform,min:<double>,max:<double>,scale:<integer>
 | |
|                                                     --dist=gaussian,mean:<double>,stddev:<double>,scale:<integer>
 | |
|                                                     --dist=sequential,start:<double>,delta:<double>,scale:<integer>
 | |
|                                                     --dist=identity
 | |
| 
 | |
|   --seed=<int>                                     Random number generator seed. Used to enforce deterministic
 | |
|                                                    initialization.
 | |
| 
 | |
| 
 | |
| Library:
 | |
|   --library-algo-mode=<mode>                       Indicates algorithm mode used to call libraries such as cuBLAS and cuDNN.
 | |
|                                                    mode={default*,matching,best}
 | |
| 
 | |
|   --library-algos=<range-list>                     If --algorithm-mode=best, permits specifying a selection of algorithms.
 | |
| 
 | |
| 
 | |
| Profiling:
 | |
|   --workspace-count=<workspace count>              Number of discrete workspaces maintained to avoid cache-resident 
 | |
|                                                  If zero (default), the amount is chosen for each workload based on 
 | |
|                                                  capacity of the last-level cache.
 | |
| 
 | |
|   --profiling-iterations=<iterations>              Number of iterations to profile each kernel. If zero, kernels
 | |
|                                                    are launched up to the profiling duration.
 | |
| 
 | |
|   --warmup-iterations=<iterations>                 Number of iterations to execute each kernel prior to profiling.
 | |
| 
 | |
|   --sleep-duration=<duration>                      Number of ms to sleep between profiling periods (ms).
 | |
| 
 | |
|   --profiling-enabled=<bool>                       If true, profiling is actually conducted.
 | |
| 
 | |
| Verification:
 | |
|   --verification-enabled=<bool>                    Whether to perform verification checks.
 | |
| 
 | |
|   --epsilon=<error>                                Error threshold. Setting to zero (default) requires
 | |
|                                                    bit-level equivalence.
 | |
| 
 | |
|   --nonzero-floor=<floor>                          Results whose absolute value is less than this quantity
 | |
|                                                    are treated as zero for comparisons.
 | |
| 
 | |
|   --save-workspace=<string>                        Specifies when to save the GEMM inputs and results to the filesystem.
 | |
|                                                     --save-workspace=never      never save workspace (default)
 | |
|                                                     --save-workspace=incorrect  save workspace for incorrect results
 | |
|                                                     --save-workspace=always     always save workspace
 | |
| 
 | |
|   --verification-providers=<providers>             List of providers used to verify result. (default: '*')
 | |
|                                                    Gemm verification-providers {cublas*}
 | |
|                                                    Conv2d verification-providers {cudnn*, device*, host}
 | |
| 
 | |
| 
 | |
| Report:
 | |
|   --append=<bool>                                  If true, result is appended to possibly existing file. Otherwise, 
 | |
|                                                    any existing file is overwritten.
 | |
| 
 | |
|   --output=<path>                                  Path to output file for machine readable results. Operation kind and '.csv' is appended.
 | |
| 
 | |
|   --junit-output=<path>                            Path to junit output file for result reporting. Operation kind and '.junit.xml' is appended.
 | |
| 
 | |
|   --report-not-run=<bool>                          If true, reports the status of all kernels including those that
 | |
|                                                    do not satisfy the given arguments.
 | |
| 
 | |
|   --tags=<column:tag,...>                          Inserts leading columns in output table and uniform values for each
 | |
|                                                    column. Useful for generating pivot tables.
 | |
| 
 | |
|   --verbose=<bool>                                 Prints human-readable text to stdout. If false, nothing is written to stdout.
 | |
| 
 | |
| 
 | |
| About:
 | |
|   --version                                        CUTLASS 2.4.0 built on Nov 19 2020 at 11:59:00
 | |
| 
 | |
| 
 | |
| Operations:
 | |
| 
 | |
|      gemm                                          General matrix-matrix product. D = alpha * A*B + beta * C
 | |
|      spgemm                                        Structured sparse GEMM. D = alpha * A*B + beta * C
 | |
|      conv2d                                        Conv2d operation. Output(Tensor4D) = alpha * Input(Tensor4D) * Filter(Tensor4D) + beta * Input(Tensor4D)
 | |
|      conv3d                                        Conv3d operation. Output(Tensor5D) = alpha * Input(Tensor5D) * Filter(Tensor5D) + beta * Input(Tensor5D)
 | |
| 
 | |
| 
 | |
| For details about a particular function, specify the function name with --help.
 | |
| 
 | |
| Example:
 | |
| 
 | |
|   $ cutlass_profiler --operation=Gemm --help
 | |
| 
 | |
|   $ cutlass_profiler --operation=Conv3d --help
 | |
| 
 | |
|   $ cutlass_profiler --operation=Conv2d --help
 | |
| 
 | |
| ```
 | |
| 
 | |
| # GEMM
 | |
| 
 | |
| The CUTLASS Profiler is capable of executing GEMM and Sparse GEMM problems.
 | |
| 
 | |
| The CUTLASS Profiler can be built with cuBLAS enabled to use as a reference implementation. If CMake detects
 | |
| the cuBLAS library available in the system, it is included as a dependency. This may be explicitly overridden
 | |
| with CMake flag `CUTLASS_ENABLE_CUBLAS`.
 | |
| 
 | |
| ## GEMM Arguments
 | |
| 
 | |
| The complete set of arguments available to each operation may be viewed by specifying the operation name
 | |
| in addition to `--help`. The argument flags and their aliases usable for GEMM appear as follows.
 | |
| 
 | |
| ```bash
 | |
| $ ./tools/profiler/cutlass_profiler --operation=gemm --help
 | |
| 
 | |
| GEMM
 | |
| 
 | |
|   [enum]      --Gemm_kind                                       Variant of GEMM (e.g. gemm, batched, ...)
 | |
|   [int]       --m,--problem-size::m                             M dimension of the GEMM problem space
 | |
|   [int]       --n,--problem-size::n                             N dimension of the GEMM problem space
 | |
|   [int]       --k,--problem-size::k                             K dimension of the GEMM problem space
 | |
|   [tensor]    --A                                               Tensor storing the A operand
 | |
|   [tensor]    --B                                               Tensor storing the B operand
 | |
|   [tensor]    --C                                               Tensor storing the C operand
 | |
|   [scalar]    --alpha,--epilogue::alpha                         Epilogue scalar alpha
 | |
|   [scalar]    --beta,--epilogue::beta                           Epilogue scalar beta
 | |
|   [int]       --split_k_slices                                  Number of partitions of K dimension
 | |
|   [int]       --batch_count                                     Number of GEMMs computed in one batch
 | |
|   [enum]      --op_class,--opcode-class                         Class of math instruction (SIMT or TensorOp).
 | |
|   [enum]      --accum,--accumulator-type                        Math instruction accumulator data type.
 | |
|   [int]       --cta_m,--threadblock-shape::m                    Threadblock shape in the M dimension.
 | |
|   [int]       --cta_n,--threadblock-shape::n                    Threadblock shape in the N dimension.
 | |
|   [int]       --cta_k,--threadblock-shape::k                    Threadblock shape in the K dimension.
 | |
|   [int]       --cluster_m,--cluster-shape-shape::m              Cluster shape in the M dimension.
 | |
|   [int]       --cluster_n,--cluster-shape-shape::n              Cluster shape in the N dimension.
 | |
|   [int]       --cluster_k,--cluster-shape-shape::k              Cluster shape in the K dimension.
 | |
|   [int]       --stages,--threadblock-stages                     Number of stages of threadblock-scoped matrix multiply.
 | |
|   [int]       --warps_m,--warp-count::m                         Number of warps within threadblock along the M dimension.
 | |
|   [int]       --warps_n,--warp-count::n                         Number of warps within threadblock along the N dimension.
 | |
|   [int]       --warps_k,--warp-count::k                         Number of warps within threadblock along the K dimension.
 | |
|   [int]       --inst_m,--instruction-shape::m                   Math instruction shape in the M dimension.
 | |
|   [int]       --inst_n,--instruction-shape::n                   Math instruction shape in the N dimension.
 | |
|   [int]       --inst_k,--instruction-shape::k                   Math instruction shape in the K dimension.
 | |
|   [int]       --min_cc,--minimum-compute-capability             Minimum device compute capability.
 | |
|   [int]       --max_cc,--maximum-compute-capability             Maximum device compute capability.
 | |
| 
 | |
| Examples:
 | |
| 
 | |
| Profile a particular problem size:
 | |
|   $ ./tools/profiler/cutlass_profiler --operation=Gemm --m=1024 --n=1024 --k=128
 | |
| 
 | |
| Schmoo over problem size and beta:
 | |
|   $ ./tools/profiler/cutlass_profiler --operation=Gemm --m=1024:4096:256 --n=1024:4096:256 --k=128:8192:128 --beta=0,1,2
 | |
| 
 | |
| Schmoo over accumulator types:
 | |
|   $ ./tools/profiler/cutlass_profiler --operation=Gemm --accumulator-type=f16,f32
 | |
| 
 | |
| Run when A is f16 with column-major and B is any datatype with row-major 
 | |
| (For column major, use column, col, or n. For row major use, row or t):
 | |
| 
 | |
|   $ ./tools/profiler/cutlass_profiler --operation=Gemm --A=f16:column --B=*:row
 | |
| 
 | |
| Using various input value distribution:
 | |
|   $ ./tools/profiler/cutlass_profiler --operation=Gemm --dist=uniform,min:0,max:3
 | |
|   $ ./tools/profiler/cutlass_profiler --operation=Gemm --dist=gaussian,mean:0,stddev:3
 | |
|   $ ./tools/profiler/cutlass_profiler --operation=Gemm --dist=sequential,start:0,delta:1
 | |
| 
 | |
| Run a kernel with cta tile size of 256x128x32 and save workspace if results are incorrect 
 | |
| (note that --cta-tile::k=32 is default cta-tile size):
 | |
|  $ ./tools/profiler/cutlass_profiler --operation=Gemm --cta_m=256 --cta_n=128  --cta_k=32 --save-workspace=incorrect
 | |
| 
 | |
| Test your changes to gemm kernels with a quick functional test and save results in functional-test.csv:
 | |
|  $ ./tools/profiler/cutlass_profiler  --operation=Gemm \
 | |
|    --m=8,56,120,136,256,264,512,520,1024,1032,4096,8192,16384 \
 | |
|    --n=8,56,120,136,256,264,512,520,1024,1032,4096,8192,16384 \
 | |
|    --k=8,16,32,64,128,256,288,384,504,512,520 \
 | |
|    --beta=0,1,2 --profiling-iterations=1 \
 | |
|    --output=functional-test.csv
 | |
| ```
 | |
| 
 | |
| ## Example CUDA Core GEMM Operation
 | |
| 
 | |
| Example command line for profiling SGEMM kernels is as follows:
 | |
| ```bash
 | |
| $ ./tools/profiler/cutlass_profiler --kernels=sgemm --m=3456 --n=4096 --k=4096
 | |
| 
 | |
| 
 | |
| 
 | |
| =============================
 | |
|   Problem ID: 1
 | |
| 
 | |
|         Provider: CUTLASS
 | |
|    OperationKind: gemm
 | |
|        Operation: cutlass_simt_sgemm_128x128_8x2_nn_align1
 | |
| 
 | |
|           Status: Success
 | |
|     Verification: ON
 | |
|      Disposition: Passed
 | |
| 
 | |
|           cuBLAS: Passed
 | |
| 
 | |
|        Arguments: --m=3456 --n=4096 --k=4096 --A=f32:column --B=f32:column --C=f32:column --alpha=1 --beta=0 --split_k_slices=1  \
 | |
|                   --batch_count=1 --op_class=simt --accum=f32 --cta_m=128 --cta_n=128 --cta_k=8 --stages=2 --warps_m=4  \
 | |
|                   --warps_n=2 --warps_k=1 --inst_m=1 --inst_n=1 --inst_k=1 --min_cc=50 --max_cc=1024
 | |
| 
 | |
|            Bytes: 180355072  bytes
 | |
|            FLOPs: 115992428544  flops
 | |
| 
 | |
|          Runtime: 6.73655  ms
 | |
|           Memory: 24.934 GiB/s
 | |
| 
 | |
|             Math: 17218.4 GFLOP/s
 | |
| ```
 | |
| 
 | |
| Note, the arguments which appear in the output may be used as command line parameters for subsequent invocations.
 | |
| 
 | |
| 
 | |
| ## Example Tensor Core GEMM Operations
 | |
| 
 | |
| To execute kernels targeting Tensor Core operations, supply the flag `--op_class=tensorop` in the command line.
 | |
| ```bash
 | |
| $ ./tools/profiler/cutlass_profiler --op_class=tensorop --m=3456 --n=4096 --k=8192
 | |
| 
 | |
| 
 | |
| 
 | |
| =============================
 | |
|   Problem ID: 1
 | |
| 
 | |
|         Provider: CUTLASS
 | |
|    OperationKind: gemm
 | |
|        Operation: cutlass_tensorop_s16816gemm_f16_256x128_32x3_nn_align8
 | |
| 
 | |
|           Status: Success
 | |
|     Verification: ON
 | |
|      Disposition: Passed
 | |
| 
 | |
|           cuBLAS: Passed
 | |
| 
 | |
|        Arguments: --m=3456 --n=4096 --k=8192 --A=f16:column --B=f16:column --C=f32:column --alpha=1 --beta=0 --split_k_slices=1  \
 | |
|                   --batch_count=1 --op_class=tensorop --accum=f32 --cta_m=256 --cta_n=128 --cta_k=32 --stages=3 --warps_m=4  \
 | |
|                   --warps_n=2 --warps_k=1 --inst_m=16 --inst_n=8 --inst_k=16 --min_cc=80 --max_cc=1024
 | |
| 
 | |
|            Bytes: 180355072  bytes
 | |
|            FLOPs: 231956545536  flops
 | |
| 
 | |
|          Runtime: 0.98647  ms
 | |
|           Memory: 170.272 GiB/s
 | |
| 
 | |
|             Math: 235138 GFLOP/s
 | |
| ```
 | |
| 
 | |
| ## Covering the problem space
 | |
| 
 | |
| All arguments may have single values or comma-delimited set of values. Integers may also be specified
 | |
| as an inclusive range with the following syntax `start:end:increment` or simply `start:end`. 
 | |
| 
 | |
| For example, the following sweeps over the range of the GEMM K dimension from 8 to 4096 in increments
 | |
| of 8 elements.
 | |
| 
 | |
| ```bash
 | |
| $ ./tools/profiler/cutlass_profiler --kernels=cutlass_simt_sgemm_128x128_nn --m=4352 --n=4096 --k=8:4096:8
 | |
| ```
 | |
| 
 | |
| ## Output
 | |
| 
 | |
| By default, runtime and computed GFLOP/s are reported for each operation and problem size. Additionally,
 | |
| a table of comma separated values are reported at the end of the execution. This may be output to a file
 | |
| with the `--output=<filename.csv>` command line option as shown:
 | |
| 
 | |
| ```bash
 | |
| $ ./tools/profiler/cutlass_profiler --kernels=cutlass_simt_sgemm_128x128_nn            \
 | |
|                                     --m=3456 --n=4096 --k=8:4096:8 --output=report.csv
 | |
| ```
 | |
| 
 | |
| To faclitate generation of pivot tables and charts, additional columns may be prepended with the
 | |
| `--tags=<column>:<value>` option. One or more tags may be specified using a comma-delimited list.
 | |
| 
 | |
| ```bash
 | |
| $ ./tools/profiler/cutlass_profiler --kernels=cutlass_simt_sgemm_128x128_nn            \
 | |
|                                     --m=3456 --n=4096 --k=8:4096:8 --output=report.csv \
 | |
|                                     --tags=cutlass:2.2,date:2020-06-08
 | |
| ```
 | |
| 
 | |
| ## CUTLASS 3.0 GEMM procedural names
 | |
| 
 | |
| CUTLASS 3.0 introduces a new naming convention for GEMMs used by the profiler targeting the NVIDIA
 | |
| Hopper architecture and beyond so as to indicate new features of the kernel within the name
 | |
| (e.g., the cluster shape).
 | |
| 
 | |
| To best illustrate this naming convention, we will walk through the meaning of each of the components
 | |
| in a GEMM kernel used by the profiler:
 | |
| ```
 | |
| cutlass3x_sm90_tensorop_s64x128x16gemm_f16_f16_f32_f32_128x128x64_2x1x1_0_ntn_align8
 | |
| ```
 | |
| 
 | |
| The components within this name are as follows:
 | |
| 
 | |
| * `cutlass3x`: indicates that the kernel was generated through the CUTLASS 3.0 API
 | |
| * `sm90`: indicates that the kernel targets NVIDIA GPUs with compute capability 90
 | |
| * `tensorop`: indicates that the kernel makes use of NVIDIA Tensor Cores
 | |
| (as opposed to `simt`, which indicates the use of "CUDA cores")
 | |
| * `s`: indicates that the Tensor Core instruction being used accumulates in single precision
 | |
| (as opposed to `h`, which indicates half precision)
 | |
| * `64x128x16gemm`: indicates that the shape of the Tensor Core instruction being used (MxNxK) is 64x128x16
 | |
| * `f16_f16_f32_f16`: indicates that the data types for operands A, B, and C are each `f16`
 | |
| (half precision) and that accumulation is performed using `f32` (single precision)
 | |
| * `128x128x64`: indicates that the thread block shape used in the GEMM (MxNxK) is 128x128x64
 | |
| * `2x1x1`: indicates that the cluster shape being used is 2x1x1
 | |
| * `0`: indicates that the kernel uses the CollectiveBuilder's automatic stage calculation to determine the
 | |
| number of pipeline stages in the kernel. Note that `0` does not mean that no stages are used. A nonzero value indicates that automatic stage calculation is not performed and indicates the number of pipeline stages to be used.
 | |
| This 0 is only added to the kernel's procedural name, the profiler will still report the actual stage count
 | |
| when printing the kernel argument details (`--stages=N`) and kernel discovery will still support filtering through the `--stages` argument.
 | |
| * `ntn`: indicates that the layouts for operands A, B, and C are column major ("n"; non-transposed),
 | |
| row major ("t"; transposed), and column major, respectively.
 | |
| * `align8`: indicates that the maximum alignment between operands A and B is 8.
 | |
| 
 | |
| Note that in some special cases where the input A/B types do not match that of the MMA
 | |
| instruction's, the MMA facing input type is added to the instruction string as well.
 | |
| 
 | |
| ```
 | |
| cutlass3x_sm90_tensorop_s64x128x8tf32gemm_f32_f32_f32_f32_128x128x32_2x1x1_0_tnn_align4
 | |
| ```
 | |
| 
 | |
| * `s64x128x8tf32gemm`: indicates that the MMA consumes inputs in `tf32` format, and therefore
 | |
| the kernel performs rounding of the `f32` values in global memory while loading them into shared memory.
 | |
| 
 | |
| # Convolution
 | |
| 
 | |
| The CUTLASS Profiler is capable of executing 2-D and 3-D convolution problems for forwards and backwards
 | |
| operator variants.
 | |
| 
 | |
| The CUTLASS Profiler can be built with cuDNN enabled to use as a reference implementation. If CMake detects
 | |
| the cuDNN library available in the system, it is included as a dependency. This may be explicitly overridden
 | |
| with CMake flag `CUTLASS_ENABLE_CUDNN`. 
 | |
| 
 | |
| ```bash
 | |
| $ cmake .. -DCUTLASS_LIBRARY_OPERATIONS=conv2d -DCUTLASS_ENABLE_CUDNN=OFF
 | |
| ...
 | |
| $ make -j16 cutlass_profiler
 | |
| ```
 | |
| 
 | |
| 
 | |
| ## Convolution Arguments
 | |
| 
 | |
| ```bash
 | |
| $ ./tools/profiler/cutlass_profiler --help --operation=Conv2d
 | |
| 
 | |
| Conv2d
 | |
| 
 | |
|   [enum]      --conv_kind                                       Convolutional operator (fprop, dgrad, wgrad)
 | |
|   [int]       --n,--input_n                                     Input N dimension of the Conv2d problem space
 | |
|   [int]       --h,--input_h                                     Input H dimension of the Conv2d problem space
 | |
|   [int]       --w,--input_w                                     Input W dimension of the Conv2d problem space
 | |
|   [int]       --c,--input_c                                     Input C dimension of the Conv2d problem space
 | |
|   [int]       --k,--filter_k                                    Filter K dimension of the Conv2d problem space
 | |
|   [int]       --r,--filter_r                                    Filter R dimension of the Conv2d problem space
 | |
|   [int]       --s,--filter_s                                    Filter S dimension of the Conv2d problem space
 | |
|   [int]       --p,--output_p                                    Output P dimension of the Conv2d problem space
 | |
|   [int]       --q,--output_q                                    Output Q dimension of the Conv2d problem space
 | |
|   [int]       --pad_h                                           Padding in H direction
 | |
|   [int]       --pad_w                                           Padding in W direction
 | |
|   [int]       --stride_h                                        Stride in H direction
 | |
|   [int]       --stride_w                                        Stride in W direction
 | |
|   [int]       --dilation_h                                      Dilation in H direction
 | |
|   [int]       --dilation_w                                      Dilation in W direction
 | |
|   [tensor]    --Activation                                      Tensor storing the Activation operand
 | |
|   [tensor]    --Filter                                          Tensor storing the Filter operand
 | |
|   [tensor]    --Output                                          Tensor storing the Output operand
 | |
|   [enum]      --conv_mode                                       Convolution filter mode (conv, cross)
 | |
|   [enum]      --iterator_algorithm,--iterator_algo              Convolution iterator algorithm (analytic, optimized)
 | |
|   [scalar]    --alpha,--epilogue::alpha                         Epilogue scalar alpha
 | |
|   [scalar]    --beta,--epilogue::beta                           Epilogue scalar beta
 | |
|   [enum]      --split_k_mode,--split-k-mode                     SplitK mode for serial or parallel reduction (serial, parallel)
 | |
|   [int]       --split_k_slices,--split-k-slices                 Number of partitions of K dimension
 | |
|   [enum]      --eq_gemm_provider,--eq-gemm-provider             Enable profiling equivalent gemm by the following providers (cutlass)
 | |
|   [enum]      --op_class,--opcode-class                         Class of math instruction (simt, tensorop, wmmatensorop, wmma)
 | |
|   [enum]      --accum,--accumulator-type                        Math instruction accumulator data type
 | |
|   [int]       --cta_m,--threadblock-shape::m                    Threadblock shape in the M dimension
 | |
|   [int]       --cta_n,--threadblock-shape::n                    Threadblock shape in the N dimension
 | |
|   [int]       --cta_k,--threadblock-shape::k                    Threadblock shape in the K dimension
 | |
|   [int]       --stages,--threadblock-stages                     Number of stages of threadblock-scoped matrix multiply
 | |
|   [int]       --warps_m,--warp-count::m                         Number of warps within threadblock along the M dimension
 | |
|   [int]       --warps_n,--warp-count::n                         Number of warps within threadblock along the N dimension
 | |
|   [int]       --warps_k,--warp-count::k                         Number of warps within threadblock along the K dimension
 | |
|   [int]       --inst_m,--instruction-shape::m                   Math instruction shape in the M dimension
 | |
|   [int]       --inst_n,--instruction-shape::n                   Math instruction shape in the N dimension
 | |
|   [int]       --inst_k,--instruction-shape::k                   Math instruction shape in the K dimension
 | |
|   [int]       --min_cc,--minimum-compute-capability             Minimum device compute capability
 | |
|   [int]       --max_cc,--maximum-compute-capability             Maximum device compute capability
 | |
| 
 | |
| Examples:
 | |
| 
 | |
| Profile a particular convolution (specify all the convolution parameters):
 | |
| 
 | |
|  $ cutlass_profiler --operation=Conv2d --Activation=f16:nhwc   \
 | |
|   --Filter=f16:nhwc --Output=f16 --accumulator-type=f32        \
 | |
|   --n=32 --h=14 --w=14 --c=8 --k=64 --r=3 --s=3                \
 | |
|   --pad_h=1 --pad_w=1                                          \
 | |
|   --stride::h=1 --stride::w=1 --dilation::h=1 --dilation::w=1
 | |
| 
 | |
| ```
 | |
| 
 | |
| ## Example CUDA Core Convolution Operation
 | |
| 
 | |
| Example command line for profiling forward propagation convolution kernels on CUDA cores is as follows:
 | |
| ```bash
 | |
| $ ./tools/profiler/cutlass_profiler --kernels=simt_sfprop  --verification-providers=device --n=8 --h=224 --w=224 --c=128 --k=128 --r=3 --s=3
 | |
| 
 | |
| 
 | |
| =============================
 | |
|   Problem ID: 1
 | |
| 
 | |
|         Provider: CUTLASS
 | |
|    OperationKind: conv2d
 | |
|        Operation: cutlass_simt_sfprop_optimized_128x128_8x2_nhwc
 | |
| 
 | |
|           Status: Success
 | |
|     Verification: ON
 | |
|      Disposition: Passed
 | |
| 
 | |
| reference_device: Passed
 | |
| 
 | |
|        Arguments: --conv_kind=fprop --n=8 --h=224 --w=224 --c=128 --k=128 --r=3 --s=3 --p=224 --q=224 --pad_h=1 --pad_w=1  \
 | |
|                   --stride_h=1 --stride_w=1 --dilation_h=1 --dilation_w=1 --Activation=f32:nhwc --Filter=f32:nhwc --Output=f32:nhwc  \
 | |
|                   --conv_mode=cross --iterator_algorithm=optimized --alpha=1 --beta=0 --split_k_mode=serial --split_k_slices=1  \
 | |
|                   --eq_gemm_provider=none --op_class=simt --accum=f32 --cta_m=128 --cta_n=128 --cta_k=8 --stages=2 --warps_m=4  \
 | |
|                   --warps_n=2 --warps_k=1 --inst_m=1 --inst_n=1 --inst_k=1 --min_cc=50 --max_cc=1024
 | |
| 
 | |
|            Bytes: 2055798784  bytes
 | |
|            FLOPs: 118482796544  flops
 | |
| 
 | |
|          Runtime: 8.13237  ms
 | |
|           Memory: 235.431 GiB/s
 | |
| 
 | |
|             Math: 14569.3 GFLOP/s
 | |
| 
 | |
| ```
 | |
| 
 | |
| ## Example Tensor Core Convolution Operation
 | |
| 
 | |
| Example command line for profiling forward propagation convolution kernels runing on Tensor Cores is as follows:
 | |
| ```bash
 | |
| $ ./tools/profiler/cutlass_profiler --kernels=tensorop*fprop  --verification-providers=device --n=8 --h=224 --w=224 --c=128 --k=128 --r=3 --s=3 
 | |
| 
 | |
| 
 | |
| 
 | |
| =============================
 | |
|   Problem ID: 1
 | |
| 
 | |
|         Provider: CUTLASS
 | |
|    OperationKind: conv2d
 | |
|        Operation: cutlass_tensorop_s16816fprop_optimized_f16_128x128_64x4_nhwc
 | |
| 
 | |
|           Status: Success
 | |
|     Verification: ON
 | |
|      Disposition: Passed
 | |
| 
 | |
| reference_device: Passed
 | |
| 
 | |
|        Arguments: --conv_kind=fprop --n=8 --h=224 --w=224 --c=128 --k=128 --r=3 --s=3 --p=224 --q=224 --pad_h=1 --pad_w=1  \
 | |
|                   --stride_h=1 --stride_w=1 --dilation_h=1 --dilation_w=1 --Activation=f16:nhwc --Filter=f16:nhwc --Output=f32:nhwc  \
 | |
|                   --conv_mode=cross --iterator_algorithm=optimized --alpha=1 --beta=0 --split_k_mode=serial --split_k_slices=1  \
 | |
|                   --eq_gemm_provider=none --op_class=tensorop --accum=f32 --cta_m=128 --cta_n=128 --cta_k=64 --stages=4  \
 | |
|                   --warps_m=2 --warps_n=2 --warps_k=1 --inst_m=16 --inst_n=8 --inst_k=16 --min_cc=80 --max_cc=1024
 | |
| 
 | |
|            Bytes: 1130659840  bytes
 | |
|            FLOPs: 118482796544  flops
 | |
| 
 | |
|          Runtime: 0.945071  ms
 | |
|           Memory: 1114.21 GiB/s
 | |
| 
 | |
|             Math: 125369 GFLOP/s
 | |
| 
 | |
| 
 | |
| ```
 | |
| 
 | |
| # Copyright
 | |
| 
 | |
| Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 | |
| SPDX-License-Identifier: BSD-3-Clause
 | |
| 
 | |
| ```
 | |
|   Redistribution and use in source and binary forms, with or without
 | |
|   modification, are permitted provided that the following conditions are met:
 | |
| 
 | |
|   1. Redistributions of source code must retain the above copyright notice, this
 | |
|   list of conditions and the following disclaimer.
 | |
| 
 | |
|   2. Redistributions in binary form must reproduce the above copyright notice,
 | |
|   this list of conditions and the following disclaimer in the documentation
 | |
|   and/or other materials provided with the distribution.
 | |
| 
 | |
|   3. Neither the name of the copyright holder nor the names of its
 | |
|   contributors may be used to endorse or promote products derived from
 | |
|   this software without specific prior written permission.
 | |
| 
 | |
|   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
 | |
|   AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 | |
|   IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
 | |
|   DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
 | |
|   FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
 | |
|   DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
 | |
|   SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
 | |
|   CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
 | |
|   OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
 | |
|   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 | |
| ```
 | 
