From ccb697bac77fcc898e9c897b2c90aa5b60ac72fb Mon Sep 17 00:00:00 2001 From: Manish Gupta Date: Sun, 22 Nov 2020 18:11:37 -0800 Subject: [PATCH] cutlass 2.4 documentation only update --- CHANGELOG.md | 2 +- README.md | 133 +++++++++++++++++-- media/docs/functionality.md | 11 +- media/docs/implicit_gemm_convolution.md | 4 +- media/docs/profiler.md | 165 +++++++++++++++--------- media/docs/quickstart.md | 68 +++++++--- 6 files changed, 279 insertions(+), 104 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index eded0a4e..d90f7137 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -8,7 +8,7 @@ * Spatial dimensions: 1-D, 2-D, and 3-D * Layout: NHWC, NCxHWx * Implicit GEMM convolution components: - * Global memory iterators supporting fprop, dgrad, and wgrad + * Global memory iterators supporting Fprop, Dgrad, and Wgrad * `MmaMultistage` for implicit GEMM convolution for NVIDIA Ampere architecture * `MmaPipeline` for implicit GEMM convolution for NVIDIA Volta and Turing architectures * [Documentation](/media/docs/implicit_gemm_convolution.md) describing Implicit GEMM Convolution algorithm and implementation diff --git a/README.md b/README.md index d7a1d7d4..d8855c73 100644 --- a/README.md +++ b/README.md @@ -288,6 +288,7 @@ It can be built as follows: ```bash $ make cutlass_profiler -j16 ``` +## Building all GEMM and Convolution kernels (_long_ build times) By default, only one tile size is instantiated for each data type, math instruction, and layout. To instantiate all, set the following environment variable when running CMake from an empty `build/` directory. @@ -298,17 +299,71 @@ $ cmake .. -DCUTLASS_NVCC_ARCHS=75 -DCUTLASS_LIBRARY_KERNELS=all $ make cutlass_profiler -j16 ``` -To compile strictly one kernel or a small set of kernels, a comma-delimited list of kernel names with -wildcard characters may be reduce the set of kernels. 
The following builds exactly one kernel: 
+## Building a subset of GEMM and Convolution kernels (_reduced_ build times)
+To compile strictly one kernel or a small set of kernels, a comma-delimited list of kernel names with
+wildcard characters may be used to reduce the set of kernels. The following examples show building exactly one
+or a subset of kernels for NVIDIA Ampere and Turing architectures:
+
+### Building a subset of Tensor Core GEMM kernels
+
+To compile a subset of Tensor Core GEMM kernels with FP32 accumulation and FP16 input targeting NVIDIA Ampere and Turing architectures,
+use the cmake command line below:
```bash
-$ cmake .. -DCUTLASS_NVCC_ARCHS=75 -DCUTLASS_LIBRARY_KERNELS=cutlass_simt_sgemm_128x128_8x2_nn_align1
+$ cmake .. -DCUTLASS_NVCC_ARCHS='75;80' -DCUTLASS_LIBRARY_KERNELS=cutlass_tensorop_s*gemm_f16_*_nt_align8
...
$ make cutlass_profiler -j16
```

-Example command line for profiling SGEMM kernels is as follows:
+Example command line for profiling a subset of Tensor Core GEMM kernels is as follows:
+```bash
+$ ./tools/profiler/cutlass_profiler --kernels=cutlass_tensorop_s*gemm_f16_*_nt_align8 --m=3456 --n=4096 --k=4096
+
+...
+=============================
+  Problem ID: 1
+
+    Provider: CUTLASS
+    OperationKind: gemm
+    Operation: cutlass_tensorop_s1688gemm_f16_256x128_32x2_nt_align8
+
+    Status: Success
+    Verification: ON
+    Disposition: Passed
+
+reference_device: Passed
+    cuBLAS: Passed
+
+    Arguments: --gemm_kind=universal --m=3456 --n=4096 --k=4096 --A=f16:column --B=f16:row --C=f32:column --alpha=1 \
+      --beta=0 --split_k_slices=1 --batch_count=1 --op_class=tensorop --accum=f32 --cta_m=256 --cta_n=128 \
+      --cta_k=32 --stages=2 --warps_m=4 --warps_n=2 --warps_k=1 --inst_m=16 --inst_n=8 --inst_k=8 --min_cc=75 \
+      --max_cc=1024
+
+    Bytes: 118489088 bytes
+    FLOPs: 115992428544 flops
+
+    Runtime: 1.55948 ms
+    Memory: 70.7616 GiB/s
+
+    Math: 74378.8 GFLOP/s
+
+
+
+=============================
+...
```
+
+### Building one CUDA Core GEMM kernel
+
+To compile one SGEMM kernel targeting NVIDIA Ampere and Turing architectures, use the cmake command line below:
+```bash
+$ cmake .. -DCUTLASS_NVCC_ARCHS='75;80' -DCUTLASS_LIBRARY_KERNELS=cutlass_simt_sgemm_128x128_8x2_nn_align1
+...
+$ make cutlass_profiler -j16
+```
+
+Example command line for profiling a single SGEMM CUDA Core kernel is as follows:
+```bash
$ ./tools/profiler/cutlass_profiler --kernels=sgemm --m=3456 --n=4096 --k=4096

=============================
@@ -335,24 +390,69 @@ $ ./tools/profiler/cutlass_profiler --kernels=sgemm --m=3456 --n=4096 --k=4096
    Memory: 24.934 GiB/s

    Math: 17218.4 GFLOP/s
+
+=============================
```

-To compile strictly 2-D or 3-D convolution kernels, filter by operation
+### Building a subset of Tensor Core Convolution kernels
+
+To compile a subset of Tensor Core convolution kernels implementing forward propagation (fprop) with FP32 accumulation
+and FP16 input targeting NVIDIA Ampere and Turing architectures, use the cmake command line below:
```bash
-$ cmake .. -DCUTLASS_NVCC_ARCHS=75 -DCUTLASS_LIBRARY_OPERATIONS=conv2d,conv3d
+$ cmake .. -DCUTLASS_NVCC_ARCHS='75;80' -DCUTLASS_LIBRARY_KERNELS=cutlass_tensorop_s*fprop_optimized_f16
...
$ make cutlass_profiler -j16
```

-or by name
+Example command line for profiling a subset of Tensor Core convolution kernels is as follows:
```bash
-$ cmake .. -DCUTLASS_NVCC_ARCHS=80 -DCUTLASS_LIBRARY_KERNELS=sfprop,s16816fprop,s16816dgrad,s16816wgrad
+$ ./tools/profiler/cutlass_profiler --kernels=cutlass_tensorop_s*fprop_optimized_f16 --n=8 --h=224 --w=224 --c=128 --k=128 --r=3 --s=3
+
+...
+=============================
+  Problem ID: 1
+
+    Provider: CUTLASS
+    OperationKind: conv2d
+    Operation: cutlass_tensorop_s16816fprop_optimized_f16_128x128_32x5_nhwc
+
+    Status: Success
+    Verification: ON
+    Disposition: Passed
+
+reference_device: Passed
+
+    Arguments: --conv_kind=fprop --n=8 --h=224 --w=224 --c=128 --k=128 --r=3 --s=3 --p=224 --q=224 --pad_h=1 --pad_w=1 \
+      --stride_h=1 --stride_w=1 --dilation_h=1 --dilation_w=1 --Activation=f16:nhwc --Filter=f16:nhwc --Output=f32:nhwc \
+      --conv_mode=cross --iterator_algorithm=optimized --alpha=1 --beta=0 --split_k_mode=serial --split_k_slices=1 \
+      --eq_gemm_provider=none --op_class=tensorop --accum=f32 --cta_m=128 --cta_n=128 --cta_k=32 --stages=5 \
+      --warps_m=2 --warps_n=2 --warps_k=1 --inst_m=16 --inst_n=8 --inst_k=16 --min_cc=80 --max_cc=1024
+
+    Bytes: 1130659840 bytes
+    FLOPs: 118482796544 flops
+
+    Runtime: 0.711496 ms
+    Memory: 1479.99 GiB/s
+
+    Math: 166526 GFLOP/s
+
+=============================
+...
+```
+
+
+### Building one CUDA Core Convolution kernel
+
+To compile and run one CUDA Core convolution kernel implementing forward propagation (fprop) with FP32 accumulation
+and FP32 input targeting NVIDIA Ampere and Turing architectures, use the cmake command line below:
+```bash
+$ cmake .. -DCUTLASS_NVCC_ARCHS='75;80' -DCUTLASS_LIBRARY_KERNELS=cutlass_simt_sfprop_optimized_128x128_8x2_nhwc
+...
$ make cutlass_profiler -j16
```

-Example command line for profiling 2-D convolution kernels is as follows:
+Example command line for profiling one CUDA Core convolution kernel:
```bash
$ ./tools/profiler/cutlass_profiler --kernels=cutlass_simt_sfprop_optimized_128x128_8x2_nhwc --n=8 --h=224 --w=224 --c=128 --k=128 --r=3 --s=3
@@ -380,14 +480,21 @@ reference_device: Passed

    Bytes: 2055798784 bytes
    FLOPs: 118482796544 flops

-    Runtime: 8.13237 ms
-    Memory: 235.431 GiB/s
+    Runtime: 7.34266 ms
+    Memory: 260.752 GiB/s

-    Math: 14569.3 GFLOP/s
+    Math: 16136.2 GFLOP/s
+
+
+=============================
```

-[Further details about the CUTLASS Profiler are described here.](media/docs/profiler.md)
+## More Details on Compiling CUTLASS Kernels and the CUTLASS Profiler
+- Please follow the links below for more CMake examples on selectively compiling CUTLASS kernels:
+  - [GEMM CMake Examples](media/docs/quickstart.md#gemm-cmake-examples)
+  - [Implicit GEMM convolution CMake Examples](media/docs/quickstart.md#convolution-cmake-examples)
+- [Further details about the CUTLASS Profiler are described here.](media/docs/profiler.md)

# About

diff --git a/media/docs/functionality.md b/media/docs/functionality.md
index 77f1ba14..aeb9bcf3 100644
--- a/media/docs/functionality.md
+++ b/media/docs/functionality.md
@@ -56,14 +56,15 @@
One can find and/or create equivalent dgrad and wgrad convolutional operators.
| **Simt** | 50,60,61,70,75 | 9.2+ | `cf32 * cf32 + cf32 => cf32` | NHWC | [example](/test/unit/conv/device/conv2d_fprop_implicit_gemm_cf32nhwc_cf32nhwc_cf32nhwc_simt_f32_sm50.cu) | | **TensorOp** | 70 | 10.1+ | `f16 * f16 + f32 => {f16, f32}`| NHWC | [example](/test/unit/conv/device/conv2d_fprop_implicit_gemm_f16nhwc_f16nhwc_f32nhwc_tensor_op_f32_sm70.cu) | | **TensorOp** | 75 | 10.2+ | `f16 * f16 + f32 => {f16, f32}`| NHWC | [example](/test/unit/conv/device/conv2d_fprop_implicit_gemm_f16nhwc_f16nhwc_f32nhwc_tensor_op_f32_sm75.cu) | -| **TensorOp** | 75 | 10.2+ | `s8 * s8 + s32 => {s32, s8}` | NHWC | [example](/test/unit/conv/device/conv2d_fprop_implicit_gemm_s8nhwc_s8nhwc_s32nhwc_tensor_op_s32_sm75.cu) | -| **Simt** | 80 | 11.0+ | `f32 * f32 + f32 => f32` | NHWC | [example](/test/unit/conv/device/conv2d_fprop_implicit_gemm_f32nhwc_f32nhwc_f32nhwc_simt_f32_sm80.cu) | -| **Simt** | 80 | 11.0+ | `cf32 * cf32 + cf32 => cf32` | NHWC | [example](/test/unit/conv/device/conv2d_fprop_implicit_gemm_cf32nhwc_cf32nhwc_cf32nhwc_simt_f32_sm80.cu) | +| **TensorOp** | 75 | 10.2+ | `s8 * s8 + s32 => {s32, s8}` | NHWC, NCxHWx | [example](/test/unit/conv/device/conv2d_fprop_implicit_gemm_s8nhwc_s8nhwc_s32nhwc_tensor_op_s32_sm75.cu), [ncxhwx](/test/unit/conv/device/conv2d_fprop_implicit_gemm_s8ncxhwx_s8cxrskx_s8ncxhwx_tensor_op_s32_sm75.cu) | +| **TensorOp** | 75 | 10.2+ | `s4 * s4 + s32 => {s32, s4}` | NHWC, NCxHWx | [example](/test/unit/conv/device/conv2d_fprop_implicit_gemm_s4nhwc_s4nhwc_s32nhwc_tensor_op_s32_sm75.cu), [ncxhwx](/test/unit/conv/device/conv2d_fprop_implicit_gemm_s4ncxhwx_s4cxrskx_s4ncxhwx_tensor_op_s32_sm75.cu) | +| **Simt** | 80 | 11.0+ | `f32 * f32 + f32 => f32` | NHWC | [example](/test/unit/conv/device/conv2d_fprop_implicit_gemm_f32nhwc_f32nhwc_f32nhwc_simt_f32_sm80.cu) | +| **Simt** | 80 | 11.0+ | `cf32 * cf32 + cf32 => cf32` | NHWC | [example](/test/unit/conv/device/conv2d_fprop_implicit_gemm_cf32nhwc_cf32nhwc_cf32nhwc_simt_f32_sm80.cu) | | **TensorOp** | 80 | 
11.0+ | `f16 * f16 + f32 => {f16, f32}`| NHWC | [example](/test/unit/conv/device/conv2d_fprop_implicit_gemm_f16nhwc_f16nhwc_f32nhwc_tensor_op_f32_sm80.cu) | | **TensorOp** | 80 | 11.0+ | `f16 * f16 + f16 => f16` | NHWC | [example](/test/unit/conv/device/conv2d_fprop_implicit_gemm_f16nhwc_f16nhwc_f32nhwc_tensor_op_f32_sm80.cu) | | **TensorOp** | 80 | 11.0+ | `tf32 * tf32 + f32 => f32` | NHWC | [example](/test/unit/conv/device/conv2d_fprop_implicit_gemm_tf32nhwc_tf32nhwc_f32nhwc_tensor_op_f32_sm80.cu) | -| **TensorOp** | 80 | 11.0+ | `s8 * s8 + s32 => {s32, s8}` | NHWC | [example](/test/unit/conv/device/conv2d_fprop_implicit_gemm_s8nhwc_s8nhwc_s32nhwc_tensor_op_s32_sm80.cu) | -| **TensorOp** | 80 | 11.0+ | `s4 * s4 + s32 => {s32, s4}` | NHWC | [example](/test/unit/conv/device/conv2d_fprop_implicit_gemm_s4nhwc_s4nhwc_s32nhwc_tensor_op_s32_sm80.cu) | +| **TensorOp** | 80 | 11.0+ | `s8 * s8 + s32 => {s32, s8}` | NHWC, NCxHWx | [example](/test/unit/conv/device/conv2d_fprop_implicit_gemm_s8nhwc_s8nhwc_s32nhwc_tensor_op_s32_sm80.cu), [ncxhwx](/test/unit/conv/device/conv2d_fprop_implicit_gemm_s8ncxhwx_s8cxrskx_s8ncxhwx_tensor_op_s32_sm80.cu) | +| **TensorOp** | 80 | 11.0+ | `s4 * s4 + s32 => {s32, s4}` | NHWC, NCxHWx | [example](/test/unit/conv/device/conv2d_fprop_implicit_gemm_s4nhwc_s4nhwc_s32nhwc_tensor_op_s32_sm80.cu), [ncxhwx](/test/unit/conv/device/conv2d_fprop_implicit_gemm_s4ncxhwx_s4cxrskx_s4ncxhwx_tensor_op_s32_sm80.cu) | diff --git a/media/docs/implicit_gemm_convolution.md b/media/docs/implicit_gemm_convolution.md index c86d41df..5cc0a258 100644 --- a/media/docs/implicit_gemm_convolution.md +++ b/media/docs/implicit_gemm_convolution.md @@ -51,7 +51,7 @@ f(p, r) = p * stride_h + R - r - 1 + pad_h g(q, s) = h * stride_w + S - s - 1 + pad_w ``` -A [host](/tools/util/include/reference/host/convolution.h) and [device](/tools/util/include/reference/device/convolution.h) +A [host](/tools/util/include/cutlass/util/reference/host/convolution.h) and 
[device](/tools/util/include/cutlass/util/reference/device/convolution.h)
reference implementation are provided in the CUTLASS Utilities.

This computation may be mapped to the elements of a matrix product as follows.
@@ -347,7 +347,7 @@ creating GEMM-B tile in shared memory.
The improvements covered by optimized iterators are:
 - (a) Precomputing kernel-invariant pointer deltas on the host
 - (b) Computing cta-invariant mask predicates on device-side iterator ctors
-- (c) Use of [fast divmod](include/cutlass/fast_math.h) to map GEMM dimenstions to convolution tensors.
+- (c) Use of [fast divmod](/include/cutlass/fast_math.h) to map GEMM dimensions to convolution tensors.

For example, _optimized_ activation iterator uses fast divmod to map GEMM _M_ to NPQ
for activation iterator
diff --git a/media/docs/profiler.md b/media/docs/profiler.md
index 032848c6..c7ce91a7 100644
--- a/media/docs/profiler.md
+++ b/media/docs/profiler.md
@@ -5,15 +5,17 @@
# CUTLASS Profiler

The CUTLASS Profiler is a command-line driven test and profiling environment for CUTLASS computations
-defined in the CUTLASS Instance Library.
+defined in the CUTLASS Instance Library. The CUTLASS Profiler is capable of executing each GEMM, Sparse GEMM,
+Conv2d, and Conv3d kernel.

The CUTLASS Profiler may be compiled with:
```bash
$ make cutlass_profiler -j
```

-To limit compilation time, only one tile size (128x128) is instantiated for each data type, math instruction, and layout.
-To instantiate all sizes, set the following environment variable when running CMake from an empty `build/` directory.
+To limit compilation time, only one tile size (typically 128x128) is instantiated for each data type,
+math instruction, and layout. To instantiate all sizes, set the following environment variable when running CMake from an
+empty `build/` directory.
```bash
$ cmake .. -DCUTLASS_NVCC_ARCHS="70;75;80" -DCUTLASS_LIBRARY_KERNELS=all -DCUTLASS_UNITY_BUILD_ENABLED=ON
...
@@ -32,82 +34,121 @@ The CUTLASS Profiler usage statement may be obtained by executing `cutlass_profi ```bash CUTLASS Performance Tool usage: + cutlass_profiler [options] - + --help - - --mode={profile*,single,dry,enumerate} Regular profiling, single kernel mode only, or no profiling. - - --device-info Prints information on all GPUs present in the system - - --operation= CUTLASS operation to profile. - - --kernels= Names of individual kernels to execute. All are executed if not specified. + + --mode= Cutlass profiler execution mode. + --mode=profile regular verification and profiling (default) + --mode=dry_run no kernels are launched or workspaces allocated + --mode=enumerate lists all operation kind and operations + --mode=trace executes a single device-side computation with + no other kernel launches + + --device-info Prints information on all GPUs present in the system + + --operation= CUTLASS operation to profile. + + --kernels= Filter operations by kernel names. For example, call all kernels with + ("s1688" and "nt") or ("s844" and "tn" and "align8") in their + operation name using --kernels="s1688*nt, s884*tn*align8" + + --ignore-kernels= Excludes kernels whose names match anything in this list. Device: - --device= CUDA Device ID + --device= CUDA Device ID + + --compute-capability= Override the compute capability. + + --llc-capacity= Capacity of last-level cache in kilobytes. If this is non-zero, + profiling phases cycle through different input tensors to induce + capacity misses in the L2. + Initialization: - --initialization= Enables initialization (default: true). If false, device memory is - not initialized after allocation. + --initialization= Enables initialization (default: true). If false, device memory is + not initialized after allocation. - --initialization-provider= Selects 'device' or 'host' initialization. + --initialization-provider= Selects initialization provider {host, device*}. 
(default: '*') - --dist= Data distribution of input tensors + --dist= Data distribution of input tensors {uniform*, gaussian, identity, sequential} + --dist=uniform,min:,max:,scale: + --dist=gaussian,mean:,stddev:,scale: + --dist=sequential,start:,delta:,scale: + --dist=identity + + --seed= Random number generator seed. Used to enforce deterministic + initialization. - --seed= Random number generator seed. Used to enforce deterministic - initialization. Library: - --library-algo-mode= Indicates algorithm mode used to call libraries such as cuBLAS and cuDNN. - mode={default*,matching,best} + --library-algo-mode= Indicates algorithm mode used to call libraries such as cuBLAS and cuDNN. + mode={default*,matching,best} + + --library-algos= If --algorithm-mode=best, permits specifying a selection of algorithms. - --library-algos= If --algorithm-mode=best, permits specifying a selection of algorithms. Profiling: - --profiling-iterations= Number of iterations to profile each kernel. If zero, kernels - are launched up to the profiling duration. + --workspace-count= Number of discrete workspaces maintained to avoid cache-resident + If zero (default), the amount is chosen for each workload based on + capacity of the last-level cache. + + --profiling-iterations= Number of iterations to profile each kernel. If zero, kernels + are launched up to the profiling duration. + + --warmup-iterations= Number of iterations to execute each kernel prior to profiling. + + --sleep-duration= Number of ms to sleep between profiling periods (ms). + + --profiling-enabled= If true, profiling is actually conducted. + + --providers= List of providers to be profiled for performance. (default: '*') + Gemm providers {cutlass*, cublas*} + Conv2d providers {cutlass*, cudnn*} - --warmup-iterations= Number of iterations to execute each kernel prior to profiling. - - --sleep-duration= Number of ms to sleep between profiling periods (ms) - - --profiling-enabled= If true, profiling is actually conducted. 
- - --providers= List of providers to be profiled for performance Verification: - --verification-enabled= Whether to perform verification checks. + --verification-enabled= Whether to perform verification checks. - --epsilon= Error threshold. Setting to zero (default) requires - bit-level equivalence. + --epsilon= Error threshold. Setting to zero (default) requires + bit-level equivalence. - --nonzero-floor= Results whose absolute value is less than this quantity - are treated as zero for comparisons. + --nonzero-floor= Results whose absolute value is less than this quantity + are treated as zero for comparisons. - --save-workspace={*never,incorrect,always} Specifies when to save the GEMM inputs and results to the filesystem. + --save-workspace= Specifies when to save the GEMM inputs and results to the filesystem. + --save-workspace=never never save workspace (default) + --save-workspace=incorrect save workspace for incorrect results + --save-workspace=always always save workspace + + --verification-providers= List of providers used to verify result. (default: '*') + Gemm verification-providers {cublas*} + Conv2d verification-providers {cudnn*, device*, host} - --verification-providers= List of providers used to verify result. (default: cublas) Report: - --append= If true, result is appended to possibly existing file. Otherwise, - any existing file is overwritten. + --append= If true, result is appended to possibly existing file. Otherwise, + any existing file is overwritten. - --output= Path to output file for machine readable results. + --output= Path to output file for machine readable results. Operation kind and '.csv' is appended. - --report-not-run= If true, reports the status of all kernels including those that - do not satisfy the given arguments. + --junit-output= Path to junit output file for result reporting. Operation kind and '.junit.xml' is appended. - --tags= Inserts leading columns in output table and uniform values for each - column. 
Useful for generating pivot tables.

+  --report-not-run=                            If true, reports the status of all kernels including those that
+                                               do not satisfy the given arguments.
+
+  --tags=                                      Inserts leading columns in output table and uniform values for each
+                                               column. Useful for generating pivot tables.
+
+  --verbose=                                   Prints human-readable text to stdout. If false, nothing is written to stdout.

-  --verbose=                                   If true (default), prints human-readable text to stdout.

About:
-  --version                                    CUTLASS 2.2.0 built on Jun  8 2020 at 07:59:33
+  --version                                    CUTLASS 2.4.0 built on Nov 19 2020 at 11:59:00
+
Operations:
-
  --operation=                                 Specifies a particular operation to run or print the usage statement.

    gemm                                       General matrix-matrix product. D = alpha * A*B + beta * C
    spgemm                                     Structured sparse GEMM. D = alpha * A*B + beta * C
@@ -115,7 +156,7 @@ Operations:
    conv3d                                     Conv3d operation. Output(Tensor5D) = alpha * Input(Tensor5D) * Filter(Tensor5D) + beta * Input(Tensor5D)

-For more details about a particular operation, specify the operation name with --help.
+For details about a particular function, specify the function name with --help.

Example:

@@ -125,12 +166,15 @@ Example:
$ cutlass_profiler --operation=Conv2d --help

-
+  $ cutlass_profiler --operation=SparseGemm --help
```

# GEMM

-The CUTLASS Profiler is capable of executing each GEMM kernel.
+The CUTLASS Profiler is capable of executing GEMM and Sparse GEMM problems.
+
+The CUTLASS Profiler can be built with cuBLAS enabled for use as a reference implementation. If CMake detects
+the cuBLAS library in the system, it is included as a dependency. This may be explicitly overridden
+with the CMake flag `CUTLASS_ENABLE_CUBLAS`.
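As a sketch of how the override above might be used, the profiler can be configured without the cuBLAS reference provider; the `CUTLASS_ENABLE_CUBLAS` flag comes from the text above, while the architecture value here is purely illustrative:

```bash
# Hypothetical configuration: force-disable the cuBLAS reference provider.
# The architecture value (80) is illustrative; choose the GPUs you target.
$ cmake .. -DCUTLASS_NVCC_ARCHS=80 -DCUTLASS_ENABLE_CUBLAS=OFF
$ make cutlass_profiler -j16
```

With cuBLAS disabled, verification would rely on the profiler's built-in reference implementations rather than the cuBLAS provider.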
## GEMM Arguments
@@ -202,7 +246,7 @@ Test your changes to gemm kernels with a quick functional test and save results
    --providers=cutlass --output=functional-test.csv
```

-## Example CUDA Core GEMM Operation (SGEMM)
+## Example CUDA Core GEMM Operation

Example command line for profiling SGEMM kernels is as follows:
```bash
@@ -239,10 +283,9 @@ $ ./tools/profiler/cutlass_profiler --kernels=sgemm --m=3456 --n=4096 --k=4096
Note, the arguments which appear in the output may be used as command line parameters for subsequent invocations.

-## Example Tensor Core GEMM Operations (S16816GEMM)
+## Example Tensor Core GEMM Operations

To execute kernels targeting Tensor Core operations, supply the flag `--op_class=tensorop` in the command line.
-
```bash
$ ./tools/profiler/cutlass_profiler --op_class=tensorop --m=3456 --n=4096 --k=8192

@@ -382,12 +425,11 @@ Profile a particular convolution (specify all the convolution parameters):
```

-## Example CUDA Core Convolution Operation (SFPROP)
-
-Example command line for profiling Convolution kernels is as follows:
+## Example CUDA Core Convolution Operation
+Example command line for profiling forward propagation convolution kernels on CUDA Cores is as follows:
```bash
-$ ./tools/profiler/cutlass_profiler --kernels=cutlass_simt_sfprop_optimized_128x128_8x2_nhwc --verification-providers=device --n=8 --h=224 --w=224 --c=128 --k=128 --r=3 --s=3
+$ ./tools/profiler/cutlass_profiler --kernels=simt_sfprop --verification-providers=device --n=8 --h=224 --w=224 --c=128 --k=128 --r=3 --s=3

=============================
@@ -419,12 +461,11 @@ reference_device: Passed
```

-## Example Tensor Core Convolution Operation (S16816FPROP)
-
-Example command line for profiling Convolution kernels is as follows:
+## Example Tensor Core Convolution Operation
+Example command line for profiling forward propagation convolution kernels running on Tensor Cores is as follows:
```bash
-$ ./tools/profiler/cutlass_profiler
--kernels=cutlass_tensorop_s16816fprop_optimized_f16_128x128_64x4_nhwc --verification-providers=device --n=8 --h=224 --w=224 --c=128 --k=128 --r=3 --s=3 +$ ./tools/profiler/cutlass_profiler --kernels=tensorop*fprop --verification-providers=device --n=8 --h=224 --w=224 --c=128 --k=128 --r=3 --s=3 diff --git a/media/docs/quickstart.md b/media/docs/quickstart.md index 425d9270..f283da8a 100644 --- a/media/docs/quickstart.md +++ b/media/docs/quickstart.md @@ -47,6 +47,7 @@ You may also filter kernels by name by supplying a filter string with flag `CUTL ```bash $ cmake .. -DCUTLASS_NVCC_ARCHS=80 -DCUTLASS_LIBRARY_KERNELS=s16816gemm,s16816fprop*128x128 ``` +See more examples on selectively compiling CUTLASS GEMM and convolution kernels [here](media/docs/quickstart.md#example-cmake-commands). You may explicitly exclude cuBLAS and cuDNN as dependencies with the following CMake flags. - `-DCUTLASS_ENABLE_CUBLAS=OFF` @@ -87,14 +88,14 @@ $ ./tools/profiler/cutlass_profiler --kernels=sgemm --m=4352 --n=4096 --k=4096 Math: 13854.9 GFLOP/s ``` -To execute the CUTLASS Profiler for Convolution, run the following example. +To execute the CUTLASS Profiler for convolution, run the following example. ```bash $ ./tools/profiler/cutlass_profiler --kernels=s1688fprop --n=8 --h=224 --w=224 --c=128 --k=128 --r=3 --s=3 --pad_h=1 --pad_w=1 ``` To execute all CUTLASS 2-D convolution operators, execute the following. ```bash -$ ./tools/profiler/cutlass_profiler --operation=conv2d--n=8 --h=224 --w=224 --c=128 --k=128 --r=3 --s=3 +$ ./tools/profiler/cutlass_profiler --operation=conv2d --n=8 --h=224 --w=224 --c=128 --k=128 --r=3 --s=3 ============================= @@ -462,52 +463,77 @@ int main() { } ``` -Kernels can be selectively included in the CUTLASS Library by specifying filter strings when -executing CMake. For example, only single-precision GEMM kernels can be instantiated as follows. 
+# Example CMake Commands
+To instantiate all operations supporting all tile sizes, data types, and alignment constraints, specify
+`-DCUTLASS_LIBRARY_KERNELS=all` when running `cmake`.
```bash
-$ cmake .. -DCUTLASS_NVCC_ARCHS=75 -DCUTLASS_LIBRARY_KERNELS=sgemm
+$ cmake .. -DCUTLASS_NVCC_ARCHS='70;75;80' -DCUTLASS_LIBRARY_KERNELS=all
+```
+The above command line generates about seven thousand kernels targeting NVIDIA Ampere, Turing, and Volta architectures.
+Compiling thousands of kernels for three different architectures is time consuming. Additionally, it results
+in a large binary and may cause the linker to fail on some platforms when building the library.
+
+Enabling the "unity build" instantiates multiple kernel instances in each compilation unit, thereby reducing binary size
+and avoiding linker limitations on some platforms.
+```bash
+$ cmake .. -DCUTLASS_NVCC_ARCHS="70;75;80" -DCUTLASS_LIBRARY_KERNELS=all -DCUTLASS_UNITY_BUILD_ENABLED=ON
```
+It is advised to compile CUTLASS kernels only for the NVIDIA architectures one plans to run on. Furthermore, kernels
+can be selectively included in the CUTLASS Library by specifying filter strings and wildcard characters when executing CMake.
+
+Several examples are defined below for convenience. They may be combined as a comma-delimited list. Compiling only the kernels desired reduces compilation time.

-To instantiate kernels of all tile sizes, data types, and alignment constraints, specify
-`-DCUTLASS_LIBRARY_KERNELS=all` when running `cmake`.
-
-Several recipes are defined below for convenience. They may be combined as a comma-delimited list.
+## GEMM CMake Examples

**Example.** All GEMM kernels targeting NVIDIA Ampere Tensor Cores.
```bash
$ cmake .. -DCUTLASS_NVCC_ARCHS=80 -DCUTLASS_LIBRARY_KERNELS=tensorop*gemm
```

-**Example.** All kernels for NVIDIA Volta, Turing, and Ampere architectures. 
Enabling -the "unity build" instantiates multiple kernel instances in each compilation unit, thereby -reducing binary size and avoiding linker limitations on some platforms. -```bash -$ cmake .. -DCUTLASS_NVCC_ARCHS="70;75;80" -DCUTLASS_LIBRARY_KERNELS=all -DCUTLASS_UNITY_BUILD_ENABLED=ON -``` - -**Example.** All GEMM kernels targeting Turing Tensor Cores. +**Example.** All GEMM kernels targeting NVIDIA Turing Tensor Cores. ```bash $ cmake .. -DCUTLASS_NVCC_ARCHS=75 -DCUTLASS_LIBRARY_KERNELS=tensorop*gemm ``` -**Example.** All GEMM kernels with single-precision accumulation. +**Example.** All GEMM kernels with FP32 accumulation targeting NVIDIA Ampere, Turing, and Volta architectures. ```bash $ cmake .. -DCUTLASS_NVCC_ARCHS="70;75;80" -DCUTLASS_LIBRARY_KERNELS=s*gemm ``` -**Example.** All kernels which expect A and B to be column-major. +**Example.** All kernels which expect A and B to be column-major or row-major targeting NVIDIA Ampere, Turing, and Volta architectures. ```bash -$ cmake .. -DCUTLASS_NVCC_ARCHS="70;75;80" -DCUTLASS_LIBRARY_KERNELS=gemm*nn +$ cmake .. -DCUTLASS_NVCC_ARCHS="70;75;80" -DCUTLASS_LIBRARY_KERNELS=gemm*nn,gemm*tt ``` -**Example.** All planar complex GEMM variants. +**Example.** All planar complex GEMM variants targeting NVIDIA Ampere, Turing, and Volta architectures. ```bash $ cmake .. -DCUTLASS_NVCC_ARCHS="70;75;80" -DCUTLASS_LIBRARY_KERNELS=planar_complex ``` +## Convolution CMake Examples +**Example.** All convolution kernels targeting NVIDIA Ampere's 16816 Tensor Core operation +```bash +$ cmake .. -DCUTLASS_NVCC_ARCHS='80' -DCUTLASS_LIBRARY_KERNELS=s16816fprop,s16816dgrad,s16816wgrad +``` + +**Example.** All forward propagation (fprop) convolution kernels targeting CUDA Cores for multiple NVIDIA architectures +```bash +$ cmake .. 
-DCUTLASS_NVCC_ARCHS='50;60;61;70;75;80' -DCUTLASS_LIBRARY_KERNELS=sfprop
+```
+
+**Example.** All forward propagation (fprop) convolution kernels with FP32 accumulation and FP16 input targeting NVIDIA Ampere's 16816 Tensor Core operation
+```bash
+$ cmake .. -DCUTLASS_NVCC_ARCHS='80' -DCUTLASS_LIBRARY_KERNELS=s16816fprop_*_f16
+```
+
+**Example.** All backward weight gradient (wgrad) convolution kernels with FP32 accumulation, FP16 input, and optimized global memory iterator
+targeting NVIDIA Ampere, Turing, and Volta Tensor Core operations
+```bash
+$ cmake .. -DCUTLASS_NVCC_ARCHS='70;75;80' -DCUTLASS_LIBRARY_KERNELS=tensorop*s*wgrad_optimized_f16
+```

# Copyright