Improved formatting, clarity, and content of several documents. (#64)

* Improved formatting, clarity, and content of several documents.
2019-11-20 10:42:15 -08:00 · 2019-11-20 10:42:15 -08:00 · 8aca98f9a7
commit 8aca98f9a7
parent f4d9c8f755
8 changed files with 118 additions and 65 deletions
--- a/CONTRIBUTORS.md
+++ b/CONTRIBUTORS.md
@ -30,6 +30,7 @@ Fei Hu
 Alan Kaatz  
 Tina Li  
 Timmy Liu  
+Piotr Majcher  
 Duane Merrill  
 Kevin Siu  
 Markus Tavenrath  
--- a/README.md
+++ b/README.md
@ -91,6 +91,7 @@ CUTLASS 2.0 is described in the following documents and the accompanying
 - [GEMM API](media/docs/gemm_api.md) - describes the CUTLASS GEMM model and C++ template concepts 
 - [Code Organization](media/docs/code_organization.md) - describes the organization and contents of the CUTLASS project
 - [Terminology](media/docs/terminology.md) - describes terms used in the code
+- [Programming Guidelines](media/docs/programming_guidelines.md) - guidelines for writing efficient modern CUDA C++
 - [Fundamental types](media/docs/fundamental_types.md) - describes basic C++ classes used in CUTLASS to represent numeric quantities and arrays
 - [Layouts](media/docs/layout.md) - describes layouts of matrices and tensors in memory
 - [Tile Iterators](media/docs/tile_iterator_concept.md) - describes C++ concepts for iterating over tiles of matrices in memory
--- a/media/docs/code_organization.md
+++ b/media/docs/code_organization.md
@ -82,35 +82,44 @@ scripts to span the design space.
 ```
 tools/
  library/                   # static/dynamic library containing all kernel instantiations of interest
-                             # (with some build-level filter switches to compile specific subsets, perhaps by architecture)
+                             # (with some build-level filter switches to compile specific subsets)

    include/
      cutlass/
-        library/                      # header files for CUTLASS Deliverables Library (in cutlass::library:: namespace)
+        library/             # header files for CUTLASS Deliverables Library (in cutlass::library:: namespace)

-          library.h                   # defines enums and structs to describe the tiled structure of operator instances          
-          manifest.h                  # collection of all instances
+          library.h          # defines enums and structs to describe the tiled structure of operator instances          
+          manifest.h         # collection of all instances

-    scripts/                          # scripts to procedurally generate CUTLASS template instances
+    scripts/                 # scripts to procedurally generate CUTLASS template instances

      gemm_operations.py
      library.py
-      generator.py                    # entry point of procedural generation scripts - invoked by cmake
+      generator.py           # entry point of procedural generation scripts - invoked by cmake
      manifest.py

    src/
-    
-
 ```

-## Examples
-
-To demonstrate CUTLASS components, several SDK examples are implemented in `examples/`. 
-
 When CMake is executed, the CUTLASS Instance Library generator scripts are executed to construct a set of
 instantiations in `build/tools/library/generated/`.

-The CUTLASS Profiler is designed to initialize the CUTLASS Instance Library and execute all operations contained therein.
+### CUTLASS Profiler
+
+The CUTLASS Profiler is designed to load the CUTLASS Instance Library and execute all operations contained therein.
+This command-line driven application constructs an execution environment for evaluating functionality and performance. 
+It is implemented in
+```
+tools/
+  profiler/
+```
+
+and may be built as follows.
+```
+$ make cutlass_profiler -j
+```
+
+[Further details about the CUTLASS Profiler are described here.](/media/docs/profiler.md)

 ### CUTLASS Utilities

@ -122,9 +131,13 @@ tools/
    include/
      cutlass/
        util/                   # CUTLASS Utility companion library
-          reference/            #   reference implementation of CUTLASS operators - minimal consideration for performance
+
+          reference/            #  functional reference implementation of CUTLASS operators
+                                #    (minimal consideration for performance)
+            
            detail/
              *
+
            device/             #  device-side reference implementations of CUTLASS operators
              thread/
              kernel/
@ -134,28 +147,36 @@ tools/
          *
 ```

-[More details about CUTLASS Utilities may be found here.](media/docs/utilities.md)
+[More details about CUTLASS Utilities may be found here.](/media/docs/utilities.md)

-### CUTLASS Profiler
-
-This is application constructs an execution environment for evaluating the functionality and performance of
-CUTLASS components. It is implemented in
-```
-tools/
-  profiler/
-```
-
-and may be built as follows.
-```
-$ make cutlass_profiler -j
-```
-
-[Further details about the CUTLASS Profiler are described here.](media/docs/profiler.md)

 ## Examples

 To demonstrate CUTLASS components, several SDK examples are implemented in `examples/`. 

+CUTLASS SDK examples apply CUTLASS templates to implement basic computations.
+
+```
+examples/
+  00_basic_gemm/             # launches a basic GEMM with single precision inputs and outputs
+
+  01_cutlass_utilities/      # demonstrates CUTLASS Utilities for allocating and initializing tensors
+  
+  02_dump_reg_smem/          # debugging utilities for printing register and shared memory contents
+  
+  03_visualize_layout/       # utility for visualizing all layout functions in CUTLASS
+
+  04_tile_iterator/          # example demonstrating an iterator over tiles in memory
+
+  05_batched_gemm/           # example demonstrating CUTLASS's batched strided GEMM operation
+
+  06_splitK_gemm/            # exmaple demonstrating CUTLASS's Split-K parallel reduction kernel
+
+  07_volta_tensorop_gemm/    # example demonstrating mixed precision GEMM using Volta Tensor Cores
+
+  08_turing_tensorop_gemm/   # example demonstrating integer GEMM using Turing Tensor Cores
+```
+
 ## Media

 This directory contains documentation, images, and performance result data which accompanies the CUTLASS library and components.
--- a/media/docs/doxygen_mainpage.md
+++ b/media/docs/doxygen_mainpage.md
@ -1,5 +1,3 @@
-![ALT](/media/images/gemm-hierarchy-with-epilogue-no-labels.png "CUTLASS 2.0")
-
 # CUTLASS 2.0

 _CUTLASS 2.0 - November 2019_
--- a/media/docs/gemm_api.md
+++ b/media/docs/gemm_api.md
@ -173,7 +173,8 @@ struct Mma {
  /// Fragment object loaded from IteratorB (concept: Array<ElementB, ..>)
  struct FragmentB;

-  /// Iterator of C operand in shared memory - satisfies: ReadableRandomAccessTileIteratorConcept | WriteableRandomAccessTileIteratorConcept
+  /// Iterator of C operand in shared memory - 
+  ///    satisfies: ReadableRandomAccessTileIteratorConcept | WriteableRandomAccessTileIteratorConcept
  struct IteratorC;

  /// Fragment object loaded from IteratorC (concept: Array<ElementC, ..>)
@ -322,7 +323,8 @@ struct Mma {
  /// Fragment object loaded from IteratorB (concept: Array<ElementB, ..>)
  struct FragmentB;

-  /// Iterator of C operand in shared memory - satisfies: ReadableRandomAccessTileIteratorConcept | WriteableRandomAccessTileIteratorConcept
+  /// Iterator of C operand in shared memory - 
+  ///     satisfies: ReadableRandomAccessTileIteratorConcept | WriteableRandomAccessTileIteratorConcept
  struct IteratorC;

  /// Fragment object loaded from IteratorC (concept: Array<ElementC, ..>)
--- a/media/docs/profiler.md
+++ b/media/docs/profiler.md
@ -7,27 +7,27 @@
 The CUTLASS Profiler is a command-line driven test and profiling environment for CUTLASS computations
 defined in the CUTLASS Instance Library.

-The CUTLASS Profiler sources are stored in 
-```
-tools/
-  profiler/
-```
-
-and may be compiled as follows.
-```
+The CUTLASS Profiler may be compiled with:
+```bash
 $ make cutlass_profiler -j
 ```

 To limit compilation time, only one tile size (128x128) is instantiated for each data type, math instruction, and layout.
 To instantiate all sizes, set the following environment variable when running CMake from an empty `build/` directory.
-```
+```bash
 $ cmake .. -DCUTLASS_NVCC_ARCHS=75 -DCUTLASS_LIBRARY_KERNELS=all
 ...
 $ make cutlass_profiler -j
 ```

-The CUTLASS Profiler usage statement may be obtained by executing `cutlass_profiler --help` and appears as follows.
+The CUTLASS Profiler sources are stored in 
+```bash
+tools/
+  profiler/
 ```
+
+The CUTLASS Profiler usage statement may be obtained by executing `cutlass_profiler --help` and appears as follows.
+```bash
 CUTLASS Performance Tool
 usage:
    cutlass_profiler [options]
@ -122,7 +122,7 @@ Example:
 The complete set of arguments available to each operation may be viewed by specifying the operation name
 in addition to `--help`. The argument flags and their aliases usable for GEMM appear as follows.

-```
+```bash
 $ ./tools/profiler/cutlass_profiler --operation=gemm --help

 GEMM
@ -190,7 +190,7 @@ Test your changes to gemm kernels with a quick functional test and save results
 ## Example SGEMM

 Example command line for profiling SGEMM kernels is as follows:
-```
+```bash
 $ ./tools/profiler/cutlass_profiler --kernels=sgemm --m=4352 --n=4096 --k=4096

 =============================
@ -202,9 +202,9 @@ $ ./tools/profiler/cutlass_profiler --kernels=sgemm --m=4352 --n=4096 --k=4096
 Disposition: Passed
      Status: Success

-   Arguments:  --m=4352 --n=4096 --k=4096 --A=f32:column --B=f32:column --C=f32:column --alpha=1 --beta=0  \
+   Arguments:  --m=4352 --n=4096 --k=4096 --A=f32:column --B=f32:column --C=f32:column --alpha=1 --beta=0        \
               --split_k_slices=1 --batch_count=1 --op_class=simt --accum=f32 --cta_m=128 --cta_n=128 --cta_k=8  \
-               --stages=2 --warps_m=2 --warps_n=2 --warps_k=1 --inst_m=1 --inst_n=1 --inst_k=1 --min_cc=50  \
+               --stages=2 --warps_m=2 --warps_n=2 --warps_k=1 --inst_m=1 --inst_n=1 --inst_k=1 --min_cc=50       \
               --max_cc=1024

       Bytes: 52428800  bytes
@ -223,7 +223,7 @@ Note, the arguments which appear in the output may be used as command line param

 To execute kernels targeting Tensor Core operations, supply the flag `--op_class=tensorop` in the command line.

-```
+```bash
 $ ./tools/profiler/cutlass_profiler --op_class=tensorop

 =============================
@ -235,9 +235,10 @@ $ ./tools/profiler/cutlass_profiler --op_class=tensorop
 Disposition: Passed
      Status: Success

-   Arguments:  --m=4352 --n=4096 --k=4096 --A=f16:column --B=f16:row --C=f16:column --alpha=1 --beta=0 --split_k_slices=1  \
-               --batch_count=1 --op_class=tensorop --accum=f16 --cta_m=128 --cta_n=128 --cta_k=32 --stages=2  \
-               --warps_m=2 --warps_n=2 --warps_k=1 --inst_m=16 --inst_n=8 --inst_k=8 --min_cc=75 --max_cc=1024  \
+   Arguments:  --m=4352 --n=4096 --k=4096 --A=f16:column --B=f16:row --C=f16:column --alpha=1 --beta=0   \
+               --op_class=tensorop --accum=f16 --cta_m=128 --cta_n=128 --cta_k=32 --stages=2             \
+               --warps_m=2 --warps_n=2 --warps_k=1 --inst_m=16 --inst_n=8 --inst_k=8                     \
+               --min_cc=75 --max_cc=1024


       Bytes: 52428800  bytes
@ -258,7 +259,7 @@ as an inclusive range with the following syntax `start:end:increment` or simply
 For example, the following sweeps over the range of the GEMM K dimension from 8 to 4096 in increments
 of 8 elements.

-```
+```bash
 $ ./tools/profiler/cutlass_profiler --kernels=cutlass_simt_sgemm_128x128_nn --m=4352 --n=4096 --k=8:4096:8
 ```

@ -268,17 +269,18 @@ By default, runtime and computed GFLOP/s are reported for each operation and pro
 a table of comma separated values are reported at the end of the execution. This may be output to a file
 with the `--output=<filename.csv>` command line option as shown:

-```
-$ ./tools/profiler/cutlass_profiler --kernels=cutlass_simt_sgemm_128x128_nn --m=4352 --n=4096 --k=8:4096:8 --output=report.csv
+```bash
+$ ./tools/profiler/cutlass_profiler --kernels=cutlass_simt_sgemm_128x128_nn            \
+                                    --m=4352 --n=4096 --k=8:4096:8 --output=report.csv
 ```

 To faclitate generation of pivot tables and charts, additional columns may be prepended with the
 `--tags=<column>:<value>` option. One or more tags may be specified using a comma-delimited list.

-```
-$ ./tools/profiler/cutlass_profiler --kernels=cutlass_simt_sgemm_128x128_nn \
-  --m=4352 --n=4096 --k=8:4096:8 --output=report.csv \
-  --tags=cutlass:2.0,date:2019-11-19
+```bash
+$ ./tools/profiler/cutlass_profiler --kernels=cutlass_simt_sgemm_128x128_nn            \
+                                    --m=4352 --n=4096 --k=8:4096:8 --output=report.csv \
+                                    --tags=cutlass:2.0,date:2019-11-19
 ```  

 # Copyright
--- a/media/docs/programming_guidelines.md
+++ b/media/docs/programming_guidelines.md
@ -90,14 +90,15 @@ is able to unroll the loop bodies, map array elements to registers, and construc
 All loops expected to be unrolled should be annotated with `CUTLASS_PRAGMA_UNROLL` to explicitly direct the compiler
 to unroll them. 

-```
+```c++
 int const kN = 8;
-Array<float, kN> x;                               // Array we would like to store in registers
+Array<float, kN> x;                       // Array we would like to store in registers

-CUTLASS_PRAGMA_UNROLL                            // Directs the CUDA compiler to unroll this loop.
-for (int idx = 0; idx < kN; ++idx) {             // Loop has constant number of iterations
+CUTLASS_PRAGMA_UNROLL                     // Directs the CUDA compiler to unroll this loop.
+for (int idx = 0; idx < kN; ++idx) {      // Loop has constant number of iterations.

-  x[i] = float(idx);                             // Indirect access by induction variable results in direct register access
+  x[i] = float(idx);                      // Indirect access by induction variable results in 
+                                          // direct register access.
 }
 ```

--- a/media/docs/quickstart.md
+++ b/media/docs/quickstart.md
@ -90,8 +90,35 @@ $ make test_unit -j
 [  PASSED  ] 946 tests.
 $
 ```
+The exact number of tests run is subject to change as we add more functionality.

-No tests should fail.
+No tests should fail. Unit tests automatically construct the appropriate runtime filters
+to avoid executing on architectures that do not support all features under test.
+
+The unit tests are arranged hierarchically mirroring the CUTLASS Template Library. This enables
+parallelism in building and running tests as well as reducing compilation times when a specific
+set of tests are desired.
+
+For example, the following executes strictly the warp-level GEMM tests.
+```bash
+$ make test_unit_gemm_warp -j
+...
+...
+[----------] 3 tests from SM75_warp_gemm_tensor_op_congruous_f16
+[ RUN      ] SM75_warp_gemm_tensor_op_congruous_f16.128x128x8_32x128x8_16x8x8
+[       OK ] SM75_warp_gemm_tensor_op_congruous_f16.128x128x8_32x128x8_16x8x8 (0 ms)
+[ RUN      ] SM75_warp_gemm_tensor_op_congruous_f16.128x128x32_64x64x32_16x8x8
+[       OK ] SM75_warp_gemm_tensor_op_congruous_f16.128x128x32_64x64x32_16x8x8 (2 ms)
+[ RUN      ] SM75_warp_gemm_tensor_op_congruous_f16.128x128x32_32x32x32_16x8x8
+[       OK ] SM75_warp_gemm_tensor_op_congruous_f16.128x128x32_32x32x32_16x8x8 (1 ms)
+[----------] 3 tests from SM75_warp_gemm_tensor_op_congruous_f16 (3 ms total)
+...
+...
+[----------] Global test environment tear-down
+[==========] 104 tests from 32 test cases ran. (294 ms total)
+[  PASSED  ] 104 tests.
+[100%] Built target test_unit_gemm_warp
+```

 ## Using CUTLASS within other applications