Improved formatting, clarity, and content of several documents. (#64)
* Improved formatting, clarity, and content of several documents.
This commit is contained in:
parent
f4d9c8f755
commit
8aca98f9a7
@ -30,6 +30,7 @@ Fei Hu
|
||||
Alan Kaatz
|
||||
Tina Li
|
||||
Timmy Liu
|
||||
Piotr Majcher
|
||||
Duane Merrill
|
||||
Kevin Siu
|
||||
Markus Tavenrath
|
||||
|
||||
@ -91,6 +91,7 @@ CUTLASS 2.0 is described in the following documents and the accompanying
|
||||
- [GEMM API](media/docs/gemm_api.md) - describes the CUTLASS GEMM model and C++ template concepts
|
||||
- [Code Organization](media/docs/code_organization.md) - describes the organization and contents of the CUTLASS project
|
||||
- [Terminology](media/docs/terminology.md) - describes terms used in the code
|
||||
- [Programming Guidelines](media/docs/programming_guidelines.md) - guidelines for writing efficient modern CUDA C++
|
||||
- [Fundamental types](media/docs/fundamental_types.md) - describes basic C++ classes used in CUTLASS to represent numeric quantities and arrays
|
||||
- [Layouts](media/docs/layout.md) - describes layouts of matrices and tensors in memory
|
||||
- [Tile Iterators](media/docs/tile_iterator_concept.md) - describes C++ concepts for iterating over tiles of matrices in memory
|
||||
|
||||
@ -82,35 +82,44 @@ scripts to span the design space.
|
||||
```
|
||||
tools/
|
||||
library/ # static/dynamic library containing all kernel instantiations of interest
|
||||
# (with some build-level filter switches to compile specific subsets, perhaps by architecture)
|
||||
# (with some build-level filter switches to compile specific subsets)
|
||||
|
||||
include/
|
||||
cutlass/
|
||||
library/ # header files for CUTLASS Deliverables Library (in cutlass::library:: namespace)
|
||||
library/ # header files for CUTLASS Deliverables Library (in cutlass::library:: namespace)
|
||||
|
||||
library.h # defines enums and structs to describe the tiled structure of operator instances
|
||||
manifest.h # collection of all instances
|
||||
library.h # defines enums and structs to describe the tiled structure of operator instances
|
||||
manifest.h # collection of all instances
|
||||
|
||||
scripts/ # scripts to procedurally generate CUTLASS template instances
|
||||
scripts/ # scripts to procedurally generate CUTLASS template instances
|
||||
|
||||
gemm_operations.py
|
||||
library.py
|
||||
generator.py # entry point of procedural generation scripts - invoked by cmake
|
||||
generator.py # entry point of procedural generation scripts - invoked by cmake
|
||||
manifest.py
|
||||
|
||||
src/
|
||||
|
||||
|
||||
```
|
||||
|
||||
## Examples
|
||||
|
||||
To demonstrate CUTLASS components, several SDK examples are implemented in `examples/`.
|
||||
|
||||
When CMake is executed, the CUTLASS Instance Library generator scripts are executed to construct a set of
|
||||
instantiations in `build/tools/library/generated/`.
|
||||
|
||||
The CUTLASS Profiler is designed to initialize the CUTLASS Instance Library and execute all operations contained therein.
|
||||
### CUTLASS Profiler
|
||||
|
||||
The CUTLASS Profiler is designed to load the CUTLASS Instance Library and execute all operations contained therein.
|
||||
This command-line driven application constructs an execution environment for evaluating functionality and performance.
|
||||
It is implemented in
|
||||
```
|
||||
tools/
|
||||
profiler/
|
||||
```
|
||||
|
||||
and may be built as follows.
|
||||
```
|
||||
$ make cutlass_profiler -j
|
||||
```
|
||||
|
||||
[Further details about the CUTLASS Profiler are described here.](/media/docs/profiler.md)
|
||||
|
||||
### CUTLASS Utilities
|
||||
|
||||
@ -122,9 +131,13 @@ tools/
|
||||
include/
|
||||
cutlass/
|
||||
util/ # CUTLASS Utility companion library
|
||||
reference/ # reference implementation of CUTLASS operators - minimal consideration for performance
|
||||
|
||||
reference/ # functional reference implementation of CUTLASS operators
|
||||
# (minimal consideration for performance)
|
||||
|
||||
detail/
|
||||
*
|
||||
|
||||
device/ # device-side reference implementations of CUTLASS operators
|
||||
thread/
|
||||
kernel/
|
||||
@ -134,28 +147,36 @@ tools/
|
||||
*
|
||||
```
|
||||
|
||||
[More details about CUTLASS Utilities may be found here.](media/docs/utilities.md)
|
||||
[More details about CUTLASS Utilities may be found here.](/media/docs/utilities.md)
|
||||
|
||||
### CUTLASS Profiler
|
||||
|
||||
This is application constructs an execution environment for evaluating the functionality and performance of
|
||||
CUTLASS components. It is implemented in
|
||||
```
|
||||
tools/
|
||||
profiler/
|
||||
```
|
||||
|
||||
and may be built as follows.
|
||||
```
|
||||
$ make cutlass_profiler -j
|
||||
```
|
||||
|
||||
[Further details about the CUTLASS Profiler are described here.](media/docs/profiler.md)
|
||||
|
||||
## Examples
|
||||
|
||||
To demonstrate CUTLASS components, several SDK examples are implemented in `examples/`.
|
||||
|
||||
CUTLASS SDK examples apply CUTLASS templates to implement basic computations.
|
||||
|
||||
```
|
||||
examples/
|
||||
00_basic_gemm/ # launches a basic GEMM with single precision inputs and outputs
|
||||
|
||||
01_cutlass_utilities/ # demonstrates CUTLASS Utilities for allocating and initializing tensors
|
||||
|
||||
02_dump_reg_smem/ # debugging utilities for printing register and shared memory contents
|
||||
|
||||
03_visualize_layout/ # utility for visualizing all layout functions in CUTLASS
|
||||
|
||||
04_tile_iterator/ # example demonstrating an iterator over tiles in memory
|
||||
|
||||
05_batched_gemm/ # example demonstrating CUTLASS's batched strided GEMM operation
|
||||
|
||||
06_splitK_gemm/ # exmaple demonstrating CUTLASS's Split-K parallel reduction kernel
|
||||
|
||||
07_volta_tensorop_gemm/ # example demonstrating mixed precision GEMM using Volta Tensor Cores
|
||||
|
||||
08_turing_tensorop_gemm/ # example demonstrating integer GEMM using Turing Tensor Cores
|
||||
```
|
||||
|
||||
## Media
|
||||
|
||||
This directory contains documentation, images, and performance result data which accompanies the CUTLASS library and components.
|
||||
|
||||
@ -1,5 +1,3 @@
|
||||

|
||||
|
||||
# CUTLASS 2.0
|
||||
|
||||
_CUTLASS 2.0 - November 2019_
|
||||
|
||||
@ -173,7 +173,8 @@ struct Mma {
|
||||
/// Fragment object loaded from IteratorB (concept: Array<ElementB, ..>)
|
||||
struct FragmentB;
|
||||
|
||||
/// Iterator of C operand in shared memory - satisfies: ReadableRandomAccessTileIteratorConcept | WriteableRandomAccessTileIteratorConcept
|
||||
/// Iterator of C operand in shared memory -
|
||||
/// satisfies: ReadableRandomAccessTileIteratorConcept | WriteableRandomAccessTileIteratorConcept
|
||||
struct IteratorC;
|
||||
|
||||
/// Fragment object loaded from IteratorC (concept: Array<ElementC, ..>)
|
||||
@ -322,7 +323,8 @@ struct Mma {
|
||||
/// Fragment object loaded from IteratorB (concept: Array<ElementB, ..>)
|
||||
struct FragmentB;
|
||||
|
||||
/// Iterator of C operand in shared memory - satisfies: ReadableRandomAccessTileIteratorConcept | WriteableRandomAccessTileIteratorConcept
|
||||
/// Iterator of C operand in shared memory -
|
||||
/// satisfies: ReadableRandomAccessTileIteratorConcept | WriteableRandomAccessTileIteratorConcept
|
||||
struct IteratorC;
|
||||
|
||||
/// Fragment object loaded from IteratorC (concept: Array<ElementC, ..>)
|
||||
|
||||
@ -7,27 +7,27 @@
|
||||
The CUTLASS Profiler is a command-line driven test and profiling environment for CUTLASS computations
|
||||
defined in the CUTLASS Instance Library.
|
||||
|
||||
The CUTLASS Profiler sources are stored in
|
||||
```
|
||||
tools/
|
||||
profiler/
|
||||
```
|
||||
|
||||
and may be compiled as follows.
|
||||
```
|
||||
The CUTLASS Profiler may be compiled with:
|
||||
```bash
|
||||
$ make cutlass_profiler -j
|
||||
```
|
||||
|
||||
To limit compilation time, only one tile size (128x128) is instantiated for each data type, math instruction, and layout.
|
||||
To instantiate all sizes, set the following environment variable when running CMake from an empty `build/` directory.
|
||||
```
|
||||
```bash
|
||||
$ cmake .. -DCUTLASS_NVCC_ARCHS=75 -DCUTLASS_LIBRARY_KERNELS=all
|
||||
...
|
||||
$ make cutlass_profiler -j
|
||||
```
|
||||
|
||||
The CUTLASS Profiler usage statement may be obtained by executing `cutlass_profiler --help` and appears as follows.
|
||||
The CUTLASS Profiler sources are stored in
|
||||
```bash
|
||||
tools/
|
||||
profiler/
|
||||
```
|
||||
|
||||
The CUTLASS Profiler usage statement may be obtained by executing `cutlass_profiler --help` and appears as follows.
|
||||
```bash
|
||||
CUTLASS Performance Tool
|
||||
usage:
|
||||
cutlass_profiler [options]
|
||||
@ -122,7 +122,7 @@ Example:
|
||||
The complete set of arguments available to each operation may be viewed by specifying the operation name
|
||||
in addition to `--help`. The argument flags and their aliases usable for GEMM appear as follows.
|
||||
|
||||
```
|
||||
```bash
|
||||
$ ./tools/profiler/cutlass_profiler --operation=gemm --help
|
||||
|
||||
GEMM
|
||||
@ -190,7 +190,7 @@ Test your changes to gemm kernels with a quick functional test and save results
|
||||
## Example SGEMM
|
||||
|
||||
Example command line for profiling SGEMM kernels is as follows:
|
||||
```
|
||||
```bash
|
||||
$ ./tools/profiler/cutlass_profiler --kernels=sgemm --m=4352 --n=4096 --k=4096
|
||||
|
||||
=============================
|
||||
@ -202,9 +202,9 @@ $ ./tools/profiler/cutlass_profiler --kernels=sgemm --m=4352 --n=4096 --k=4096
|
||||
Disposition: Passed
|
||||
Status: Success
|
||||
|
||||
Arguments: --m=4352 --n=4096 --k=4096 --A=f32:column --B=f32:column --C=f32:column --alpha=1 --beta=0 \
|
||||
Arguments: --m=4352 --n=4096 --k=4096 --A=f32:column --B=f32:column --C=f32:column --alpha=1 --beta=0 \
|
||||
--split_k_slices=1 --batch_count=1 --op_class=simt --accum=f32 --cta_m=128 --cta_n=128 --cta_k=8 \
|
||||
--stages=2 --warps_m=2 --warps_n=2 --warps_k=1 --inst_m=1 --inst_n=1 --inst_k=1 --min_cc=50 \
|
||||
--stages=2 --warps_m=2 --warps_n=2 --warps_k=1 --inst_m=1 --inst_n=1 --inst_k=1 --min_cc=50 \
|
||||
--max_cc=1024
|
||||
|
||||
Bytes: 52428800 bytes
|
||||
@ -223,7 +223,7 @@ Note, the arguments which appear in the output may be used as command line param
|
||||
|
||||
To execute kernels targeting Tensor Core operations, supply the flag `--op_class=tensorop` in the command line.
|
||||
|
||||
```
|
||||
```bash
|
||||
$ ./tools/profiler/cutlass_profiler --op_class=tensorop
|
||||
|
||||
=============================
|
||||
@ -235,9 +235,10 @@ $ ./tools/profiler/cutlass_profiler --op_class=tensorop
|
||||
Disposition: Passed
|
||||
Status: Success
|
||||
|
||||
Arguments: --m=4352 --n=4096 --k=4096 --A=f16:column --B=f16:row --C=f16:column --alpha=1 --beta=0 --split_k_slices=1 \
|
||||
--batch_count=1 --op_class=tensorop --accum=f16 --cta_m=128 --cta_n=128 --cta_k=32 --stages=2 \
|
||||
--warps_m=2 --warps_n=2 --warps_k=1 --inst_m=16 --inst_n=8 --inst_k=8 --min_cc=75 --max_cc=1024 \
|
||||
Arguments: --m=4352 --n=4096 --k=4096 --A=f16:column --B=f16:row --C=f16:column --alpha=1 --beta=0 \
|
||||
--op_class=tensorop --accum=f16 --cta_m=128 --cta_n=128 --cta_k=32 --stages=2 \
|
||||
--warps_m=2 --warps_n=2 --warps_k=1 --inst_m=16 --inst_n=8 --inst_k=8 \
|
||||
--min_cc=75 --max_cc=1024
|
||||
|
||||
|
||||
Bytes: 52428800 bytes
|
||||
@ -258,7 +259,7 @@ as an inclusive range with the following syntax `start:end:increment` or simply
|
||||
For example, the following sweeps over the range of the GEMM K dimension from 8 to 4096 in increments
|
||||
of 8 elements.
|
||||
|
||||
```
|
||||
```bash
|
||||
$ ./tools/profiler/cutlass_profiler --kernels=cutlass_simt_sgemm_128x128_nn --m=4352 --n=4096 --k=8:4096:8
|
||||
```
|
||||
|
||||
@ -268,17 +269,18 @@ By default, runtime and computed GFLOP/s are reported for each operation and pro
|
||||
a table of comma separated values are reported at the end of the execution. This may be output to a file
|
||||
with the `--output=<filename.csv>` command line option as shown:
|
||||
|
||||
```
|
||||
$ ./tools/profiler/cutlass_profiler --kernels=cutlass_simt_sgemm_128x128_nn --m=4352 --n=4096 --k=8:4096:8 --output=report.csv
|
||||
```bash
|
||||
$ ./tools/profiler/cutlass_profiler --kernels=cutlass_simt_sgemm_128x128_nn \
|
||||
--m=4352 --n=4096 --k=8:4096:8 --output=report.csv
|
||||
```
|
||||
|
||||
To faclitate generation of pivot tables and charts, additional columns may be prepended with the
|
||||
`--tags=<column>:<value>` option. One or more tags may be specified using a comma-delimited list.
|
||||
|
||||
```
|
||||
$ ./tools/profiler/cutlass_profiler --kernels=cutlass_simt_sgemm_128x128_nn \
|
||||
--m=4352 --n=4096 --k=8:4096:8 --output=report.csv \
|
||||
--tags=cutlass:2.0,date:2019-11-19
|
||||
```bash
|
||||
$ ./tools/profiler/cutlass_profiler --kernels=cutlass_simt_sgemm_128x128_nn \
|
||||
--m=4352 --n=4096 --k=8:4096:8 --output=report.csv \
|
||||
--tags=cutlass:2.0,date:2019-11-19
|
||||
```
|
||||
|
||||
# Copyright
|
||||
|
||||
@ -90,14 +90,15 @@ is able to unroll the loop bodies, map array elements to registers, and construc
|
||||
All loops expected to be unrolled should be annotated with `CUTLASS_PRAGMA_UNROLL` to explicitly direct the compiler
|
||||
to unroll them.
|
||||
|
||||
```
|
||||
```c++
|
||||
int const kN = 8;
|
||||
Array<float, kN> x; // Array we would like to store in registers
|
||||
Array<float, kN> x; // Array we would like to store in registers
|
||||
|
||||
CUTLASS_PRAGMA_UNROLL // Directs the CUDA compiler to unroll this loop.
|
||||
for (int idx = 0; idx < kN; ++idx) { // Loop has constant number of iterations
|
||||
CUTLASS_PRAGMA_UNROLL // Directs the CUDA compiler to unroll this loop.
|
||||
for (int idx = 0; idx < kN; ++idx) { // Loop has constant number of iterations.
|
||||
|
||||
x[i] = float(idx); // Indirect access by induction variable results in direct register access
|
||||
x[i] = float(idx); // Indirect access by induction variable results in
|
||||
// direct register access.
|
||||
}
|
||||
```
|
||||
|
||||
|
||||
@ -90,8 +90,35 @@ $ make test_unit -j
|
||||
[ PASSED ] 946 tests.
|
||||
$
|
||||
```
|
||||
The exact number of tests run is subject to change as we add more functionality.
|
||||
|
||||
No tests should fail.
|
||||
No tests should fail. Unit tests automatically construct the appropriate runtime filters
|
||||
to avoid executing on architectures that do not support all features under test.
|
||||
|
||||
The unit tests are arranged hierarchically mirroring the CUTLASS Template Library. This enables
|
||||
parallelism in building and running tests as well as reducing compilation times when a specific
|
||||
set of tests are desired.
|
||||
|
||||
For example, the following executes strictly the warp-level GEMM tests.
|
||||
```bash
|
||||
$ make test_unit_gemm_warp -j
|
||||
...
|
||||
...
|
||||
[----------] 3 tests from SM75_warp_gemm_tensor_op_congruous_f16
|
||||
[ RUN ] SM75_warp_gemm_tensor_op_congruous_f16.128x128x8_32x128x8_16x8x8
|
||||
[ OK ] SM75_warp_gemm_tensor_op_congruous_f16.128x128x8_32x128x8_16x8x8 (0 ms)
|
||||
[ RUN ] SM75_warp_gemm_tensor_op_congruous_f16.128x128x32_64x64x32_16x8x8
|
||||
[ OK ] SM75_warp_gemm_tensor_op_congruous_f16.128x128x32_64x64x32_16x8x8 (2 ms)
|
||||
[ RUN ] SM75_warp_gemm_tensor_op_congruous_f16.128x128x32_32x32x32_16x8x8
|
||||
[ OK ] SM75_warp_gemm_tensor_op_congruous_f16.128x128x32_32x32x32_16x8x8 (1 ms)
|
||||
[----------] 3 tests from SM75_warp_gemm_tensor_op_congruous_f16 (3 ms total)
|
||||
...
|
||||
...
|
||||
[----------] Global test environment tear-down
|
||||
[==========] 104 tests from 32 test cases ran. (294 ms total)
|
||||
[ PASSED ] 104 tests.
|
||||
[100%] Built target test_unit_gemm_warp
|
||||
```
|
||||
|
||||
## Using CUTLASS within other applications
|
||||
|
||||
|
||||
Loading…
Reference in New Issue
Block a user