Improved formatting, clarity, and content of several documents. (#64)
* Improved formatting, clarity, and content of several documents.
This commit is contained in:
parent
f4d9c8f755
commit
8aca98f9a7
@ -30,6 +30,7 @@ Fei Hu
|
|||||||
Alan Kaatz
|
Alan Kaatz
|
||||||
Tina Li
|
Tina Li
|
||||||
Timmy Liu
|
Timmy Liu
|
||||||
|
Piotr Majcher
|
||||||
Duane Merrill
|
Duane Merrill
|
||||||
Kevin Siu
|
Kevin Siu
|
||||||
Markus Tavenrath
|
Markus Tavenrath
|
||||||
|
|||||||
@ -91,6 +91,7 @@ CUTLASS 2.0 is described in the following documents and the accompanying
|
|||||||
- [GEMM API](media/docs/gemm_api.md) - describes the CUTLASS GEMM model and C++ template concepts
|
- [GEMM API](media/docs/gemm_api.md) - describes the CUTLASS GEMM model and C++ template concepts
|
||||||
- [Code Organization](media/docs/code_organization.md) - describes the organization and contents of the CUTLASS project
|
- [Code Organization](media/docs/code_organization.md) - describes the organization and contents of the CUTLASS project
|
||||||
- [Terminology](media/docs/terminology.md) - describes terms used in the code
|
- [Terminology](media/docs/terminology.md) - describes terms used in the code
|
||||||
|
- [Programming Guidelines](media/docs/programming_guidelines.md) - guidelines for writing efficient modern CUDA C++
|
||||||
- [Fundamental types](media/docs/fundamental_types.md) - describes basic C++ classes used in CUTLASS to represent numeric quantities and arrays
|
- [Fundamental types](media/docs/fundamental_types.md) - describes basic C++ classes used in CUTLASS to represent numeric quantities and arrays
|
||||||
- [Layouts](media/docs/layout.md) - describes layouts of matrices and tensors in memory
|
- [Layouts](media/docs/layout.md) - describes layouts of matrices and tensors in memory
|
||||||
- [Tile Iterators](media/docs/tile_iterator_concept.md) - describes C++ concepts for iterating over tiles of matrices in memory
|
- [Tile Iterators](media/docs/tile_iterator_concept.md) - describes C++ concepts for iterating over tiles of matrices in memory
|
||||||
|
|||||||
@ -82,7 +82,7 @@ scripts to span the design space.
|
|||||||
```
|
```
|
||||||
tools/
|
tools/
|
||||||
library/ # static/dynamic library containing all kernel instantiations of interest
|
library/ # static/dynamic library containing all kernel instantiations of interest
|
||||||
# (with some build-level filter switches to compile specific subsets, perhaps by architecture)
|
# (with some build-level filter switches to compile specific subsets)
|
||||||
|
|
||||||
include/
|
include/
|
||||||
cutlass/
|
cutlass/
|
||||||
@ -99,18 +99,27 @@ tools/
|
|||||||
manifest.py
|
manifest.py
|
||||||
|
|
||||||
src/
|
src/
|
||||||
|
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
||||||
## Examples
|
|
||||||
|
|
||||||
To demonstrate CUTLASS components, several SDK examples are implemented in `examples/`.
|
|
||||||
|
|
||||||
When CMake is executed, the CUTLASS Instance Library generator scripts are executed to construct a set of
|
When CMake is executed, the CUTLASS Instance Library generator scripts are executed to construct a set of
|
||||||
instantiations in `build/tools/library/generated/`.
|
instantiations in `build/tools/library/generated/`.
|
||||||
|
|
||||||
The CUTLASS Profiler is designed to initialize the CUTLASS Instance Library and execute all operations contained therein.
|
### CUTLASS Profiler
|
||||||
|
|
||||||
|
The CUTLASS Profiler is designed to load the CUTLASS Instance Library and execute all operations contained therein.
|
||||||
|
This command-line driven application constructs an execution environment for evaluating functionality and performance.
|
||||||
|
It is implemented in
|
||||||
|
```
|
||||||
|
tools/
|
||||||
|
profiler/
|
||||||
|
```
|
||||||
|
|
||||||
|
and may be built as follows.
|
||||||
|
```
|
||||||
|
$ make cutlass_profiler -j
|
||||||
|
```
|
||||||
|
|
||||||
|
[Further details about the CUTLASS Profiler are described here.](/media/docs/profiler.md)
|
||||||
|
|
||||||
### CUTLASS Utilities
|
### CUTLASS Utilities
|
||||||
|
|
||||||
@ -122,9 +131,13 @@ tools/
|
|||||||
include/
|
include/
|
||||||
cutlass/
|
cutlass/
|
||||||
util/ # CUTLASS Utility companion library
|
util/ # CUTLASS Utility companion library
|
||||||
reference/ # reference implementation of CUTLASS operators - minimal consideration for performance
|
|
||||||
|
reference/ # functional reference implementation of CUTLASS operators
|
||||||
|
# (minimal consideration for performance)
|
||||||
|
|
||||||
detail/
|
detail/
|
||||||
*
|
*
|
||||||
|
|
||||||
device/ # device-side reference implementations of CUTLASS operators
|
device/ # device-side reference implementations of CUTLASS operators
|
||||||
thread/
|
thread/
|
||||||
kernel/
|
kernel/
|
||||||
@ -134,28 +147,36 @@ tools/
|
|||||||
*
|
*
|
||||||
```
|
```
|
||||||
|
|
||||||
[More details about CUTLASS Utilities may be found here.](media/docs/utilities.md)
|
[More details about CUTLASS Utilities may be found here.](/media/docs/utilities.md)
|
||||||
|
|
||||||
### CUTLASS Profiler
|
|
||||||
|
|
||||||
This is application constructs an execution environment for evaluating the functionality and performance of
|
|
||||||
CUTLASS components. It is implemented in
|
|
||||||
```
|
|
||||||
tools/
|
|
||||||
profiler/
|
|
||||||
```
|
|
||||||
|
|
||||||
and may be built as follows.
|
|
||||||
```
|
|
||||||
$ make cutlass_profiler -j
|
|
||||||
```
|
|
||||||
|
|
||||||
[Further details about the CUTLASS Profiler are described here.](media/docs/profiler.md)
|
|
||||||
|
|
||||||
## Examples
|
## Examples
|
||||||
|
|
||||||
To demonstrate CUTLASS components, several SDK examples are implemented in `examples/`.
|
To demonstrate CUTLASS components, several SDK examples are implemented in `examples/`.
|
||||||
|
|
||||||
|
CUTLASS SDK examples apply CUTLASS templates to implement basic computations.
|
||||||
|
|
||||||
|
```
|
||||||
|
examples/
|
||||||
|
00_basic_gemm/ # launches a basic GEMM with single precision inputs and outputs
|
||||||
|
|
||||||
|
01_cutlass_utilities/ # demonstrates CUTLASS Utilities for allocating and initializing tensors
|
||||||
|
|
||||||
|
02_dump_reg_smem/ # debugging utilities for printing register and shared memory contents
|
||||||
|
|
||||||
|
03_visualize_layout/ # utility for visualizing all layout functions in CUTLASS
|
||||||
|
|
||||||
|
04_tile_iterator/ # example demonstrating an iterator over tiles in memory
|
||||||
|
|
||||||
|
05_batched_gemm/ # example demonstrating CUTLASS's batched strided GEMM operation
|
||||||
|
|
||||||
|
06_splitK_gemm/ # exmaple demonstrating CUTLASS's Split-K parallel reduction kernel
|
||||||
|
|
||||||
|
07_volta_tensorop_gemm/ # example demonstrating mixed precision GEMM using Volta Tensor Cores
|
||||||
|
|
||||||
|
08_turing_tensorop_gemm/ # example demonstrating integer GEMM using Turing Tensor Cores
|
||||||
|
```
|
||||||
|
|
||||||
## Media
|
## Media
|
||||||
|
|
||||||
This directory contains documentation, images, and performance result data which accompanies the CUTLASS library and components.
|
This directory contains documentation, images, and performance result data which accompanies the CUTLASS library and components.
|
||||||
|
|||||||
@ -1,5 +1,3 @@
|
|||||||

|
|
||||||
|
|
||||||
# CUTLASS 2.0
|
# CUTLASS 2.0
|
||||||
|
|
||||||
_CUTLASS 2.0 - November 2019_
|
_CUTLASS 2.0 - November 2019_
|
||||||
|
|||||||
@ -173,7 +173,8 @@ struct Mma {
|
|||||||
/// Fragment object loaded from IteratorB (concept: Array<ElementB, ..>)
|
/// Fragment object loaded from IteratorB (concept: Array<ElementB, ..>)
|
||||||
struct FragmentB;
|
struct FragmentB;
|
||||||
|
|
||||||
/// Iterator of C operand in shared memory - satisfies: ReadableRandomAccessTileIteratorConcept | WriteableRandomAccessTileIteratorConcept
|
/// Iterator of C operand in shared memory -
|
||||||
|
/// satisfies: ReadableRandomAccessTileIteratorConcept | WriteableRandomAccessTileIteratorConcept
|
||||||
struct IteratorC;
|
struct IteratorC;
|
||||||
|
|
||||||
/// Fragment object loaded from IteratorC (concept: Array<ElementC, ..>)
|
/// Fragment object loaded from IteratorC (concept: Array<ElementC, ..>)
|
||||||
@ -322,7 +323,8 @@ struct Mma {
|
|||||||
/// Fragment object loaded from IteratorB (concept: Array<ElementB, ..>)
|
/// Fragment object loaded from IteratorB (concept: Array<ElementB, ..>)
|
||||||
struct FragmentB;
|
struct FragmentB;
|
||||||
|
|
||||||
/// Iterator of C operand in shared memory - satisfies: ReadableRandomAccessTileIteratorConcept | WriteableRandomAccessTileIteratorConcept
|
/// Iterator of C operand in shared memory -
|
||||||
|
/// satisfies: ReadableRandomAccessTileIteratorConcept | WriteableRandomAccessTileIteratorConcept
|
||||||
struct IteratorC;
|
struct IteratorC;
|
||||||
|
|
||||||
/// Fragment object loaded from IteratorC (concept: Array<ElementC, ..>)
|
/// Fragment object loaded from IteratorC (concept: Array<ElementC, ..>)
|
||||||
|
|||||||
@ -7,27 +7,27 @@
|
|||||||
The CUTLASS Profiler is a command-line driven test and profiling environment for CUTLASS computations
|
The CUTLASS Profiler is a command-line driven test and profiling environment for CUTLASS computations
|
||||||
defined in the CUTLASS Instance Library.
|
defined in the CUTLASS Instance Library.
|
||||||
|
|
||||||
The CUTLASS Profiler sources are stored in
|
The CUTLASS Profiler may be compiled with:
|
||||||
```
|
```bash
|
||||||
tools/
|
|
||||||
profiler/
|
|
||||||
```
|
|
||||||
|
|
||||||
and may be compiled as follows.
|
|
||||||
```
|
|
||||||
$ make cutlass_profiler -j
|
$ make cutlass_profiler -j
|
||||||
```
|
```
|
||||||
|
|
||||||
To limit compilation time, only one tile size (128x128) is instantiated for each data type, math instruction, and layout.
|
To limit compilation time, only one tile size (128x128) is instantiated for each data type, math instruction, and layout.
|
||||||
To instantiate all sizes, set the following environment variable when running CMake from an empty `build/` directory.
|
To instantiate all sizes, set the following environment variable when running CMake from an empty `build/` directory.
|
||||||
```
|
```bash
|
||||||
$ cmake .. -DCUTLASS_NVCC_ARCHS=75 -DCUTLASS_LIBRARY_KERNELS=all
|
$ cmake .. -DCUTLASS_NVCC_ARCHS=75 -DCUTLASS_LIBRARY_KERNELS=all
|
||||||
...
|
...
|
||||||
$ make cutlass_profiler -j
|
$ make cutlass_profiler -j
|
||||||
```
|
```
|
||||||
|
|
||||||
The CUTLASS Profiler usage statement may be obtained by executing `cutlass_profiler --help` and appears as follows.
|
The CUTLASS Profiler sources are stored in
|
||||||
|
```bash
|
||||||
|
tools/
|
||||||
|
profiler/
|
||||||
```
|
```
|
||||||
|
|
||||||
|
The CUTLASS Profiler usage statement may be obtained by executing `cutlass_profiler --help` and appears as follows.
|
||||||
|
```bash
|
||||||
CUTLASS Performance Tool
|
CUTLASS Performance Tool
|
||||||
usage:
|
usage:
|
||||||
cutlass_profiler [options]
|
cutlass_profiler [options]
|
||||||
@ -122,7 +122,7 @@ Example:
|
|||||||
The complete set of arguments available to each operation may be viewed by specifying the operation name
|
The complete set of arguments available to each operation may be viewed by specifying the operation name
|
||||||
in addition to `--help`. The argument flags and their aliases usable for GEMM appear as follows.
|
in addition to `--help`. The argument flags and their aliases usable for GEMM appear as follows.
|
||||||
|
|
||||||
```
|
```bash
|
||||||
$ ./tools/profiler/cutlass_profiler --operation=gemm --help
|
$ ./tools/profiler/cutlass_profiler --operation=gemm --help
|
||||||
|
|
||||||
GEMM
|
GEMM
|
||||||
@ -190,7 +190,7 @@ Test your changes to gemm kernels with a quick functional test and save results
|
|||||||
## Example SGEMM
|
## Example SGEMM
|
||||||
|
|
||||||
Example command line for profiling SGEMM kernels is as follows:
|
Example command line for profiling SGEMM kernels is as follows:
|
||||||
```
|
```bash
|
||||||
$ ./tools/profiler/cutlass_profiler --kernels=sgemm --m=4352 --n=4096 --k=4096
|
$ ./tools/profiler/cutlass_profiler --kernels=sgemm --m=4352 --n=4096 --k=4096
|
||||||
|
|
||||||
=============================
|
=============================
|
||||||
@ -223,7 +223,7 @@ Note, the arguments which appear in the output may be used as command line param
|
|||||||
|
|
||||||
To execute kernels targeting Tensor Core operations, supply the flag `--op_class=tensorop` in the command line.
|
To execute kernels targeting Tensor Core operations, supply the flag `--op_class=tensorop` in the command line.
|
||||||
|
|
||||||
```
|
```bash
|
||||||
$ ./tools/profiler/cutlass_profiler --op_class=tensorop
|
$ ./tools/profiler/cutlass_profiler --op_class=tensorop
|
||||||
|
|
||||||
=============================
|
=============================
|
||||||
@ -235,9 +235,10 @@ $ ./tools/profiler/cutlass_profiler --op_class=tensorop
|
|||||||
Disposition: Passed
|
Disposition: Passed
|
||||||
Status: Success
|
Status: Success
|
||||||
|
|
||||||
Arguments: --m=4352 --n=4096 --k=4096 --A=f16:column --B=f16:row --C=f16:column --alpha=1 --beta=0 --split_k_slices=1 \
|
Arguments: --m=4352 --n=4096 --k=4096 --A=f16:column --B=f16:row --C=f16:column --alpha=1 --beta=0 \
|
||||||
--batch_count=1 --op_class=tensorop --accum=f16 --cta_m=128 --cta_n=128 --cta_k=32 --stages=2 \
|
--op_class=tensorop --accum=f16 --cta_m=128 --cta_n=128 --cta_k=32 --stages=2 \
|
||||||
--warps_m=2 --warps_n=2 --warps_k=1 --inst_m=16 --inst_n=8 --inst_k=8 --min_cc=75 --max_cc=1024 \
|
--warps_m=2 --warps_n=2 --warps_k=1 --inst_m=16 --inst_n=8 --inst_k=8 \
|
||||||
|
--min_cc=75 --max_cc=1024
|
||||||
|
|
||||||
|
|
||||||
Bytes: 52428800 bytes
|
Bytes: 52428800 bytes
|
||||||
@ -258,7 +259,7 @@ as an inclusive range with the following syntax `start:end:increment` or simply
|
|||||||
For example, the following sweeps over the range of the GEMM K dimension from 8 to 4096 in increments
|
For example, the following sweeps over the range of the GEMM K dimension from 8 to 4096 in increments
|
||||||
of 8 elements.
|
of 8 elements.
|
||||||
|
|
||||||
```
|
```bash
|
||||||
$ ./tools/profiler/cutlass_profiler --kernels=cutlass_simt_sgemm_128x128_nn --m=4352 --n=4096 --k=8:4096:8
|
$ ./tools/profiler/cutlass_profiler --kernels=cutlass_simt_sgemm_128x128_nn --m=4352 --n=4096 --k=8:4096:8
|
||||||
```
|
```
|
||||||
|
|
||||||
@ -268,14 +269,15 @@ By default, runtime and computed GFLOP/s are reported for each operation and pro
|
|||||||
a table of comma separated values are reported at the end of the execution. This may be output to a file
|
a table of comma separated values are reported at the end of the execution. This may be output to a file
|
||||||
with the `--output=<filename.csv>` command line option as shown:
|
with the `--output=<filename.csv>` command line option as shown:
|
||||||
|
|
||||||
```
|
```bash
|
||||||
$ ./tools/profiler/cutlass_profiler --kernels=cutlass_simt_sgemm_128x128_nn --m=4352 --n=4096 --k=8:4096:8 --output=report.csv
|
$ ./tools/profiler/cutlass_profiler --kernels=cutlass_simt_sgemm_128x128_nn \
|
||||||
|
--m=4352 --n=4096 --k=8:4096:8 --output=report.csv
|
||||||
```
|
```
|
||||||
|
|
||||||
To faclitate generation of pivot tables and charts, additional columns may be prepended with the
|
To faclitate generation of pivot tables and charts, additional columns may be prepended with the
|
||||||
`--tags=<column>:<value>` option. One or more tags may be specified using a comma-delimited list.
|
`--tags=<column>:<value>` option. One or more tags may be specified using a comma-delimited list.
|
||||||
|
|
||||||
```
|
```bash
|
||||||
$ ./tools/profiler/cutlass_profiler --kernels=cutlass_simt_sgemm_128x128_nn \
|
$ ./tools/profiler/cutlass_profiler --kernels=cutlass_simt_sgemm_128x128_nn \
|
||||||
--m=4352 --n=4096 --k=8:4096:8 --output=report.csv \
|
--m=4352 --n=4096 --k=8:4096:8 --output=report.csv \
|
||||||
--tags=cutlass:2.0,date:2019-11-19
|
--tags=cutlass:2.0,date:2019-11-19
|
||||||
|
|||||||
@ -90,14 +90,15 @@ is able to unroll the loop bodies, map array elements to registers, and construc
|
|||||||
All loops expected to be unrolled should be annotated with `CUTLASS_PRAGMA_UNROLL` to explicitly direct the compiler
|
All loops expected to be unrolled should be annotated with `CUTLASS_PRAGMA_UNROLL` to explicitly direct the compiler
|
||||||
to unroll them.
|
to unroll them.
|
||||||
|
|
||||||
```
|
```c++
|
||||||
int const kN = 8;
|
int const kN = 8;
|
||||||
Array<float, kN> x; // Array we would like to store in registers
|
Array<float, kN> x; // Array we would like to store in registers
|
||||||
|
|
||||||
CUTLASS_PRAGMA_UNROLL // Directs the CUDA compiler to unroll this loop.
|
CUTLASS_PRAGMA_UNROLL // Directs the CUDA compiler to unroll this loop.
|
||||||
for (int idx = 0; idx < kN; ++idx) { // Loop has constant number of iterations
|
for (int idx = 0; idx < kN; ++idx) { // Loop has constant number of iterations.
|
||||||
|
|
||||||
x[i] = float(idx); // Indirect access by induction variable results in direct register access
|
x[i] = float(idx); // Indirect access by induction variable results in
|
||||||
|
// direct register access.
|
||||||
}
|
}
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|||||||
@ -90,8 +90,35 @@ $ make test_unit -j
|
|||||||
[ PASSED ] 946 tests.
|
[ PASSED ] 946 tests.
|
||||||
$
|
$
|
||||||
```
|
```
|
||||||
|
The exact number of tests run is subject to change as we add more functionality.
|
||||||
|
|
||||||
No tests should fail.
|
No tests should fail. Unit tests automatically construct the appropriate runtime filters
|
||||||
|
to avoid executing on architectures that do not support all features under test.
|
||||||
|
|
||||||
|
The unit tests are arranged hierarchically mirroring the CUTLASS Template Library. This enables
|
||||||
|
parallelism in building and running tests as well as reducing compilation times when a specific
|
||||||
|
set of tests are desired.
|
||||||
|
|
||||||
|
For example, the following executes strictly the warp-level GEMM tests.
|
||||||
|
```bash
|
||||||
|
$ make test_unit_gemm_warp -j
|
||||||
|
...
|
||||||
|
...
|
||||||
|
[----------] 3 tests from SM75_warp_gemm_tensor_op_congruous_f16
|
||||||
|
[ RUN ] SM75_warp_gemm_tensor_op_congruous_f16.128x128x8_32x128x8_16x8x8
|
||||||
|
[ OK ] SM75_warp_gemm_tensor_op_congruous_f16.128x128x8_32x128x8_16x8x8 (0 ms)
|
||||||
|
[ RUN ] SM75_warp_gemm_tensor_op_congruous_f16.128x128x32_64x64x32_16x8x8
|
||||||
|
[ OK ] SM75_warp_gemm_tensor_op_congruous_f16.128x128x32_64x64x32_16x8x8 (2 ms)
|
||||||
|
[ RUN ] SM75_warp_gemm_tensor_op_congruous_f16.128x128x32_32x32x32_16x8x8
|
||||||
|
[ OK ] SM75_warp_gemm_tensor_op_congruous_f16.128x128x32_32x32x32_16x8x8 (1 ms)
|
||||||
|
[----------] 3 tests from SM75_warp_gemm_tensor_op_congruous_f16 (3 ms total)
|
||||||
|
...
|
||||||
|
...
|
||||||
|
[----------] Global test environment tear-down
|
||||||
|
[==========] 104 tests from 32 test cases ran. (294 ms total)
|
||||||
|
[ PASSED ] 104 tests.
|
||||||
|
[100%] Built target test_unit_gemm_warp
|
||||||
|
```
|
||||||
|
|
||||||
## Using CUTLASS within other applications
|
## Using CUTLASS within other applications
|
||||||
|
|
||||||
|
|||||||
Loading…
Reference in New Issue
Block a user