
# CUTLASS Code Organization

This document describes the layout of the CUTLASS repository. The main components are:

* **CUTLASS Template Library** - CUDA Templates for Linear Algebra Subroutines and Solvers (header only)
* **CUTLASS Utilities** - additional templates that support tests, examples, and client applications
* **CUTLASS Instance Library** - instantiations of CUTLASS templates covering the design space
* **CUTLASS Profiler** - application for evaluating the functionality and performance of CUTLASS components
* **Examples** - SDK examples of CUTLASS Template Library and components
* **Media** - supporting documentation and media content
* **Tests** - test components for CUTLASS Template Library and tools

## CUTLASS Template Library

CUDA Templates for Linear Algebra Subroutines and Solvers is a library of CUDA C++ template classes for
performing efficient matrix computations on NVIDIA GPUs.

Like NVIDIA CUB, the components of CUTLASS are organized hierarchically based on the scope of cooperative
elements. For example, warp-level GEMM components perform a matrix multiply collectively by the
set of threads within a warp. The following figure illustrates each layer.

Components are designed to be usable by client applications accessing functionality at each scope.
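
The nesting of scopes described above can be pictured with a small host-only sketch. It is not CUTLASS code: `TileShape`, `thread_mma`, `warp_mma`, `threadblock_mma`, and `device_gemm` are hypothetical names, the loops run sequentially on the CPU, and all matrices are assumed to be row-major. It only shows how each scope delegates a smaller tile to the scope below it.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical tile extents (rows x columns); not a CUTLASS type.
struct TileShape {
  int m;
  int n;
};

// Thread scope: one output element, accumulated over the K dimension.
float thread_mma(const std::vector<float>& A, const std::vector<float>& B,
                 int row, int col, int N, int K) {
  float acc = 0.0f;
  for (int k = 0; k < K; ++k) {
    acc += A[row * K + k] * B[k * N + col];
  }
  return acc;
}

// Warp scope: one warp tile, computed collectively by its "threads".
void warp_mma(const std::vector<float>& A, const std::vector<float>& B,
              std::vector<float>& C, int row0, int col0, TileShape warp,
              int M, int N, int K) {
  for (int i = row0; i < std::min(row0 + warp.m, M); ++i) {
    for (int j = col0; j < std::min(col0 + warp.n, N); ++j) {
      C[i * N + j] = thread_mma(A, B, i, j, N, K);
    }
  }
}

// Threadblock scope: one block tile, partitioned into warp tiles.
void threadblock_mma(const std::vector<float>& A, const std::vector<float>& B,
                     std::vector<float>& C, int row0, int col0,
                     TileShape block, TileShape warp, int M, int N, int K) {
  for (int i = row0; i < std::min(row0 + block.m, M); i += warp.m) {
    for (int j = col0; j < std::min(col0 + block.n, N); j += warp.n) {
      warp_mma(A, B, C, i, j, warp, M, N, K);
    }
  }
}

// Device scope: covers the whole M x N problem with block tiles.
std::vector<float> device_gemm(const std::vector<float>& A,
                               const std::vector<float>& B, int M, int N, int K,
                               TileShape block, TileShape warp) {
  // Block tiles must divide evenly into warp tiles so warp tiles never overlap.
  assert(warp.m > 0 && warp.n > 0);
  assert(block.m % warp.m == 0 && block.n % warp.n == 0);
  std::vector<float> C(static_cast<std::size_t>(M) * N, 0.0f);
  for (int i = 0; i < M; i += block.m) {
    for (int j = 0; j < N; j += block.n) {
      threadblock_mma(A, B, C, i, j, block, warp, M, N, K);
    }
  }
  return C;
}
```

A client that only needs the innermost piece could call `thread_mma` directly, which mirrors the idea that functionality is reachable at every scope rather than only through the top-level entry point.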

CUTLASS Templates are implemented by header files in the following directory structure:

```
include/                     # Top-level include directory. Client applications should target this path.

  cutlass/                   # CUDA Templates for Linear Algebra Subroutines and Solvers - headers only

    arch/                    # direct exposure of architecture features (including instruction-level GEMMs)
      *
    gemm/                    # code specialized for general matrix product computations
      thread/                #   thread-level operators
      warp/                  #   warp-level operators
      threadblock/           #   CTA-level operators
      kernel/                #   CUDA kernel entry points
      device/                #   launches kernel(s) over a full device
      *                      # scope-agnostic components and basic vocabulary type definitions for GEMM

    layout/                  # layout definitions for matrices, tensors, and other mathematical objects in memory
      *

    reduction/               # bandwidth-limited reduction kernels that do not fit the "gemm" models
      thread/                #   thread-level operators
      warp/                  #   warp-level operators
      threadblock/           #   CTA-level operators
      kernel/                #   CUDA kernel entry points
      device/                #   launches kernel(s) over a full device
      *                      # scope-agnostic components and basic vocabulary type definitions

    transform/               # code specialized for layout, type, and domain transformations
      thread/                #   thread-level operators
      warp/                  #   warp-level operators
      threadblock/           #   CTA-level operators
      kernel/                #   CUDA kernel entry points
      device/                #   launches kernel(s) over a full device
      *                      # scope-agnostic components and basic vocabulary type definitions

    util/                    # miscellaneous CUTLASS components
      *
    *                        # core vocabulary types and fundamental arithmetic operators
```
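
To make the role of the `layout/` directory concrete, the sketch below shows what a layout definition amounts to: a function from a logical (row, column) coordinate to a linear offset in memory. `RowMajorSketch`, `ColumnMajorSketch`, and `load` are hypothetical names used only for illustration and are not the types defined under `include/cutlass/layout/`.

```cpp
#include <cstdint>

// Hypothetical row-major layout: consecutive columns are adjacent in memory.
struct RowMajorSketch {
  std::int64_t stride;  // elements between the starts of consecutive rows

  std::int64_t operator()(int row, int col) const {
    return row * stride + col;
  }
};

// Hypothetical column-major layout: consecutive rows are adjacent in memory.
struct ColumnMajorSketch {
  std::int64_t stride;  // elements between the starts of consecutive columns

  std::int64_t operator()(int row, int col) const {
    return col * stride + row;
  }
};

// Code written against the layout concept works with either mapping.
template <typename Layout>
float load(const float* data, Layout layout, int row, int col) {
  return data[layout(row, col)];
}
```

Keeping the mapping in a small function object lets the same operator be instantiated for different memory arrangements without changing its body.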

See [Programming Guidelines](/media/docs/programming_guidelines.md) for further details about
conventions and design patterns used throughout CUTLASS.

## Tools

The `tools/` directory contains clients of the CUTLASS Template Library and includes the following.

### CUTLASS Instance Library

The CUTLASS Instance Library contains instantiations of the above CUTLASS templates covering supported configurations,
data types, block structure, and tile sizes. These instantiations are procedurally generated using a set of
scripts to span the design space.

```
tools/
  library/                   # static/dynamic library containing all kernel instantiations of interest
                             # (with some build-level filter switches to compile specific subsets, perhaps by architecture)

    include/
      cutlass/
        library/             # header files for CUTLASS Deliverables Library (in cutlass::library:: namespace)

          library.h          # defines enums and structs to describe the tiled structure of operator instances
          manifest.h         # collection of all instances

    scripts/                 # scripts to procedurally generate CUTLASS template instances

      gemm_operations.py
      library.py
      generator.py           # entry point of procedural generation scripts - invoked by cmake
      manifest.py

    src/
```
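
As a rough picture of what "spanning the design space" means, the sketch below enumerates the Cartesian product of a few configuration axes and emits one name per combination. It is an assumption-laden illustration, not the generator itself: the real scripts are Python, and `TileDescription`, `enumerate_instances`, the axis values, and the naming scheme are all hypothetical.

```cpp
#include <string>
#include <vector>

// Hypothetical threadblock tile description (M x N x K).
struct TileDescription {
  int m;
  int n;
  int k;
};

// Emits one hypothetical instance name per (element type, layout, tile) combination.
std::vector<std::string> enumerate_instances(
    const std::vector<std::string>& element_types,
    const std::vector<std::string>& layouts,
    const std::vector<TileDescription>& tiles) {
  std::vector<std::string> names;
  for (const std::string& type : element_types) {
    for (const std::string& layout : layouts) {
      for (const TileDescription& tile : tiles) {
        names.push_back("gemm_" + type + "_" + layout + "_" +
                        std::to_string(tile.m) + "x" +
                        std::to_string(tile.n) + "x" +
                        std::to_string(tile.k));
      }
    }
  }
  return names;
}
```

The number of instances grows multiplicatively with each axis, which is one reason build-level filter switches for compiling specific subsets are useful.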

When CMake is executed, the CUTLASS Instance Library generator scripts are executed to construct a set of
instantiations in `build/tools/library/generated/`.

The CUTLASS Profiler is designed to initialize the CUTLASS Instance Library and execute all operations contained therein.

### CUTLASS Utilities

`tools/util/` defines a companion library of headers and sources that support the CUTLASS test programs, examples, and other client applications. Its structure is as follows:

```
tools/
  util/
    include/
      cutlass/
        util/                 # CUTLASS Utility companion library

          reference/          #  reference implementation of CUTLASS operators - minimal consideration for performance

            detail/
              *

            device/           #  device-side reference implementations of CUTLASS operators
              thread/
              kernel/
                *
            host/             #  host-side reference implementations of CUTLASS operators
              *
          *
```
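
The idea behind a reference implementation with "minimal consideration for performance" can be sketched as a naive host-side GEMM plus a tolerance check, against which an optimized kernel's output might be compared. `reference_gemm_host` and `all_close` are hypothetical helpers written for this illustration, not functions from `tools/util/`, and row-major storage is assumed throughout.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Naive reference: D = alpha * (A * B) + beta * C, all matrices row-major.
// A is M x K, B is K x N, C and D are M x N.
std::vector<float> reference_gemm_host(int M, int N, int K, float alpha,
                                       const std::vector<float>& A,
                                       const std::vector<float>& B, float beta,
                                       const std::vector<float>& C) {
  std::vector<float> D(static_cast<std::size_t>(M) * N);
  for (int i = 0; i < M; ++i) {
    for (int j = 0; j < N; ++j) {
      // Accumulate in double so the reference is less sensitive to rounding.
      double acc = 0.0;
      for (int k = 0; k < K; ++k) {
        acc += static_cast<double>(A[i * K + k]) * B[k * N + j];
      }
      D[i * N + j] = alpha * static_cast<float>(acc) + beta * C[i * N + j];
    }
  }
  return D;
}

// Element-wise comparison with an absolute tolerance.
bool all_close(const std::vector<float>& x, const std::vector<float>& y,
               float tolerance) {
  if (x.size() != y.size()) {
    return false;
  }
  for (std::size_t i = 0; i < x.size(); ++i) {
    if (std::fabs(x[i] - y[i]) > tolerance) {
      return false;
    }
  }
  return true;
}
```

A test would typically compute the same problem with the kernel under test, copy the result to the host, and pass both buffers to a comparison such as `all_close`.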

[More details about CUTLASS Utilities may be found here.](/media/docs/utilities.md)

### CUTLASS Profiler

This application constructs an execution environment for evaluating the functionality and performance of
CUTLASS components. It is implemented in

```
tools/
  profiler/
```

and may be built as follows.

```
$ make cutlass_profiler -j
```

[Further details about the CUTLASS Profiler are described here.](/media/docs/profiler.md)

## Examples

To demonstrate CUTLASS components, several SDK examples are implemented in `examples/`.

## Media

This directory contains documentation, images, and performance result data that accompany the CUTLASS library and components.

## Tests

Test programs for CUTLASS. Tests are organized hierarchically, mirroring the organization of source files.

```
test/                        # unit tests for CUTLASS Template Library
  unit/
    arch/
    core/
    gemm/
      device/
      kernel/
      thread/
      threadblock/
      warp/
    reduction/
      kernel/
      thread/
    transform/
      threadblock/
      *
```

Tests can be built and run at the top-level scope by invoking `make test_unit` or by building
and explicitly executing each individual target, e.g. `cutlass_test_unit_gemm_device`.

Tests specify GTest filter strings so that they run only on architectures where they are
expected to pass. Thus, no tests should fail. The actual number
of tests run may vary over time as more are added.
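
One way such a filter string could be assembled is sketched below. GoogleTest accepts `--gtest_filter=POSITIVE[-NEGATIVE]`, where patterns are separated by `:` and `*` is a wildcard. The helper `make_gtest_filter` and the assumption that test names carry an `SM<nn>_` architecture prefix are hypothetical and are used here only to illustrate the mechanism, not to describe how the CUTLASS build configures its tests.

```cpp
#include <string>
#include <vector>

// Builds a GoogleTest filter that runs everything except tests whose
// (assumed) "SM<nn>_" name prefix targets an architecture newer than the device.
std::string make_gtest_filter(int device_sm, const std::vector<int>& known_sms) {
  std::string negative;
  for (int sm : known_sms) {
    if (sm > device_sm) {
      if (!negative.empty()) {
        negative += ":";  // negative patterns are colon-separated
      }
      negative += "SM" + std::to_string(sm) + "_*";
    }
  }
  std::string filter = "--gtest_filter=*";
  if (!negative.empty()) {
    filter += "-" + negative;  // everything after '-' is excluded
  }
  return filter;
}
```

Passing the resulting string to a test executable skips the excluded tests entirely, so they are not reported as failures on devices that cannot run them.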
# Copyright

Copyright (c) 2017-2019, NVIDIA CORPORATION. All rights reserved.

```
  Redistribution and use in source and binary forms, with or without modification, are permitted
  provided that the following conditions are met:
      * Redistributions of source code must retain the above copyright notice, this list of
        conditions and the following disclaimer.
      * Redistributions in binary form must reproduce the above copyright notice, this list of
        conditions and the following disclaimer in the documentation and/or other materials
        provided with the distribution.
      * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used
        to endorse or promote products derived from this software without specific prior written
        permission.

  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
  IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
  FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE
  FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
  BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
  OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
  STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
```