118 lines
6.7 KiB
Markdown
118 lines
6.7 KiB
Markdown
[README](../../README.md#documentation) > **CUTLASS 3.0 Design and Hierarchy**
|
|
|
|
# CUTLASS 3.0 Design
|
|
|
|
CUTLASS 3.0 is a major enhancement over the abstractions of CUTLASS 2.x
|
|
and aims to make usage of all layers of the GEMM hierarchy easier and more composable
|
|
while still achieving peak performance on Hardware.
|
|
|
|
## CUTLASS 3.0 design goals
|
|
|
|
CUTLASS 3.0 has the following design goals, in no particular order.
|
|
|
|
- Simplify expressing and manipulating data and thread layouts across
|
|
the GEMM hierarchy with CuTe layouts and layout algebra.
|
|
|
|
- Improve code readability and learning curve by
|
|
reducing the number of named types.
|
|
|
|
- Functional correctness by default,
|
|
actionable static asserts otherwise.
|
|
|
|
- Single, clear points of performance tuning and custom kernel extensions.
|
|
|
|
- Support for NVIDIA Hopper GPUs with great performance using
|
|
features such as Tensor Cores, tensor memory accelerator, and thread block clusters.
|
|
|
|
## A new Conceptual GEMM Hierarchy
|
|
|
|
CUTLASS 2.x decomposes the moving parts of a GEMM operation
|
|
across a hierarchy that closely mirrors the organization of GPU
|
|
architectures. This discussed in detail within the
|
|
[CUTLASS 2.x GEMM API documentation](/media/docs/gemm_api.md).
|
|
This design, however, sometimes results in a coupling that is too tight
|
|
to extend to newer GPU features that might not fit into the same architectural
|
|
hierarchy. For instance, Hopper's warp-group wide instructions do not naturally
|
|
fit into any warp or thread layer GEMM concept in CUTLASS 2.x. Even for Volta tensor cores,
|
|
instructions that atomically exist at the quad-pair granularity are first tiled at
|
|
the warp level before use. This hints at the brittleness of the abstraction power.
|
|
|
|
CUTLASS 3.0 detaches its interface layers from the hardware,
|
|
centering them instead around the natural structure of GEMM algorithms
|
|
not tied to any particular GPU generation.
|
|
This makes CUTLASS's code more robust to GPU architecture evolution,
|
|
less prone to implementation detail leakage, and provides users
|
|
with a consistent interface to hardware acceleration regardless of
|
|
the architecture specific details.
|
|
|
|
The new conceptual GEMM hierarchy is discussed in detail in the dedicated
|
|
[CUTLASS 3.0 GEMM API documentation readme](/media/docs/gemm_api_3x.md),
|
|
along with code examples of the core concepts and types.
|
|
|
|
## Adoption of CuTe Layout and Tensors
|
|
|
|
CUTLASS 3.0 introduces a new core library, CuTe, to describe and manipulate tensors of threads and data.
|
|
CuTe is a collection of C++ CUDA template abstractions for defining and operating on hierarchically multidimensional layouts of threads and data. CuTe provides `Layout` and `Tensor` objects that compactly packages the type, shape, memory space, and layout of data, while performing the complicated indexing for the user.
|
|
|
|
CUTLASS 3.0 adopts CuTe throughout the GEMM hierarchy in its templates, greatly simplifying the design,
|
|
improving code composability, and readability. More documentation specific to CuTe can be found in its [dedicated documentation directory](/media/docs/cute/00_quickstart.md).
|
|
|
|

|
|
|
|
Programming massively parallel systems with various layers of logical thread and data hierarchies is not a trivial task.
|
|
|
|
- `cute::Layout`s always maintain logical consistency of their coordinates,
|
|
allowing us to check pre- and post-conditions at compile time for all static inner loops.
|
|
- Explicit thread to data mapping allows users and kernel authors to inspect and reason about operations
|
|
from a single point in the source code.
|
|
- Layouts provide a single point of performance tuning, as most optimizations can be done by careful
|
|
selection of thread and data layouts.
|
|
- Formalized algebra makes manipulation of and reasoning about thread->data mapping explicit in source code.
|
|
- Single vocabulary type (`cute::Layout`) subsumes every iterator and layout in CUTLASS 2.x CUTLASS 2.x uses many bespoke thread maps, iterators, and data layouts. Iterators are fundamentally 1-D, whereas most layouts we encounter in the GPU hierarchy are fundamentally n-D.
|
|
|
|
## Reducing the number of named types and iterator concepts
|
|
|
|
CUTLASS 2.x design preferred introducing bespoke named types for each
|
|
architecture specific thread and data layout. For instance, `gemm::treadblock` namespace
|
|
contains implementation for `MmaMultistage`, `MmaPlanarComplexMultistage`, `MmaPipelined` etc.
|
|
despite them providing mainloops for GEMMs. To spell these types the same way in generic code,
|
|
CUTLASS 2.x provides aliases through its `default_x_configuration.h` files, however,
|
|
these aliases make the code much harder to read as the user has to perform type substitution
|
|
mentally in order to understand the codebase.
|
|
|
|
CUTLASS 3.0 greatly reduces the number of named types used throughout by
|
|
|
|
- Replacing all iterator concepts for all memory domains with `cute::Tensor`s
|
|
- Dispatching mainloop and epilogue implementations on tag-dispatch policies rather than naming new types
|
|
- Dispatching kernel layer schedules on tag-dispatch policies rather than naming new types
|
|
|
|
Reducing the number of named types has many benefits:
|
|
|
|
- It *makes writing generic code easier*, as the primary type names share the same lexical
|
|
without aliasing through configuration providers.
|
|
- It *flattens the learning curve of CUTLASS* by greatly reducing the mental context required
|
|
as the library only exposes a handful of named types.
|
|
- It *provides a clear, singular extension point* for users to plug in their customizations
|
|
through the dispatch policies.
|
|
|
|
## Correctness by default, Performance through clear, individual points of tuning
|
|
|
|
CUTLASS 2.x maintained its thread layouts as implicit indexing math implemented
|
|
as a part of 1D iterators. This meant that the thread to data layout mapping
|
|
was implicit in the imperative structure of the C++ code itself and did not have
|
|
a formal algebra we could use to manipulate these mappings. Each iterator
|
|
had to re-implement its indexing and mapping logic. This made it hard to learn
|
|
how this mapping was performed for existing iterators, and even harder to
|
|
implement custom layout functions for the core inner loops of a GEMM.
|
|
|
|
CUTLASS 3.0 replaces all iterator concepts from CUTLASS 2.x
|
|
with a single layout type for thread and data tensors.
|
|
CuTe's formalized layout algebra is then used at every layer of
|
|
the GEMM hierarchy to manipulate the mapping between the two.
|
|
CuTe layouts always maintain logical consistency, and for fully static layouts
|
|
(such as in the core unrolled inner loops), provide
|
|
compile time checks that break builds if this consistency is violated.
|
|
In this way, CuTe reifies the thread-to-data-layout mapping,
|
|
makes it easier to write code that is "correct by construction".
|
|
If the code compiles, it's probably correct.
|