6.7 KiB
		
	
	
	
	
	
	
	
			
		
		
	
	README > CUTLASS 3.0 Design and Hierarchy
CUTLASS 3.0 Design
CUTLASS 3.0 is a major enhancement over the abstractions of CUTLASS 2.x and aims to make usage of all layers of the GEMM hierarchy easier and more composable while still achieving peak performance on Hardware.
CUTLASS 3.0 design goals
CUTLASS 3.0 has the following design goals, in no particular order.
- 
Simplify expressing and manipulating data and thread layouts across the GEMM hierarchy with CuTe layouts and layout algebra. 
- 
Improve code readability and learning curve by reducing the number of named types. 
- 
Functional correctness by default, actionable static asserts otherwise. 
- 
Single, clear points of performance tuning and custom kernel extensions. 
- 
Support for NVIDIA Hopper GPUs with great performance using features such as Tensor Cores, tensor memory accelerator, and thread block clusters. 
A new Conceptual GEMM Hierarchy
CUTLASS 2.x decomposes the moving parts of a GEMM operation across a hierarchy that closely mirrors the organization of GPU architectures. This discussed in detail within the CUTLASS 2.x GEMM API documentation. This design, however, sometimes results in a coupling that is too tight to extend to newer GPU features that might not fit into the same architectural hierarchy. For instance, Hopper's warp-group wide instructions do not naturally fit into any warp or thread layer GEMM concept in CUTLASS 2.x. Even for Volta tensor cores, instructions that atomically exist at the quad-pair granularity are first tiled at the warp level before use. This hints at the brittleness of the abstraction power.
CUTLASS 3.0 detaches its interface layers from the hardware, centering them instead around the natural structure of GEMM algorithms not tied to any particular GPU generation. This makes CUTLASS's code more robust to GPU architecture evolution, less prone to implementation detail leakage, and provides users with a consistent interface to hardware acceleration regardless of the architecture specific details.
The new conceptual GEMM hierarchy is discussed in detail in the dedicated CUTLASS 3.0 GEMM API documentation readme, along with code examples of the core concepts and types.
Adoption of CuTe Layout and Tensors
CUTLASS 3.0 introduces a new core library, CuTe, to describe and manipulate tensors of threads and data.
CuTe is a collection of C++ CUDA template abstractions for defining and operating on hierarchically multidimensional layouts of threads and data. CuTe provides Layout and Tensor objects that compactly packages the type, shape, memory space, and layout of data, while performing the complicated indexing for the user.
CUTLASS 3.0 adopts CuTe throughout the GEMM hierarchy in its templates, greatly simplifying the design, improving code composability, and readability. More documentation specific to CuTe can be found in its dedicated documentation directory.
Programming massively parallel systems with various layers of logical thread and data hierarchies is not a trivial task.
- cute::Layouts always maintain logical consistency of their coordinates, allowing us to check pre- and post-conditions at compile time for all static inner loops.
- Explicit thread to data mapping allows users and kernel authors to inspect and reason about operations from a single point in the source code.
- Layouts provide a single point of performance tuning, as most optimizations can be done by careful selection of thread and data layouts.
- Formalized algebra makes manipulation of and reasoning about thread->data mapping explicit in source code.
- Single vocabulary type (cute::Layout) subsumes every iterator and layout in CUTLASS 2.x CUTLASS 2.x uses many bespoke thread maps, iterators, and data layouts. Iterators are fundamentally 1-D, whereas most layouts we encounter in the GPU hierarchy are fundamentally n-D.
Reducing the number of named types and iterator concepts
CUTLASS 2.x design preferred introducing bespoke named types for each
architecture specific thread and data layout. For instance, gemm::treadblock namespace
contains implementation for MmaMultistage, MmaPlanarComplexMultistage, MmaPipelined etc.
despite them providing mainloops for GEMMs. To spell these types the same way in generic code,
CUTLASS 2.x provides aliases through its default_x_configuration.h files, however,
these aliases make the code much harder to read as the user has to perform type substitution
mentally in order to understand the codebase.
CUTLASS 3.0 greatly reduces the number of named types used throughout by
- Replacing all iterator concepts for all memory domains with cute::Tensors
- Dispatching mainloop and epilogue implementations on tag-dispatch policies rather than naming new types
- Dispatching kernel layer schedules on tag-dispatch policies rather than naming new types
Reducing the number of named types has many benefits:
- It makes writing generic code easier, as the primary type names share the same lexical without aliasing through configuration providers.
- It flattens the learning curve of CUTLASS by greatly reducing the mental context required as the library only exposes a handful of named types.
- It provides a clear, singular extension point for users to plug in their customizations through the dispatch policies.
Correctness by default, Performance through clear, individual points of tuning
CUTLASS 2.x maintained its thread layouts as implicit indexing math implemented as a part of 1D iterators. This meant that the thread to data layout mapping was implicit in the imperative structure of the C++ code itself and did not have a formal algebra we could use to manipulate these mappings. Each iterator had to re-implement its indexing and mapping logic. This made it hard to learn how this mapping was performed for existing iterators, and even harder to implement custom layout functions for the core inner loops of a GEMM.
CUTLASS 3.0 replaces all iterator concepts from CUTLASS 2.x with a single layout type for thread and data tensors. CuTe's formalized layout algebra is then used at every layer of the GEMM hierarchy to manipulate the mapping between the two. CuTe layouts always maintain logical consistency, and for fully static layouts (such as in the core unrolled inner loops), provide compile time checks that break builds if this consistency is violated. In this way, CuTe reifies the thread-to-data-layout mapping, makes it easier to write code that is "correct by construction". If the code compiles, it's probably correct.
 
			