![ALT](/media/images/gemm-hierarchy-with-epilogue-no-labels.png "CUTLASS Programming Guidelines") [README](/README.md#documentation) > **Programming Guidelines** # Programming Guidelines ## Hierarchical Organization CUTLASS embodies a design paradigm exemplified by the [CUB library](https://nvlabs.github.io/cub/) for expressing collective operations. Objects expose an interface for a problem that is then decomposed into concurrent subtasks executed by cooperating threadblocks, warps, and threads. For example, a grid-level object may be constructed with base pointers to the start of a GEMM operation, add a threadblock-dependent offset to partition the problem, and then compute a per-threadblock GEMM. This in turn performs some operations as a collection of cooperating threads, while it may partition other parts of the task into warp-level subtasks. Consequently, CUTLASS components are organized by the computation then by the layer of the following hierarchy. * *device*: an operation is _device-wide_ and may launch one or more kernels on the GPU * *kernel*: an operation is implemented by a CUDA kernel with definitions for `__shared__` memory and constant memory allocations * *threadblock*: an operation is collectivey executed by a threadblock; any component calling `__syncthreads()` is likely to be threadblock-scope * *warp*: an operation is collectively executed by a warp; threads within the context of a warp are referred to as _lane_ * *thread*: an operation is performed by an individual thread with no other data sharing or interaction with other threads * *instruction*: an operation corresponds to an individual hardware or PTX instruction ## Design Patterns CUTLASS strives to achieve the highest performance possible on NVIDIA GPUs while also offering a flexible composition that an be easily applied to solve new problems related to Deep Learning and linear algebra. Though we intend to make CUTLASS as simple and straightforward as possible, given a tradeoff between simplicity and performance, CUTLASS chooses performance. Consequently, several design patterns are necessary to yield a composable structure while also satisfying these performance objectives. This section is intended to provide more detail. ### Templates CUDA C++ templates and modern generic programming techniques enable CUTLASS device code to span a large design space. This design space includes: * Mixed precision arithmetic and data storage * Kernels specialized for layout and problem size * Support for kernel fusion Moreover, templates provided a structured approach to collecting compile-time constants such as tile dimensions. These must be template arguments to target static array allocation and take advantage of loop unrolling, constant folding, and function inlining. ### Constant Memory Several CUTLASS template classes exhibit a pattern in which problem-specific internal state is known at kernel launch time and remains invariant throughout the execution of a kernel. For example, tile iterators compute several offsets based on the strides of the input tensor that is added to an internal pointer when loading the elements of a tile. These are computed from the tensor stride and never updated; the per-thread internal state consists only of the internal global memory pointer. CUTLASS can take advantage of this CUDA grid-invariant property by constructing the object in host code and passing a composed parameters structure to the kernel. This confers two benefits: (1.) invariant state is held in constant memory, and (2.) there is no overhead to compute the initial state by each thread. The design pattern in CUTLASS is for classes with nontrivial constructors to define `struct Params` as an inner class which contains grid-invariant state. These should define a constructor and an `initialize()` method. The `Params` structure should also include a data member corresponding to each data member in the parent class, so these too can be properly constructed in host code. The parent class should define a constructor which accepts `Params const &` as its first argument. ### Composable Shared Memory Shared memory requires explicit effort by the programmer to allocate and de-allocate. CUTLASS follows the paradigm introduced by [CUB](https://nvlabs.github.io/cub/) to define composed structures for storing data intended to be held in shared memory. Any object requiring shared memory storage for itself or its data members should define a child structure called `SharedStorage`. This holds data needed by the class and also instantiates `SharedStorage` objects for each data member. To be consistent, this pattern defines a convention in which classes define internal shared memory storage requirements. Classes should consider all SharedStorage structures to be opaque other than their own child class. When the lifetimes of child objects are known to be non-overlapping, unions may be used to alias multiple SharedStorage objects to the same shared memory region and reduce overall SMEM capacity. ### Loop Unrolling CUTLASS requires tiles of data to be stored in registers for high-bandwidth access. Simultaneously, high-throughput math instructions must be issued concurrently with memory instructions to hide latency with relatively few concurrent threads. These objectives are achieved by unrolling loops whose iteration counts are known at compile time. Consequently, most loops within the CUTLASS GEMM implementation are specified by constant values and template arguments. The CUDA compiler is able to unroll the loop bodies, map array elements to registers, and construct an efficient instruction schedule. All loops expected to be unrolled should be annotated with `CUTLASS_PRAGMA_UNROLL` to explicitly direct the compiler to unroll them. ```c++ int const kN = 8; Array x; // Array we would like to store in registers CUTLASS_PRAGMA_UNROLL // Directs the CUDA compiler to unroll this loop. for (int idx = 0; idx < kN; ++idx) { // Loop has constant number of iterations. x[i] = float(idx); // Indirect access by induction variable results in // direct register access. } ``` ## Style ### C++ Style CUTLASS source code follows the [Google C++ Style Guide](https://google.github.io/styleguide/cppguide.html) with exceptions and extensions. Design choices should be consistent with the [CppCoreGuidelines](https://github.com/isocpp/CppCoreGuidelines/blob/master/CppCoreGuidelines.md) recommendations by Stroustrup and Sutter. ### CUDA Built-in Variables Avoid direct access to CUDA built-in variables `threadIdx`, `blockIdx`, `blockDim`, and `gridDim` within CUTLASS components except in special circumstances. Using built-in 'global' variables directly within resuable components necessitates that all components use them consistently which may not be possible if CUTLASS components are used in other contexts. Instead, components should accept a linear ID identifying threads, warps, and threadblocks from calling code. The top-level kernel may then decide how to map threads, warps, and blocks to the problem it is solving. ### Use CUTLASS Fundamental Types Use the [fundamental types](fundamental_types.md) defined in CUTLASS consistently. Doing so contributes to a framework of interoperable, consistent components. In particular, be sure to use: * [Numeric types](fundamental_types.md#numeric-types) to represent numeric data in host and device code * [Containers](fundamental_types.md#containers) to store data in register-backed arrays * [functional.h](fundamental_types.md#functional) to perform numeric operations in generic code * [Layouts](layout.md) to store stride and partially specialize template classes * [`TensorRef` and `TensorView`](layout.md#tensorref) to pass pointers and layout objects Avoid defining alternative implementations of the same functionality. Instead, prefer to enhance or extend additional components where it makes sense. ### Classes and Structs Type names use `CapitalLetters` except when implementations are a _perfect_ drop-in replacement for Standard Library components. Follow the [CppCoreGuidelines](https://github.com/isocpp/CppCoreGuidelines/blob/master/CppCoreGuidelines.md#Rc-struct) to decide whether to use `class` or `struct`. Namely, * use `class` when the object must maintain an invariant. Data members related to the invariant should be private. * use `struct` when the class has no invariant to maintain, and data members may vary arbitrarily. ### Class Members Methods and members are written using `snake_case`. Private data and function members have suffix `_`. ### Constant names CUTLASS makes extensive use of constants and compile-time evaluation. Constant variable names should have prefix `k` and use mixed case. True compile-time constsants should be defined as `constexpr` to enable dependent `constexpr` functions. CUTLASS uses ["East const"](http://slashslash.info/2018/02/a-foolish-consistency/) style, placing `constexpr` keyword after the type name. ```c++ float constexpr kPi = 3.14159f; ``` ### Class Member Order Members within classes and structures should be organized as follows: 1. Type and constant definitions 2. Data members 3. Constructors 4. Other methods This convention follows the [CUB library](https://nvlabs.github.io/cub/) and is also described by [Howard Hinnant](https://howardhinnant.github.io/classdecl.html). Unsurprisingly, it approximates the usual ordering of chapters in a typical Systems and Controls textbook. That is, (1.) identify relevant constants, (2.) define a state-space representation of the dynamical system under study (i.e. the data members), and (3.) devote subsequent chapters to definining dynamical behavior of the system (i.e. the methods). _Example_: ```c++ class A { public: // Type definitions protected: // protected Type definitions private: // private Type definitions public: // Data members protected: // protected data members private: // private data members public: // Methods protected: // protected methods private: // private methods }; ``` ### File Names Files should be named using `snake_case` with extension `.h` for header files, `.cu` for CUDA sources, and `.cpp` for C++ host-only source files. ### Use scoped enums Use scoped enums added in C++11 for enumerated types. Use capital letters for the enumerated type name and prefix `k` for enumerators like other constants. ```c++ enum class MatrixOperation { kNone, kTranspose, kConjugate, kHermitian }; ``` ### Namespaces Namespaces are all lower case. The top-level namespace is `cutlass::`. The second nested namespace refers top the general category of operation performed by its members, and the third nested namespace refers to the CUDA execution model scope (if applicable). The bodies of namespace definitions should not be intented, and comments on the closing brace are welcome. ```c++ namespace cutlass { namespace gemm { namespace warp { struct MmaTensorCore { }; } // namespace warp } // namespace gemm } // namespace cutlass ``` ### Macros Avoid defining macros except where preprocessing is obligatory. In particular, avoid using macros for constants. Several existing macros defined in `cutlass/cutlass.h` are useful for working around compiler-dependent behavior. Annotations for device code: * `CUTLASS_HOST_DEVICE` for functions running on the host and the device * `CUTLASS_DEVICE` for functions running on the device only Loop unrolling: * `CUTLASS_PRAGMA_UNROLL` for full unrolling of loops with constant trip counts * `CUTLASS_PRAGMA_NO_UNROLL` to prevent unrolling ### #pragma once Use `#pragma once` to guard all headers. ```c++ /*! */ #pragma once ... ``` ### Source Line Length Avoid lines longer than 100 characters. These typically wrap unfavorably when viewed in Github's pretty printer. # Copyright Copyright (c) 2017-2020, NVIDIA CORPORATION. All rights reserved. ``` Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: * Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. * Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TOR (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. ```