![ALT](/media/images/gemm-hierarchy-with-epilogue-no-labels.png "CUTLASS Code Organization") [README](/README.md#documentation) > **Code Organization** # CUTLASS Code Organization This document describes the layout of the CUTLASS repository. The main components are: * **CUTLASS Template Library** - CUDA Templates for Linear Algebra Subroutines and Solvers (header only) * **CUTLASS Utilities** - Additional templates * **CUTLASS Instance Library** - instantiations of CUTLASS templates covering the design space * **CUTLASS Profiler** - CUTLASS Library, Profiler, and Utilities * **Examples** - SDK examples of CUTLASS Template Library and components * **Media** - supporting documentation and media content * **Tests** - test components for CUTLASS Template Library and tools ## CUTLASS Template Library CUDA Templates for Linear Algebra Subroutines and Solvers is a library of CUDA C++ template classes for performing efficient matrix computations on NVIDIA GPUs. Like NVIDIA CUB, the components of CUTLASS are organized hierarchically based on the scope of cooperative elements. For example, warp-level GEMM components perform a matrix multiply collectively by the set of threads within a warp. The following figure illustrates each layer. Components are designed to be usable by client applications accessing functionailty at each scope. CUTLASS Templates are implemented by header files in the following directory structure: ``` include/ # Top-level include directory. Client applications should target this path. cutlass/ # CUDA Templates for Linear Algebra Subroutines and Solvers - headers only arch/ # direct exposure of architecture features (including instruction-level GEMMs) * gemm/ # code specialized for general matrix product computations thread/ # thread-level operators warp/ # warp-level operators threadblock/ # CTA-level operators kernel/ # CUDA kernel entry points device/ # launches kernel(s) over a full device * # scope-agnostic components and basic vocabular type definitions for GEMM layout/ # layout definitions for matrices, tensors, and other mathematical objects in memory * reduction/ # bandwidth-limited reduction kernels that do not fit the "gemm" models thread/ # thread-level operators warp/ # warp-level operators threadblock/ # CTA-level operators kernel/ # CUDA kernel entry points device/ # launches kernel(s) over a full device * # scope-agnostic components and basic vocabular type definitions transform/ # code specialized for layout, type, and domain transformations thread/ # thread-level operators warp/ # warp-level operators threadblock/ # CTA-level operators kernel/ # CUDA kernel entry points device/ # launches kernel(s) over a full device * # scope-agnostic components and basic vocabulary type definitions util/ # miscellaneous CUTLASS components * * # core vocabulary types and fundamental arithmetic operators ``` See [Programming Guidelines](/media/docs/programming_guidelines.md) for further details about conventions and design patterns used throughout CUTLASS. ## Tools The `tools/` directory contains clients of the CUTLASS Template library and includes the following. ## CUTLASS Instance Library The CUTLASS Instance Library contains instantiations of the above CUTLASS templates covering supported configurations, data types, block structure, and tile sizes. These instantiations are procedurally generated using a set of scripts to span the design space. ``` tools/ library/ # static/dynamic library containing all kernel instantiations of interest # (with some build-level filter switches to compile specific subsets, perhaps by architecture) include/ cutlass/ library/ # header files for CUTLASS Deliverables Library (in cutlass::library:: namespace) library.h # defines enums and structs to describe the tiled structure of operator instances manifest.h # collection of all instances scripts/ # scripts to procedurally generate CUTLASS template instances gemm_operations.py library.py generator.py # entry point of procedural generation scripts - invoked by cmake manifest.py src/ ``` ## Examples To demonstrate CUTLASS components, several SDK examples are implemented in `examples/`. When CMake is executed, the CUTLASS Instance Library generator scripts are executed to construct a set of instantiations in `build/tools/library/generated/`. The CUTLASS Profiler is designed to initialize the CUTLASS Instance Library and execute all operations contained therein. ### CUTLASS Utilities `tools/util/` defines a companion library of headers and sources that support the CUTLASS test programs, examples, and other client applications. Its structure is as follows: ``` tools/ util/ include/ cutlass/ util/ # CUTLASS Utility companion library reference/ # reference implementation of CUTLASS operators - minimal consideration for performance detail/ * device/ # device-side reference implementations of CUTLASS operators thread/ kernel/ * host/ # host-side reference implementations of CUTLASS operators * * ``` [More details about CUTLASS Utilities may be found here.](media/docs/utilities.md) ### CUTLASS Profiler This is application constructs an execution environment for evaluating the functionality and performance of CUTLASS components. It is implemented in ``` tools/ profiler/ ``` and may be built as follows. ``` $ make cutlass_profiler -j ``` [Further details about the CUTLASS Profiler are described here.](media/docs/profiler.md) ## Examples To demonstrate CUTLASS components, several SDK examples are implemented in `examples/`. ## Media This directory contains documentation, images, and performance result data which accompanies the CUTLASS library and components. ## Tests Test programs for CUTLASS. Tests are organized hierarchically, mirroring the organization of source files. ``` test/ # unit tests for CUTLASS Template Library unit/ arch/ core/ gemm/ device/ kernel/ thread/ threadblock/ warp/ reduction/ kernel/ thread/ transform/ threadblock/ * ``` Tests can be built and run at the top level scope by invoking `make test_unit` or by building and explicitly executing each individual target, e.g. `cutlass_test_unit_gemm_device`. Tests are configured to specify appropriate GTest filter strings to avoid running except on architectures where they are expected to pass. Thus, no tests should fail. The actual number of tests run may vary over time as more are added. # Copyright Copyright (c) 2017-2019, NVIDIA CORPORATION. All rights reserved. ``` Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: * Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. * Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. * Neither the name of the NVIDIA CORPORATION nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TOR (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. ```