Cutlass
CUDA Templates for Linear Algebra Subroutines and Solvers
clear_accumulators.h | Defines abstractions for efficiently clearing accumulator tiles |
convert.h | Defines conversion operations among Fragments of different base type |
coord.h | A Coord is a coordinate of arbitrary rank into a tensor or matrix |
core_io.h | Helpers for printing cutlass/core objects |
cutlass.h | Basic include for CUTLASS macros |
cutlass_math.h | Math utilities |
debug.h | Debugging and logging functionality |
dgemm_traits.h | Defines structural traits of double-precision GEMM |
fragment.h | Defines Fragment, a statically-sized array for storing parts of matrices within a thread's registers |
fragment_load_store.h | Defines accessors for loading and storing fragments to memory efficiently |
fragment_multiply_add.h | Defines multiply-add operations on fragments within a thread |
gemm.h | Implements a software-pipelined efficient GEMM |
gemm_epilogue.h | Implements the epilogue phase of the GEMM kernel that efficiently updates global memory with the computed matrix product |
gemm_epilogue_traits.h | Defines structural properties of the GEMM epilogue |
gemm_global_stream.h | Implements efficient loading of the thread block-level tile from global memory and storing to shared memory |
gemm_global_tile.h | Defines iterators for efficiently loading and storing to global memory |
gemm_operand.h | Defines constant expressions for mapping GEMM problem size and strides onto pitch-linear memory |
gemm_shared_stream.h | Defines abstractions for managing loading and storing fragments to shared memory in the efficient GEMM pipeline |
gemm_shared_tile.h | Defines iterators for efficiently loading and storing tiles to and from shared memory |
gemm_traits.h | Defines structural properties of complete GEMM computation |
hgemm_global_tile.h | Tile traits used to construct global tile iterator for HGEMM. This is intended to partition the thread block-level tile into 2D subtiles loaded by the threads and facilitate memory accesses larger than 16 bits |
hgemm_multiply_add.h | Specialization implementing multiply-add operation on half-precision floating point fragments |
hgemm_swizzle.h | Transposes a tile of 16-bit elements. Used by HGEMM to construct a K-strided layout in shared memory for multiplicands |
hgemm_traits.h | Defines structural properties of half-precision GEMM computation |
identity_block_swizzle.h | Defines functors for mapping blockIdx to partitions of the GEMM computation |
igemm_epilogue.h | Defines the epilogue phase of the GEMM computation for IGEMM, supporting integer and floating-point output matrix formats |
igemm_global_tile.h | Implements tile iterators to partition the thread block tile into 2D subtiles and efficiently load each. Applies permute transformation to construct 'interleaved K-strided' data layout in which 4-element dot products from the same K index are arranged in consecutive locations within shared memory |
igemm_multiply_add.h | Implements matrix multiply accumulate operation of 8-bit integer data using DP4A instruction |
igemm_swizzle.h | Transposes a fragment of data containing packed 8-bit integer elements |
igemm_traits.h | Defines structural properties of mixed-precision integer GEMM. Multiplicands are assumed to be packed 8-bit integers, accumulators are assumed to be 32-bit signed integers, and output formats vary |
iterator_access.h | Free functions for loading and storing to implementations of tile iterator concepts |
linear_scaling.h | Implements the BLAS linear scaling function alpha*AB + beta*C |
load_store.h | Defines abstractions for efficiently loading and storing vectors to memory |
matrix_traits.h | Defines properties of matrices used to denote layout and operands to GEMM kernels |
platform.h | C++ features that may be otherwise unimplemented for CUDA device functions |
predicate_vector.h | Defines container classes and iterators for managing a statically sized vector of boolean predicates |
reshape_tile.h | Defines a type for restructuring a tile |
sgemm_traits.h | Defines structural properties of single-precision GEMM |
shape.h | Defines Shape implementing the Layout concept for representing a 4D hypercube of objects |
tensor_ref.h | Defines a structure containing strides, bounds, and a pointer to tensor data |
tensor_view.h | Defines a structure containing strides and a pointer to tensor data |
thread_multiply_add.h | Template implementing matrix multiply-add operations on fragments |
tile_iterator.h | Defines the Tile Traits concept and iterators for loading and storing to tiles efficiently |
tile_traits_standard.h | Defines tile traits for several tile partitioning arrangements of threads expected to achieve efficient streaming performance |
vector.h | Defines a 1D vector of elements held in the registers of each thread |
wmma_gemm_epilogue_traits.h | Defines structural properties of WMMA GEMM's epilogue phase |
wmma_gemm_global_tile.h | Defines tile iterator traits for loading thread block-level tile from global memory |
wmma_gemm_multiply_add.h | Implements warp-level matrix multiply-accumulate operation using CUDA WMMA API |
wmma_gemm_shared_tile.h | Defines iterator traits for efficiently loading and storing fragment to and from shared memory, specialized for WMMA GEMM |
wmma_gemm_traits.h | Defines structural properties of GEMM targeting WMMA API in CUDA |
wmma_matrix.h | Abstractions for loading and storing matrices using the CUDA WMMA API |
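
For orientation, the linear scaling function named in the linear_scaling.h entry, alpha*AB + beta*C, can be sketched on the host in plain C++. This is only a reference illustration and does not use any CUTLASS header; the function name `reference_gemm` and the row-major layout are assumptions made for the example, not part of the library.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical host-side reference, not a CUTLASS API.
// Computes D = alpha * (A * B) + beta * C for row-major matrices:
// A is M x K, B is K x N, C and D are M x N.
std::vector<float> reference_gemm(std::size_t M, std::size_t N, std::size_t K,
                                  float alpha,
                                  const std::vector<float>& A,
                                  const std::vector<float>& B,
                                  float beta,
                                  const std::vector<float>& C) {
  std::vector<float> D(M * N);
  for (std::size_t i = 0; i < M; ++i) {
    for (std::size_t j = 0; j < N; ++j) {
      // Accumulate the dot product of row i of A and column j of B.
      float accum = 0.0f;
      for (std::size_t k = 0; k < K; ++k) {
        accum += A[i * K + k] * B[k * N + j];
      }
      // Linear scaling step: scale the product and add the scaled source.
      D[i * N + j] = alpha * accum + beta * C[i * N + j];
    }
  }
  return D;
}
```

A reference of this kind is typically useful for checking the output of a GPU GEMM on small problem sizes; it makes no attempt at the tiling, pipelining, or shared-memory staging that the headers above describe.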