Here is a list of all files with brief descriptions:

clear_accumulators.h	Defines abstractions for efficiently clearing accumulator tiles
convert.h	Defines conversion operations among Fragments of different base type
coord.h	A Coord is a coordinate of arbitrary rank into a tensor or matrix
core_io.h	Helpers for printing cutlass/core objects
cutlass.h	Basic include for CUTLASS macros
cutlass_math.h	Math utilities
debug.h	Debugging and logging functionality
dgemm_traits.h	Defines structural traits of double-precision GEMM
fragment.h	Defines Fragment, a statically-sized array for storing parts of matrices within a thread's registers
fragment_load_store.h	Defines accessors for loading and storing fragments to memory efficiently
fragment_multiply_add.h	Defines multiply-add operations on fragments within a thread
gemm.h	Implements a software-pipelined efficient GEMM
gemm_epilogue.h	Implements the epilogue phase of the GEMM kernel that efficiently updates global memory with the computed matrix product
gemm_epilogue_traits.h	Defines structural properties of the GEMM epilogue
gemm_global_stream.h	Implements efficient loading of the thread block-level tile from global memory and storing to shared memory
gemm_global_tile.h	Defines iterators for efficiently loading and storing to global memory
gemm_operand.h	Defines constant expressions for mapping GEMM problem size and strides onto pitch-linear memory
gemm_shared_stream.h	Defines abstractions for managing loading and storing fragments to shared memory in the efficient GEMM pipeline
gemm_shared_tile.h	Defines iterators for efficiently loading and storing tiles to and from shared memory
gemm_traits.h	Defines structural properties of complete GEMM computation
hgemm_global_tile.h	Tile traits used to construct global tile iterator for HGEMM. This is intended to partition the thread block-level tile into 2D subtiles loaded by the threads and facilitate memory accesses larger than 16 bits
hgemm_multiply_add.h	Specialization implementing multiply-add operation on half-precision floating point fragments
hgemm_swizzle.h	Transposes a tile of 16-bit elements. Used by HGEMM to construct a K-strided layout in shared memory for multiplicands
hgemm_traits.h	Defines structural properties of half-precision GEMM computation
identity_block_swizzle.h	Defines functors for mapping blockIdx to partitions of the GEMM computation
igemm_epilogue.h	Defines the epilogue phase of the GEMM computation for IGEMM, supporting integer and floating-point output matrix formats
igemm_global_tile.h	Implements tile iterators to partition the thread block tile into 2D subtiles and efficiently load each. Applies permute transformation to construct 'interleaved K-strided' data layout in which 4-element dot products from the same K index are arranged in consecutive locations within shared memory
igemm_multiply_add.h	Implements matrix multiply accumulate operation of 8-bit integer data using DP4A instruction
igemm_swizzle.h	Transposes a fragment of data containing packed 8-bit integer elements
igemm_traits.h	Defines structural properties of mixed-precision integer GEMM. Multiplicands are assumed to be packed 8-bit integers, accumulators are assumed to be 32-bit signed integers, and output formats vary
iterator_access.h	Free functions for loading and storing to implementations of tile iterator concepts
linear_scaling.h	Implements the BLAS linear scaling function alpha * AB + beta * C
load_store.h	Defines abstractions for efficiently loading and storing vectors to memory
matrix_traits.h	Defines properties of matrices used to denote layout and operands to GEMM kernels
platform.h	C++ features that may be otherwise unimplemented for CUDA device functions
predicate_vector.h	Defines container classes and iterators for managing a statically sized vector of boolean predicates
reshape_tile.h	Defines a type for restructuring a tile
sgemm_traits.h	Defines structural properties of single-precision GEMM
shape.h	Defines Shape implementing the Layout concept for representing a 4D hypercube of objects
tensor_ref.h	Defines a structure containing strides and a pointer to tensor data
tensor_view.h	Defines a structure containing strides, bounds, and a pointer to tensor data
thread_multiply_add.h	Template implementing matrix multiply-add operations on fragments
tile_iterator.h	Defines the Tile Traits concept and iterators for loading and storing to tiles efficiently
tile_traits_standard.h	Defines tile traits for several tile partitioning arrangements of threads expected to achieve efficient streaming performance
vector.h	Defines a 1D vector of elements held in the registers of each thread
wmma_gemm_epilogue_traits.h	Defines structural properties of WMMA GEMM's epilogue phase
wmma_gemm_global_tile.h	Defines tile iterator traits for loading thread block-level tile from global memory
wmma_gemm_multiply_add.h	Implements warp-level matrix multiply-accumulate operation using CUDA WMMA API
wmma_gemm_shared_tile.h	Defines iterator traits for efficiently loading and storing fragment to and from shared memory, specialized for WMMA GEMM
wmma_gemm_traits.h	Defines structural properties of GEMM targeting WMMA API in CUDA
wmma_matrix.h	Abstractions for loading and storing matrices using the CUDA WMMA API