574 lines
		
	
	
		
			21 KiB
		
	
	
	
		
			Markdown
		
	
	
	
	
	
			
		
		
	
	
			574 lines
		
	
	
		
			21 KiB
		
	
	
	
		
			Markdown
		
	
	
	
	
	
| 
 | |
| 
 | |
| [README](/README.md#documentation) > **CUTLASS GEMM API**
 | |
| 
 | |
| # CUTLASS GEMM API
 | |
| 
 | |
| CUTLASS presents a uniform programming model for matrix multiply-accumulate operations at each level of the hierarchy. This document
 | |
| focuses on device-level, threadblock-level GEMMs, warp-level GEMMs, thread-level GEMMs, and instruction-level GEMMs.
 | |
| 
 | |
| # CUTLASS GEMM Model
 | |
| 
 | |
| CUTLASS implements the basic GEMM triple loop nest with a tiled structure mirroring the execution model hierarchy.
 | |
| 
 | |
| The following pseudocode describes the model for a GEMM kernel targeting a warp-synchronous matrix multiply instruction like
 | |
| mma.sync. The entire operation is referred to as "Gemm," as it is assumed that an epilogue operation performs the general matrix
 | |
| update similar to BLAS.
 | |
| 
 | |
| ```c++
 | |
|                                                                             // cutlass::gemm::device::Gemm
 | |
|                                                                             //
 | |
| for (int cta_n = 0; cta_n < GemmN; cta_n += CtaTileN) {                     // for each CTA       } CTA-level concurrency
 | |
|   for (int cta_m = 0; cta_m < GemmM; cta_m += CtaTileM) {                   //    for each CTA    }
 | |
|                                                                             //    
 | |
|                                                                             // cutlass::gemm::threadblock::Mma
 | |
|                                                                             //
 | |
|     for (int cta_k = 0; cta_k < GemmK; cta_k += CtaTileK) {                 //       "GEMM mainloop" - no unrolling - one iteration of this loop is one "stage"
 | |
|                                                                             //
 | |
|       for (int warp_n = 0; warp_n < CtaTileN; warp_n += WarpTileN) {        // for each warp      } warp-level concurrency
 | |
|         for (int warp_m = 0; warp_m < CtaTileM; warp_m += WarpTileM) {      //    for each warp   }
 | |
|                                                                             //
 | |
|           for (int warp_k = 0; warp_k < CtaTileK; warp_k += WarpTileK) {    //       fully unroll across CtaTileK - one iteration of this loop is one "k Group"
 | |
|                                                                             //
 | |
|             for (int mma_k = 0; mma_k < WarpTileK; mma_k += MmaK) {         // cutlass::gemm::warp::Mma
 | |
|               for (int mma_n = 0; mma_n < WarpTileN; mma_n += MmaN) {       //
 | |
|                 for (int mma_m = 0; mma_m < WarpTileM; mma_m += MmaM) {     //
 | |
|                                                                             //
 | |
|                   mma_instruction(d, a, b, c);                              // cutlass::arch::mma - warp-wide matrix multiply instruction
 | |
| 
 | |
|                 }   // for mma_m
 | |
|               }   // for mma_n
 | |
|             }   // for mma_k
 | |
| 
 | |
|           }   // for warp_k
 | |
|         }   // for warp_m
 | |
|       }   // for warp_n
 | |
| 
 | |
|     }   // for cta_k
 | |
|   }   // for cta_m
 | |
| }   // for cta_n
 | |
| 
 | |
| ```
 | |
| 
 | |
| The outer-most loops correspond to CTA-level hardware concurrency and are not explicitly written as loops in the code. These
 | |
| are implied by CUDA grid launch semantics.
 | |
| 
 | |
| The comment `cutlass::gemm::threadblock::Mma` refers to the threadblock-scoped matrix multiply-accumulate concept. This is
 | |
| the computation performed by one threadblock to compute a matrix product in registers. The "GEMM main loop" is listed.
 | |
| 
 | |
| The comment `cutlass::gemm::warp::Mma` refers to the computation performed by each warp. This is a nested loop executing a
 | |
| sequence of accumulated outer products. 
 | |
| 
 | |
| The inner-most operation corresponds directly to hardware support. In this example, the nested structure terminates with
 | |
| warp-synchronous matrix multiply instructions targeting Tensor Cores. 
 | |
| Alternatively, GEMMs targeting single-thread instructions may have an additional series of nested loops corresponding to 
 | |
| thread-level concurrency.
 | |
| 
 | |
| # CUTLASS GEMM Components
 | |
| 
 | |
| This loop nest is expressed in CUTLASS via the following components which are specialized for data type, layout, and
 | |
| math instruction.
 | |
| 
 | |
| 
 | |
| 
 | |
| These components are described in the following sections.
 | |
| 
 | |
| ## Device-wide GEMM API
 | |
| 
 | |
| The device-level GEMM API is intended to streamline instantiation and execution of the standard
 | |
| GEMM computation across the GPU. This operator is intended to be used in host-side .cu code and
 | |
| has semantics similar to cuBLAS.
 | |
| 
 | |
| The device-wide GEMM API is embodied by the following operators:
 | |
| - [cutlass::gemm::device::Gemm](/include/cutlass/gemm/device/gemm.h) - basic GEMM operation
 | |
| - [cutlass::gemm::device::GemmArray](/include/cutlass/gemm/device/gemm_array.h) - batched GEMM operation in which input matrices are read from arrays of pointers
 | |
| - [cutlass::gemm::device::GemmBatched](/include/cutlass/gemm/device/gemm_batched.h) - batched GEMM operation in which input matrices are separated by a constant stride
 | |
| - [cutlass::gemm::device::GemmSplitKParallel](/include/cutlass/gemm/device/gemm_splitk_parallel.h) - GEMM operation that partitions the GEMM K dimension then launches a separate reduction kernel
 | |
| 
 | |
| **Example:** launch a mixed-precision GEMM targeting Volta Tensor Cores.
 | |
| ```c++
 | |
|   using Gemm = cutlass::gemm::device::Gemm<
 | |
|     cutlass::half_t,                           // ElementA
 | |
|     cutlass::layout::ColumnMajor,              // LayoutA
 | |
|     cutlass::half_t,                           // ElementB
 | |
|     cutlass::layout::ColumnMajor,              // LayoutB
 | |
|     cutlass::half_t,                           // ElementOutput
 | |
|     cutlass::layout::ColumnMajor,              // LayoutOutput
 | |
|     float,                                     // ElementAccumulator
 | |
|     cutlass::arch::OpClassTensorOp,            // tag indicating Tensor Cores
 | |
|     cutlass::arch::Sm70                        // tag indicating target GPU compute architecture
 | |
|   >;
 | |
| 
 | |
|   Gemm gemm_op;
 | |
|   cutlass::Status status;
 | |
|  
 | |
|   //
 | |
|   // Launch GEMM on the device
 | |
|   //
 | |
|  
 | |
|   status = gemm_op({
 | |
|     {m, n, k},
 | |
|     {ptrA, lda},
 | |
|     {ptrB, ldb},
 | |
|     {ptrC, ldc},
 | |
|     {ptrD, ldd},
 | |
|     {alpha, beta}
 | |
|   });
 | |
| 
 | |
|   if (status != cutlass::Status::kSuccess) {
 | |
|     return -1;
 | |
|   }
 | |
| ```
 | |
| 
 | |
| 
 | |
| ## Threadblock-level GEMM API
 | |
| 
 | |
| GEMMs at this scope are expected to efficiently load tiles of data from global memory into internal storage and then compute matrix
 | |
| products with warp-level GEMM operators.
 | |
| 
 | |
| The threadblock-scoped matrix multiply operation is embodied by 
 | |
| [cutlass::gemm::threadblock::MmaPipelined](/include/cutlass/gemm/threadblock/mma_pipelined.h).
 | |
| This is a class inspired by [std::transform_reduce()](https://en.cppreference.com/w/cpp/algorithm/transform_reduce) 
 | |
| which computes the accumulated matrix product of a range of tiles defined by tile iterators.
 | |
| 
 | |
| 
 | |
| 
 | |
| In the case of GEMM, the tile iterators are 
 | |
| [cutlass::transform::threadblock::PredicatedTileIterator](/include/cutlass/transform/threadblock/predicated_tile_iterator.h)
 | |
| to traverse a sequence of tiles in global memory with appropriate predication to avoid out-of-bounds
 | |
| memory accesses.
 | |
| 
 | |
| *Concept.* Threadblock-level matrix multiply accumulate operators are function objects satisfying the following concept.
 | |
| ```c++
 | |
| struct Mma {
 | |
|   /// Shape of warp-level matrix operation (concept: GemmShape)
 | |
|   struct Shape;
 | |
| 
 | |
|   /// Data type of multiplicand A (concept: numeric type)
 | |
|   struct ElementA;
 | |
| 
 | |
|   /// Layout of multiplicand A (concept: Layout)
 | |
|   struct LayoutA;
 | |
| 
 | |
|   /// Data type of multiplicand B (concept: numeric type)
 | |
|   struct ElementB;
 | |
| 
 | |
|   /// Layout of multiplicand B (concept: Layout)
 | |
|   struct LayoutB;
 | |
| 
 | |
|   /// Data type of accumulator matrix C (concept: numeric type)
 | |
|   struct ElementC;
 | |
| 
 | |
|   /// Layout of accumulator matrix C (concept: Layout)
 | |
|   struct LayoutC;
 | |
| 
 | |
|   /// Iterator of A operand in shared memory - satisfies: ReadableRandomAccessTileIteratorConcept
 | |
|   struct IteratorA;
 | |
| 
 | |
|   /// Fragment object loaded from IteratorA (concept: Array<ElementA, ..>)
 | |
|   struct FragmentA;
 | |
| 
 | |
|   /// Iterator of B operand in shared memory - satisfies: ReadableRandomAccessTileIteratorConcept
 | |
|   struct IteratorB;
 | |
| 
 | |
|   /// Fragment object loaded from IteratorB (concept: Array<ElementB, ..>)
 | |
|   struct FragmentB;
 | |
| 
 | |
|   /// Iterator of C operand in shared memory - 
 | |
|   ///    satisfies: ReadableRandomAccessTileIteratorConcept | WriteableRandomAccessTileIteratorConcept
 | |
|   struct IteratorC;
 | |
| 
 | |
|   /// Fragment object loaded from IteratorC (concept: Array<ElementC, ..>)
 | |
|   struct FragmentC;
 | |
| 
 | |
|   /// Warp-level matrix multiply operator (concept: satisfies gemm::warp::Mma)
 | |
|   struct Operator;
 | |
| 
 | |
|   //
 | |
|   // Method
 | |
|   //
 | |
| 
 | |
|   /// Computes a matrix product accumulated in D
 | |
|   CUTLASS_DEVICE
 | |
|   void operator()(
 | |
|     FragmentC &D, 
 | |
|     IteratorA iter_A, 
 | |
|     IteratorB iter_B, 
 | |
|     FragmentC const &C);
 | |
| };
 | |
| ```
 | |
| 
 | |
| ## Warp-level Matrix Multiply API
 | |
| 
 | |
| Warp-level GEMM operators load tiles from shared memory into registers and then compute matrix multiplies using either 
 | |
| Tensor Cores or CUDA Cores. The result is accumulated in a register tile. Iterators are defined for each
 | |
| operand `A`, `B`, and `C`.
 | |
| 
 | |
| The warp-level GEMM API is a generalization of CUDA's WMMA API to achieve the following objectives:
 | |
| 
 | |
| - native matrix multiply sizes of Tensor Cores
 | |
| - permuted shared memory layouts to ensure conflict-free accesses
 | |
| - pointer initilization outside of the mainloop
 | |
| - efficient traversal
 | |
| 
 | |
| Defining a warp-level matrix multiply in CUTLASS is similar to WMMA as shown below.
 | |
| 
 | |
| 
 | |
| 
 | |
| The usage model is also similar. The following example computes a warp-level GEMM operation,
 | |
| accumulating a series of matrix products in a register-backed array. The input to a warp-level
 | |
| GEMM operation in CUTLASS _must_ be data in shared memory loaded by iterators or on 
 | |
| register-backed fragments.
 | |
| 
 | |
| 
 | |
| 
 | |
| ```c++
 | |
| #include "cutlass/gemm/warp/default_mma_tensor_op.h"
 | |
| 
 | |
| using LayoutA = cutlass::layout::ColumnMajorTensorOpMultiplicandCongruous<
 | |
|     cutlass::sizeof_bits<Element>::value, 64>;
 | |
| 
 | |
| using LayoutB = cutlass::layout::RowMajorTensorOpMultiplicandCongruous<
 | |
|     cutlass::sizeof_bits<Element>::value, 64>;
 | |
| 
 | |
| using WarpMma = typename cutlass::gemm::warp::DefaultMmaTensorOp<
 | |
|     cutlass::gemm::GemmShape<64, 64, 8>,                            // Overall warp-level GEMM operation
 | |
|     cutlass::gemm::GemmShape<16, 8, 8>,                             // Target instruction
 | |
|     cutlass::half_t, LayoutA,                                       // operand A type and layout
 | |
|     cutlass::half_t, LayoutB,                                       // operand B type and layout
 | |
|     float,                                                          // accumulator type
 | |
|     cutlass::layout::RowMajor>::Type;                               // accumulator layout
 | |
| 
 | |
| //
 | |
| // Define a GEMM operation loading data from shared memory
 | |
| //
 | |
| int const kGemmK = 32;
 | |
| 
 | |
| __shared__ ElementA smem_buffer_A[WarpMma::Shape::kM * kGemmK];
 | |
| __shared__ ElementB smem_buffer_B[WarpMma::Shape::kN * kGemmK];
 | |
| 
 | |
| //
 | |
| // Construct iterators into SMEM tiles
 | |
| //
 | |
| 
 | |
| // leading dimensions inferred from matrix problem size
 | |
| int lda = WarpMma::Shape::kM;
 | |
| int ldb = WarpMma::Shape::kN;
 | |
| 
 | |
| // iterators into shared memory
 | |
| WarpMma::IteratorA warp_iterator_A({smem_buffer_A, lda});
 | |
| WarpMma::IteratorB warp_iterator_B({smem_buffer_B, ldb});
 | |
| 
 | |
| // Fragments in registers storing the operands
 | |
| FragmentA frag_A;
 | |
| FragmentB frag_B;
 | |
| FragmentC accum;
 | |
| 
 | |
| WarpMma mma;
 | |
| 
 | |
| accum.clear();
 | |
| 
 | |
| //
 | |
| // Accumulated outer product
 | |
| //
 | |
| 
 | |
| #pragma unroll 1
 | |
| for (int k = 0; k < kGemmK; k += WarpMma::Shape::kK) {
 | |
| 
 | |
|   
 | |
|   iter_A.load(frag_A);  // Load fragments from A and B matrices
 | |
|   iter_B.load(frag_B);
 | |
| 
 | |
|   ++iter_A; ++iter_B;   // Advance along GEMM K to next tile in A
 | |
|                         //   and B matrices
 | |
| 
 | |
|                         // Compute matrix product
 | |
|   mma(accum, frag_A, frag_B, accum);
 | |
| }
 | |
| ```
 | |
| 
 | |
| *Concept.* Warp-level Mma operations are function objects satisfying the following concept.
 | |
| 
 | |
| ```c++
 | |
| struct Mma {
 | |
|   /// Shape of warp-level matrix operation (concept: GemmShape)
 | |
|   struct Shape;
 | |
| 
 | |
|   /// Data type of multiplicand A (concept: numeric type)
 | |
|   struct ElementA;
 | |
| 
 | |
|   /// Layout of multiplicand A (concept: Layout)
 | |
|   struct LayoutA;
 | |
| 
 | |
|   /// Data type of multiplicand B (concept: numeric type)
 | |
|   struct ElementB;
 | |
| 
 | |
|   /// Layout of multiplicand B (concept: Layout)
 | |
|   struct LayoutB;
 | |
| 
 | |
|   /// Data type of accumulator matrix C (concept: numeric type)
 | |
|   struct ElementC;
 | |
| 
 | |
|   /// Layout of accumulator matrix C (concept: Layout)
 | |
|   struct LayoutC;
 | |
| 
 | |
|   /// Iterator of A operand in shared memory - satisfies: ReadableRandomAccessTileIteratorConcept
 | |
|   struct IteratorA;
 | |
| 
 | |
|   /// Fragment object loaded from IteratorA (concept: Array<ElementA, ..>)
 | |
|   struct FragmentA;
 | |
| 
 | |
|   /// Iterator of B operand in shared memory - satisfies: ReadableRandomAccessTileIteratorConcept
 | |
|   struct IteratorB;
 | |
| 
 | |
|   /// Fragment object loaded from IteratorB (concept: Array<ElementB, ..>)
 | |
|   struct FragmentB;
 | |
| 
 | |
|   /// Iterator of C operand in shared memory - 
 | |
|   ///     satisfies: ReadableRandomAccessTileIteratorConcept | WriteableRandomAccessTileIteratorConcept
 | |
|   struct IteratorC;
 | |
| 
 | |
|   /// Fragment object loaded from IteratorC (concept: Array<ElementC, ..>)
 | |
|   struct FragmentC;
 | |
| 
 | |
|   /// Indicates class of matrix operator (arch::OpClassSimt or arch::OpClassTensorOp)
 | |
|   struct OperatorClass;
 | |
| 
 | |
|   //
 | |
|   // Methods
 | |
|   //
 | |
| 
 | |
|   /// Computes a matrix multiply-accumulate
 | |
|   CUTLASS_DEVICE
 | |
|   void operator()(
 | |
|     FragmentC &D, 
 | |
|     IteratorA A, 
 | |
|     IteratorB B, 
 | |
|     FragmentC const &C);
 | |
| };
 | |
| ```
 | |
| 
 | |
| 
 | |
| 
 | |
| *Tensor Core Operators.* Warp-level matrix multiply operators targeting Tensor Cores
 | |
| may be defined with the following template arguments. The `Policy` type specifies implementation-level details which may 
 | |
| be used to affect performance or internal implementation of the warp-level operator.
 | |
| 
 | |
| ```c++
 | |
| namespace cutlass {
 | |
| namespace gemm {
 | |
| namespace warp {
 | |
| 
 | |
| /// Structure to compute the matrix product targeting CUDA cores and SIMT math instructions.
 | |
| template <
 | |
|   /// Size of the Gemm problem - concept: gemm::GemmShape<>
 | |
|   typename Shape_,
 | |
|   /// Data type of A elements
 | |
|   typename ElementA_,
 | |
|   /// Layout of A matrix (concept: MatrixLayout)
 | |
|   typename LayoutA_,
 | |
|   /// Data type of B elements
 | |
|   typename ElementB_,
 | |
|   /// Layout of B matrix (concept: MatrixLayout)
 | |
|   typename LayoutB_,
 | |
|   /// Element type of C matrix
 | |
|   typename ElementC_,
 | |
|   /// Layout of C matrix (concept: MatrixLayout)
 | |
|   typename LayoutC_,
 | |
|   /// Shape of the warp in units of thread (concept: MmaSimtPolicy)
 | |
|   typename Policy_,
 | |
|   /// Used for partial specialization
 | |
|   typename Enable = bool
 | |
| >
 | |
| class MmaTensorOp {}
 | |
| 
 | |
| } // namespace warp
 | |
| } // namespace gemm
 | |
| } // namespace cutlass
 | |
| 
 | |
| ```
 | |
| 
 | |
| *SIMT Math Instructions.*  Warp-level matrix multiply operators targeting CUDA Cores
 | |
| may be defined with the following template arguments. The `Policy` type specifies implementation-level details which may 
 | |
| be used to affect performance or internal implementation of the warp-level operator.
 | |
| 
 | |
| ```c++
 | |
| /// Structure to compute the matrix product targeting CUDA cores and SIMT math instructions.
 | |
| template <
 | |
|   /// Size of the Gemm problem - concept: gemm::GemmShape<>
 | |
|   typename Shape_,
 | |
|   /// Data type of A elements
 | |
|   typename ElementA_,
 | |
|   /// Layout of A matrix (concept: MatrixLayout)
 | |
|   typename LayoutA_,
 | |
|   /// Data type of B elements
 | |
|   typename ElementB_,
 | |
|   /// Layout of B matrix (concept: MatrixLayout)
 | |
|   typename LayoutB_,
 | |
|   /// Element type of C matrix
 | |
|   typename ElementC_,
 | |
|   /// Layout of C matrix (concept: MatrixLayout)
 | |
|   typename LayoutC_,
 | |
|   /// Shape of the warp in units of thread (concept: MmaSimtPolicy)
 | |
|   typename Policy_,
 | |
|   /// Used for partial specialization
 | |
|   typename Enable = bool
 | |
| >
 | |
| class MmaSimt;
 | |
| ```
 | |
| 
 | |
| 
 | |
| ## Thread-level GEMM API
 | |
| 
 | |
| Thread-level GEMM operations perform matrix multiply-accumulate on data held in registers. These target CUDA Cores exclusively.
 | |
| 
 | |
| *Concept.* Thread-level matrix multiply operations are function objects satisfying the following concept.
 | |
| ```c++
 | |
| struct Mma {
 | |
| 
 | |
|   /// Shape of warp-level matrix operation (concept: GemmShape)
 | |
|   struct Shape;
 | |
| 
 | |
|   /// Data type of multiplicand A (concept: numeric type)
 | |
|   struct ElementA;
 | |
| 
 | |
|   /// Layout of multiplicand A (concept: Layout)
 | |
|   struct LayoutA;
 | |
| 
 | |
|   /// Fragment object loaded from IteratorA (concept: Array<ElementA, ..>)
 | |
|   struct FragmentA;
 | |
| 
 | |
|   /// Data type of multiplicand B (concept: numeric type)
 | |
|   struct ElementB;
 | |
| 
 | |
|   /// Layout of multiplicand B (concept: Layout)
 | |
|   struct LayoutB;
 | |
| 
 | |
|   /// Fragment object loaded from IteratorA (concept: Array<ElementB, ..>)
 | |
|   struct FragmentB;
 | |
| 
 | |
|   /// Data type of accumulator matrix C (concept: numeric type)
 | |
|   struct ElementC;
 | |
| 
 | |
|   /// Layout of accumulator matrix C (concept: Layout)
 | |
|   struct LayoutC;
 | |
| 
 | |
|   /// Fragment object loaded from IteratorA (concept: Array<ElementC, ..>)
 | |
|   struct FragmentC;
 | |
| 
 | |
|   //
 | |
|   // Methods
 | |
|   //
 | |
| 
 | |
|   /// Computes a matrix multiply-accumulate
 | |
|   CUTLASS_DEVICE
 | |
|   void operator()(
 | |
|     FragmentC &D, 
 | |
|     FragmentA const &A, 
 | |
|     FragmentB const &B, 
 | |
|     FragmentC const &C);
 | |
| };
 | |
| ```
 | |
| 
 | |
| The CUTLASS thread-level GEMM template accepts the following template arguments.
 | |
| ```c++
 | |
| namespace cutlass {
 | |
| namespace gemm {
 | |
| namespace thread {
 | |
| 
 | |
| /// Structure to compute the matrix product
 | |
| template <
 | |
|   /// Size of the Gemm problem - concept: gemm::GemmShape<>
 | |
|   typename Shape,
 | |
|   /// Data type of A elements
 | |
|   typename ElementA,
 | |
|   /// Layout of A matrix (concept: MatrixLayout)
 | |
|   typename LayoutA,
 | |
|   /// Data type of B elements
 | |
|   typename ElementB,
 | |
|   /// Layout of B matrix (concept: MatrixLayout)
 | |
|   typename LayoutB,
 | |
|   /// Element type of C matrix
 | |
|   typename ElementC,
 | |
|   /// Layout of C matrix (concept: MatrixLayout)
 | |
|   typename LayoutC,
 | |
|   /// Concept: arch::OpMultiplyAdd or arch::Mma<>
 | |
|   typename Operator = arch::OpMultiplyAdd,
 | |
|   /// Used for partial specialization
 | |
|   typename Enable = bool
 | |
| >
 | |
| struct Mma;
 | |
| 
 | |
| } // namespace thread
 | |
| } // namespace gemm
 | |
| } // namespace cutlass
 | |
| ```
 | |
| 
 | |
| ## Efficient Epilogue 
 | |
| 
 | |
| CUTLASS GEMM operators perform mma followed by epilogue operation similar 
 | |
| to cuBLAS. CUTLASS implements an efficient row-major epilogue. Thus, to achieve 
 | |
| column-major GEMM, operands A & B are transposed and swapped.
 | |
| 
 | |
| To enable efficient row-major epilogue for both row-major and column-major output layout, 
 | |
| CUTLASS' device-level GEMM operators `cutlass::device::Gemm` and `cutlass::device::GemmUniversal` 
 | |
| provide two template definitions:
 | |
| - (a) [General definition](/include/cutlass/gemm/device/gemm.h#L217)
 | |
| - (b) [Specialized definition for column-major source/output](/include/cutlass/gemm/device/gemm.h#L545)
 | |
| 
 | |
| Efficient row-major epilogue for:
 | |
| - (i)  GEMM operator on row-major source/output uses template (a). It runs row-major GEMM and 
 | |
| an efficient row-major epilogue.
 | |
| - (ii)  GEMM operator on column-major source/output uses template (b). It transposes and swaps 
 | |
| operands A and B to enable efficient epilogue. `A x B = C => Transpose(B) x Transpose(A) = Transpose(C)`.
 | |
| For column-major source (C) matrix, Transpose(C) is row-major, and efficient epilogue works on 
 | |
| row-major.
 | |
| 
 | |
| Note that cuBLAS typically expects a column-major source (C) and output matrix (D). Thus,
 | |
| CUTLASS library only instantiates and generates GEMM operatos with column-major layout. However, 
 | |
| CUTLASS by itself can run both row-major and column-major output layouts for all combinations 
 | |
| of input layouts. Thus, CUTLASS supports the following layout combinations for input and output layouts: 
 | |
| 
 | |
| - `{N,T} x {N,T} => {N,T}` - NN, TN, TN, TT GEMM for both row-major and column-major output
 | |
| 
 | |
| ## Instruction-level operations
 | |
| 
 | |
| CUTLASS defines a template-based interface to Tensor Core operations to avoid resorting
 | |
| to inline PTX.
 | |
| 
 | |
| - [mma_sm70.h](/include/cutlass/arch/mma_sm70.h) - Volta TensorCore operations
 | |
| - [mma_sm75.h](/include/cutlass/arch/mma_sm75.h) - Turing TensorCore operations
 | |
| 
 | |
| 
 | |
| # Copyright
 | |
| 
 | |
| Copyright (c) 2017 - 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 | |
| SPDX-License-Identifier: BSD-3-Clause
 | |
| 
 | |
| ```
 | |
|   Redistribution and use in source and binary forms, with or without
 | |
|   modification, are permitted provided that the following conditions are met:
 | |
| 
 | |
|   1. Redistributions of source code must retain the above copyright notice, this
 | |
|   list of conditions and the following disclaimer.
 | |
| 
 | |
|   2. Redistributions in binary form must reproduce the above copyright notice,
 | |
|   this list of conditions and the following disclaimer in the documentation
 | |
|   and/or other materials provided with the distribution.
 | |
| 
 | |
|   3. Neither the name of the copyright holder nor the names of its
 | |
|   contributors may be used to endorse or promote products derived from
 | |
|   this software without specific prior written permission.
 | |
| 
 | |
|   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
 | |
|   AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 | |
|   IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
 | |
|   DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
 | |
|   FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
 | |
|   DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
 | |
|   SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
 | |
|   CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
 | |
|   OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
 | |
|   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 | |
| ```
 | 
