parent
9fb38ac048
commit
a101ac283f
@ -293,7 +293,7 @@ mapping of 2.x layout tags to corresponding M-major, N-major, or K-major strides
| Matrix | CUTLASS 2.x layout | 2.x Shape | Logical major mode | 3.x Shape/Stride | Major ordinal |
| --- | --- | --- | --- | --- | --- |
| A | `ColumnMajor` | M x K | M major | M x K x L | 0 (outer) |
- | A | `RowMajor` | M x K | K major | N x K x L | 1 (inner) |
+ | A | `RowMajor` | M x K | K major | M x K x L | 1 (inner) |
| B | `RowMajor` | K x N | N major | N x K x L | 0 (outer) |
| B | `ColumnMajor` | K x N | K major | N x K x L | 1 (inner) |
| C | `ColumnMajor` | M x N | M major | M x N x L | 0 (outer) |
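As a concrete illustration of the corrected row, the logical major mode in the table above determines which mode carries stride 1. Below is a minimal host-side sketch, in plain C++ rather than the actual CuTe API; `strides_for_A` is a hypothetical helper introduced only for this example, covering matrix A of shape M x K x L:

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Simplified model of the table above (NOT the CuTe API): for matrix A of
// logical shape M x K x L, return the element strides
// (stride_M, stride_K, stride_L) implied by the 2.x layout tag.
// "M major" means M is the contiguous (stride-1) mode, major ordinal 0;
// "K major" means K is contiguous, major ordinal 1.
enum class LayoutTag { ColumnMajor, RowMajor };

std::array<int64_t, 3> strides_for_A(LayoutTag tag, int64_t M, int64_t K) {
  if (tag == LayoutTag::ColumnMajor) {
    // ColumnMajor A => M major: consecutive elements walk down a column.
    return {1, M, M * K};
  }
  // RowMajor A => K major: consecutive elements walk along a row.
  return {K, 1, M * K};
}
```

A `RowMajor` A is K major (major ordinal 1, the inner mode), so its shape stays `M x K x L` and the stride-1 position moves to the K mode, which is exactly what the corrected table row records.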
@ -229,7 +229,7 @@ as part of the kernel design. A thread block is partitioned into two sets of war
**Warp-Specialized Persistent kernel design**

Another flavor of Warp-Specialized kernel design introduced starting with Hopper is the [*Warp-Specialized Persistent*](/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_persistent.hpp) kernel. As in the Warp-Specialized kernel, the concepts of warp groups and barrier synchronization between warp groups remain the same in the persistent design. The distinctive features of the Warp-Specialized Persistent kernel are the following:
- * Persistent thread blocks launched to occupy as many SMs as mentioned in the [KernelHardwareInfo](include/cutlass/kernel_hardware_info.hpp) struct. These persistent thread blocks are used to tile the output and thus (potentially) compute multiple output tiles through their lifetime. The main benefit this adds is amortization of the thread-block launch and kernel prologue overheads which are typical of all kernels.
+ * Persistent thread blocks launched to occupy as many SMs as mentioned in the [KernelHardwareInfo](/include/cutlass/kernel_hardware_info.hpp) struct. These persistent thread blocks are used to tile the output and thus (potentially) compute multiple output tiles through their lifetime. The main benefit this adds is amortization of the thread-block launch and kernel prologue overheads which are typical of all kernels.
* Presence of two *consumer* warp groups, which allows the *epilogue* of one *consumer* warp group to be overlapped with the math operations of the other *consumer* warp group - thus maximizing tensor core utilization.
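The persistent work loop the bullets above describe can be sketched as follows. This is a simplified host-side model, not the CUTLASS implementation; `StaticScheduler`, `tiles_per_block`, and the round-robin tile assignment are assumptions made for illustration:

```cpp
#include <cassert>
#include <vector>

// Hypothetical sketch (NOT the CUTLASS scheduler): each persistent "thread
// block" repeatedly asks for its next output tile until tiles are exhausted,
// amortizing the one-time launch/prologue cost across multiple tiles.
struct StaticScheduler {
  int num_tiles;   // total number of output tiles
  int num_blocks;  // persistent blocks launched (roughly the number of SMs)
  // Block `bid` owns tiles bid, bid + num_blocks, bid + 2*num_blocks, ...
  int next(int bid, int iteration) const {
    int tile = bid + iteration * num_blocks;
    return tile < num_tiles ? tile : -1;  // -1 signals "no more work"
  }
};

// Count how many tiles each block processes over its lifetime.
std::vector<int> tiles_per_block(StaticScheduler s) {
  std::vector<int> count(s.num_blocks, 0);
  for (int bid = 0; bid < s.num_blocks; ++bid)
    for (int it = 0; s.next(bid, it) != -1; ++it)
      ++count[bid];
  return count;
}
```

With 10 output tiles and 4 persistent blocks, every tile is visited exactly once and each block computes 2 or 3 tiles, which is the amortization the first bullet refers to.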
Each *consumer* warp group is assigned a different output tile. The *producer* warp group synchronizes using the [Ordered Sequence Barrier](/include/cutlass/pipeline.hpp) to fill the buffers of the two *consumer* warp groups one after the other, in order. Since each thread block now computes multiple output tiles, the shape of the grid launch and the scheduling of tiles to thread blocks are managed using the new [*Tile Scheduler*](/include/cutlass/gemm/kernel/sm90_tile_scheduler.hpp). The *Tile Scheduler* considers the shape of the *clusters* as well as the number of available SMs to compute a valid scheduling of the output tiles to the launched thread blocks.
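A much-simplified host-side analogue of the ordered filling described above can be written with standard C++ threads standing in for warp groups. `OrderedTurns` and `fill_order` are hypothetical names for this sketch only, not the `cutlass::OrderedSequenceBarrier` API, and the hardware barrier is replaced by a mutex and condition variable:

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Toy analogue of an ordered sequence barrier: participants are forced to
// take their "turn" in a fixed round-robin order (group 0, then group 1, ...).
struct OrderedTurns {
  std::mutex m;
  std::condition_variable cv;
  int turn = 0;  // which group's buffer gets filled next

  void wait_for_turn(int group) {
    std::unique_lock<std::mutex> lk(m);
    cv.wait(lk, [&] { return turn == group; });
  }
  void advance(int num_groups) {
    {
      std::lock_guard<std::mutex> lk(m);
      turn = (turn + 1) % num_groups;
    }
    cv.notify_all();
  }
};

// Record the order in which the two groups' buffers are filled over `rounds`
// rounds; the barrier enforces the strict 0, 1, 0, 1, ... sequence.
std::vector<int> fill_order(int rounds) {
  OrderedTurns bar;
  std::vector<int> order;
  std::mutex order_m;
  auto worker = [&](int group) {
    for (int r = 0; r < rounds; ++r) {
      bar.wait_for_turn(group);
      {
        std::lock_guard<std::mutex> lk(order_m);
        order.push_back(group);
      }
      bar.advance(2);
    }
  };
  std::thread t0(worker, 0), t1(worker, 1);
  t0.join();
  t1.join();
  return order;
}
```

Regardless of how the OS schedules the two threads, the recorded order is always the strict alternation the barrier imposes, mirroring how the producer fills the two consumer warp groups' buffers one after the other.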