We also provide a function to query the allocated size of the memory pool.
```python
bytes = get_allocated_size()
```
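For example, a trivial usage sketch that reports the current pool usage:
```python
# Illustrative: report the current pool usage in MiB.
print(f"allocated: {get_allocated_size() / (1 << 20):.1f} MiB")
```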
## Operation Description
PyCUTLASS provides operation descriptions for GEMM, GEMM Grouped, and Conv2d operations. These operation descriptions are assembled from four fundamental concepts:
* Math Instruction: math instruction executed in GPU cores
* Tile Description: tiling sizes and pipeline stages
* Operand Description: data type, layout, memory alignment
* Epilogue Functor: epilogue function
### Math Instruction
The math instruction is defined as follows:
```python
math_inst = MathInstruction(
{instruction_shape}, {element_a}, {element_b},
{element_acc}, {opclass}, {math_operation}
)
```
The `{instruction_shape}` and `{opclass}` define the instruction size and type; the valid combinations of the two depend on the opclass and the target architecture. `{element_a}` and `{element_b}` define the source operand data types for each instruction, and `{element_acc}` defines the accumulator type. `{math_operation}` defines the math operation applied.
`cutlass.OpClass.TensorOp` indicates that Tensor Cores are used, while `cutlass.OpClass.Simt` uses the SIMT cores.
The `multiply_add_fast_f32` operation emulates a fast and accurate SGEMM kernel accelerated with Ampere Tensor Cores. More details can be found in [examples/27_ampere_3xtf32_fast_accurate_tensorop_gemm](examples/27_ampere_3xtf32_fast_accurate_tensorop_gemm).
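As a concrete illustration, here is a sketch of a Tensor Core math instruction for half-precision inputs with `float32` accumulation; the `16x8x16` shape is one valid Ampere Tensor Core MMA shape, and the enum spellings follow the PyCUTLASS examples:
```python
# Illustrative: 16x8x16 Tensor Core MMA, fp16 inputs, fp32 accumulation.
math_inst = MathInstruction(
    [16, 8, 16],                       # {instruction_shape}: [M, N, K]
    cutlass.float16, cutlass.float16,  # {element_a}, {element_b}
    cutlass.float32,                   # {element_acc}
    cutlass.OpClass.TensorOp,          # execute on Tensor Cores
    MathOperation.multiply_add         # standard multiply-add
)
```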
### Tile Description
The tile description describes the threadblock and warp tiling sizes, as well as the pipeline stages.
```python
tile_description = TileDescription(
{threadblock_shape}, {stages}, {warp_count},
math_inst
)
```
The `{threadblock_shape}` is a list of 3 integers `[Tile_M, Tile_N, Tile_K]` that defines the threadblock tiling size. `{stages}` defines the number of software pipeline stages ([detail](https://developer.nvidia.com/blog/controlling-data-movement-to-boost-performance-on-ampere-architecture/)). `{warp_count}` defines the number of warps along the `M`, `N`, and `K` dimensions; that is, with `{threadblock_shape}=[Tile_M, Tile_N, Tile_K]` and `{warp_count}=[W_M, W_N, W_K]`, the warp tile size is `[Tile_M / W_M, Tile_N / W_N, Tile_K / W_K]`.
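For example, a common Ampere configuration (a sketch; the values are illustrative, not prescriptive) tiles the threadblock as `128x128x32` with four pipeline stages and a `2x2x1` warp layout, giving a `64x64x32` warp tile:
```python
# Illustrative: [128, 128, 32] threadblock tile, 4 stages, 2x2x1 warps.
tile_description = TileDescription(
    [128, 128, 32],  # {threadblock_shape}: [Tile_M, Tile_N, Tile_K]
    4,               # {stages}: software pipeline depth
    [2, 2, 1],       # {warp_count}: warp tile becomes [64, 64, 32]
    math_inst
)
```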
### Operand Description
The operand description defines the data type, layout, and memory alignment of the input tensors A, B, and C. The output D shares the same attributes as C. The description is as follows:
```python
A = TensorDescription(
{element_a}, {layout_a}, {alignment_a}
)
B = TensorDescription(
{element_b}, {layout_b}, {alignment_b}
)
C = TensorDescription(
{element_c}, {layout_c}, {alignment_c}
)
```
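For instance, a half-precision row-major GEMM might describe its operands as follows (a sketch; the alignment of 8 elements corresponds to 128-bit accesses for `float16`):
```python
# Illustrative fp16 operand descriptions with 128-bit (8-element) alignment.
A = TensorDescription(cutlass.float16, cutlass.RowMajor, 8)
B = TensorDescription(cutlass.float16, cutlass.RowMajor, 8)
C = TensorDescription(cutlass.float16, cutlass.RowMajor, 8)
```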
Each operation supports a specific set of layouts and data types for its operands.
### Epilogue Functor
The epilogue functor defines the operation applied to the accumulator and the source tensor C to produce the output D. The supported epilogue functors include:

| Epilogue Functor | Description |
| -- | -- |
| LinearCombinationClamp | $D=\alpha \times Accum + \beta \times C$, with the output clamped to the range of the output data type |
| FastLinearCombinationClamp | $D=\alpha \times Accum + \beta \times C$, only used for problem sizes $K \le 256$ with `cutlass.int8` operands, accumulator data type `cutlass.int32`, and epilogue compute data type `cutlass.float32` |
| LinearCombinationGeneric | $D = activation(\alpha \times Accum + \beta \times C)$, where the available activations include `relu`, `leaky_relu`, `tanh`, `sigmoid`, `silu`, `hardswish`, and `gelu` |
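For the common unclamped case, PyCUTLASS also provides a plain `LinearCombination` functor. A sketch of constructing one, following the pattern in [examples/40_cutlass_py](examples/40_cutlass_py):
```python
# Illustrative: D = alpha * Accum + beta * C, computed in float32.
epilogue_functor = LinearCombination(
    C.element,                      # output element type
    C.alignment,                    # epilogue vector length
    math_inst.element_accumulator,  # accumulator type
    cutlass.float32                 # epilogue compute type
)
```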
We also provide an experimental feature, the "Epilogue Visitor Tree", for GEMM operations. The details can be found in [EpilogueVisitorTree](tools/library/scripts/pycutlass/docs/source/md/EpilogueVisitorTree.md).
### GEMM Operation
The GEMM Operation description can be created with
```python
operation = GemmOperationUniversal(
{compute_capability}, tile_description,
A, B, C, epilogue_functor,
{swizzling_functor}, {visitor}
)
```
* `{compute_capability}` is an integer indicating the compute capability of the GPU. For A100, it is 80.
* `{swizzling_functor}` describes how threadblocks are scheduled on the GPU. This is used to improve L2 locality ([detail](https://developer.nvidia.com/blog/optimizing-compute-shaders-for-l2-locality-using-thread-group-id-swizzling/)). Currently we support `cutlass.{IdentitySwizzle1|IdentitySwizzle2|IdentitySwizzle4|IdentitySwizzle8|BatchedIdentitySwizzle}`; the last one is used for batched or array GEMM.
* `{visitor}` is a bool indicating whether the epilogue visitor tree is used.
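Putting the pieces together, here is a sketch of an SM80 GEMM description; the keyword names follow [examples/40_cutlass_py](examples/40_cutlass_py) and reuse the `tile_description` and `epilogue_functor` sketched above:
```python
# Illustrative SM80 GEMM operation description.
operation = GemmOperationUniversal(
    arch=80, tile_description=tile_description,
    A=A, B=B, C=C,
    epilogue_functor=epilogue_functor,
    swizzling_functor=cutlass.IdentitySwizzle1
)
```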
### GEMM Grouped Operation
The GEMM Grouped Operation description can be created with
```python
operation = GemmOperationGrouped(
compute_capability, tile_description,
A, B, C, epilogue_functor,
swizzling_functor, {precompute_mode}
)
```
* `{precompute_mode}` can be either `SchedulerMode.Host` or `SchedulerMode.Device`. See [examples/24_gemm_grouped](examples/24_gemm_grouped) for more details.
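A minimal sketch of a grouped GEMM description, assuming `SchedulerMode` is exposed by `pycutlass` as in the examples:
```python
# Illustrative SM80 grouped GEMM with device-side scheduler precomputation.
operation = GemmOperationGrouped(
    arch=80, tile_description=tile_description,
    A=A, B=B, C=C,
    epilogue_functor=epilogue_functor,
    swizzling_functor=cutlass.IdentitySwizzle1,
    precompute_mode=SchedulerMode.Device
)
```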
### Conv2d Operation
The Conv2d Operation description can be created with
```python
operation = Conv2dOperation(
{conv_kind}, {iterator_algorithm},
compute_capability, tile_description,
A, B, C, {stride_support},
epilogue_functor, swizzling_functor
)
```
* `{conv_kind}` defines which convolution is executed. Available options include `fprop`, `dgrad`, and `wgrad`.
* `{iterator_algorithm}` specifies the iterator algorithm used by the implicit GEMM in convolution. The options are as follows:
  * `analytic`: functionally correct in all cases, but at lower performance
  * `optimized`: optimized for `R <= 32`, `S <= 32`, and unity-stride dgrad
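A sketch of a forward-propagation Conv2d description; the enum spellings and `StrideSupport.Strided` are taken from the patterns in [examples/40_cutlass_py](examples/40_cutlass_py) and are assumptions, not a confirmed signature:
```python
# Illustrative SM80 forward-propagation convolution description.
operation = Conv2dOperation(
    conv_kind=cutlass.conv.Operator.fprop,
    iterator_algorithm=cutlass.conv.IteratorAlgorithm.optimized,
    arch=80, tile_description=tile_description,
    A=A, B=B, C=C,
    stride_support=StrideSupport.Strided,
    epilogue_functor=epilogue_functor,
    swizzling_functor=cutlass.IdentitySwizzle1
)
```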
## Compilation
Several operations can be compiled together. The `nvcc` at `$CUDA_INSTALL_PATH/bin` is used as the default compiler backend, but you can also switch to [CUDA Python](https://nvidia.github.io/cuda-python/overview.html)'s `nvrtc`.
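A sketch of the compilation flow based on the pattern in [examples/40_cutlass_py](examples/40_cutlass_py); the `nvrtc()` backend switch is an assumption about the compiler object's interface, not a confirmed API:
```python
import pycutlass

# Optional: switch the backend from nvcc to nvrtc (assumed interface).
pycutlass.compiler.nvrtc()

# Compile one or more operation descriptions together into a module.
pycutlass.compiler.add_module([operation])
```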
We also have an internal compiled-artifact manager that caches the compiled kernels both in memory and on disk. The `compiled_cache.db` file in your workspace is the database containing the binary files; you can delete this file if you want to recompile the kernels.
***
## Argument Processing
We provide argument wrappers to convert Python tensors to kernel parameters. Currently, [torch.Tensor](https://pytorch.org/), [numpy.ndarray](https://numpy.org/), and [cupy.ndarray](https://cupy.dev/) are supported.
The epilogue is configured through an `output_op`, which is a list of arguments that starts with the scaling factors `alpha` and `beta`.
The `output_op` of EpilogueVisitorTree is slightly different. Please check [EpilogueVisitorTree](tools/library/scripts/pycutlass/docs/source/md/EpilogueVisitorTree.md) for details.
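As a sketch of the wrapper for GEMM (the argument names follow [examples/40_cutlass_py](examples/40_cutlass_py); the problem size and tensors are illustrative):
```python
import numpy as np

# Illustrative problem size and contiguous numpy operands; the shapes and
# layouts must match the TensorDescriptions given to the operation.
M, N, K = 512, 256, 128
tensor_A = np.random.rand(M, K).astype(np.float16)
tensor_B = np.random.rand(K, N).astype(np.float16)
tensor_C = np.random.rand(M, N).astype(np.float16)
tensor_D = np.zeros_like(tensor_C)

arguments = GemmArguments(
    operation=operation,
    problem_size=cutlass.gemm.GemmCoord(M, N, K),
    A=tensor_A, B=tensor_B, C=tensor_C, D=tensor_D,
    output_op=operation.epilogue_type(1.0, 0.0),  # alpha=1.0, beta=0.0
)
```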
## Kernel Launching
With the arguments and operations, the kernel can be launched simply with
```python
operation.run(arguments)
```
## Sync Results
We also provide a function to synchronize the kernel execution. If you use `numpy`, it also copies the result back to the host. To do that, run
```python
arguments.sync()
```
If you use EpilogueVisitorTree, please call
```python
output_op.sync()
```
## Reduction Kernel behind Parallel Split-K
If you use parallel split-K in GEMM or Conv2d, an additional reduction kernel is required. Please check [examples/40_cutlass_py](examples/40_cutlass_py) for details.
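A sketch of constructing and compiling the reduction kernel, adapted from the pattern in [examples/40_cutlass_py](examples/40_cutlass_py); the exact `ReductionOperation` signature here is our assumption:
```python
# Illustrative: reduction kernel paired with a parallel split-K GEMM.
reduction_operation = ReductionOperation(
    shape=cutlass.MatrixCoord(4, 32 * C.alignment),  # reduction tile shape
    C=C,
    element_accumulator=cutlass.float32,
    element_compute=cutlass.float32,
    epilogue_functor=epilogue_functor,
    count=C.alignment
)
pycutlass.compiler.add_module([reduction_operation])
```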