# PyCUTLASS: CUTLASS Python Interface

PyCUTLASS is a Python interface to the CUTLASS C++ template library. PyCUTLASS takes user-defined operation descriptions, emits C++ code, and compiles it with `nvcc` or `nvrtc`. It also provides wrappers for user-provided arguments from [numpy](https://numpy.org/), [torch](https://pytorch.org/), and [cupy](https://github.com/cupy/cupy) and encodes them into the kernel's parameters.

```python
import pycutlass
from pycutlass import *
import torch

pycutlass.get_memory_pool(2**8, 2**32)

math_inst = MathInstruction(
    [1, 1, 1], cutlass.float32, cutlass.float32, cutlass.float32,
    cutlass.OpClass.Simt, MathOperation.multiply_add
)

tile_description = TileDescription(
    [128, 128, 8], 4, [2, 4, 1],
    math_inst
)

A = TensorDescription(
    cutlass.float32, cutlass.RowMajor, 1
)

B = TensorDescription(
    cutlass.float32, cutlass.RowMajor, 1
)

C = TensorDescription(
    cutlass.float32, cutlass.RowMajor, 1
)

epilogue_functor = LinearCombination(cutlass.float32, 1, cutlass.float32, cutlass.float32)

operation = GemmOperationUniversal(
    arch=80, tile_description=tile_description,
    A=A, B=B, C=C, 
    epilogue_functor=epilogue_functor, 
    swizzling_functor=cutlass.IdentitySwizzle1
)

pycutlass.compiler.add_module([operation,])

problem_size = cutlass.gemm.GemmCoord(512, 256, 128)

tensor_A = torch.ceil(torch.empty(size=(problem_size.m(), problem_size.k()), dtype=torch.float32, device="cuda").uniform_(-8.5, 7.5))
tensor_B = torch.ceil(torch.empty(size=(problem_size.k(), problem_size.n()), dtype=torch.float32, device="cuda").uniform_(-8.5, 7.5))
tensor_C = torch.ceil(torch.empty(size=(problem_size.m(), problem_size.n()), dtype=torch.float32, device="cuda").uniform_(-8.5, 7.5))
tensor_D = torch.empty_like(tensor_C)


alpha = 1.0
beta = 0.0

arguments = GemmArguments(
    operation=operation, problem_size=problem_size,
    A=tensor_A, B=tensor_B, C=tensor_C, D=tensor_D,
    output_op=operation.epilogue_type(alpha, beta),
    gemm_mode=cutlass.gemm.Mode.Gemm, split_k_slices=1
)

operation.run(arguments)

arguments.sync()

tensor_D_ref = alpha * tensor_A @ tensor_B + beta * tensor_C

assert torch.equal(tensor_D, tensor_D_ref)
```
PyCUTLASS also provides infrastructure for profiling, compiled artifact management, and pooled memory management.
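
The exact `torch.equal` check in the example above works because of how the inputs are generated: `torch.ceil` applied to samples from `uniform_(-8.5, 7.5)` yields integers in [-8, 8], so every product and partial sum in the GEMM is an integer. The sketch below works through that arithmetic in plain Python, independent of PyCUTLASS. The helper name `max_abs_accumulator` is illustrative only and not part of any library. It checks that the largest possible accumulator magnitude stays below 2**24, the limit up to which float32 represents every integer exactly.

```python
import math


def max_abs_accumulator(k: int, max_abs_input: int) -> int:
    """Upper bound on |sum of k products| when every input is an integer
    with magnitude at most max_abs_input."""
    return k * max_abs_input * max_abs_input


# ceil() of a sample from [-8.5, 7.5) lands on an integer in [-8, 8].
max_abs_input = max(abs(math.ceil(-8.5)), abs(math.ceil(7.4)))

# K dimension of GemmCoord(512, 256, 128) in the example above.
k = 128

bound = max_abs_accumulator(k, max_abs_input)

# float32 has a 24-bit significand, so all integers up to 2**24 are exact.
FLOAT32_EXACT_INT_LIMIT = 2 ** 24

print(bound, bound < FLOAT32_EXACT_INT_LIMIT)
```

With non-integer inputs the summation order would affect rounding, and a tolerance-based comparison such as `torch.allclose` would likely be the safer choice.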

## Supported Features
PyCUTLASS currently supports the following operations:
* GEMM with mode {Serial, Parallel Split K, Batched GEMM, Array GEMM}, op class {SIMT, TensorCore}, data type {int8, f16, bf16, f32, f64}, layout {RowMajor, ColumnMajor, Row/ColumnMajorInterleaved<32> for int8}, math operation {MultiplyAdd, MultiplyAddFastF16, MultiplyAddFastBF16, MultiplyAddFastF32}, swizzling functions {IdentitySwizzle<1,2,4,8>, HorizontalSwizzle, BatchedIdentitySwizzle}, and epilogue {LinearCombination, LinearCombinationClamp}.
* Grouped GEMM with op class {SIMT, TensorCore}, data type {int8, f16, bf16, f32, f64}, layout {RowMajor, ColumnMajor}, math operation {MultiplyAdd, MultiplyAddFastF16, MultiplyAddFastBF16, MultiplyAddFastF32}, scheduling mode {Host, Device}, and epilogue {LinearCombination, LinearCombinationClamp}.
* Conv2d with {Fprop, Dgrad, Wgrad}, op class {SIMT, TensorCore}, data type {int8, f16, bf16, f32, f64}, layout {TensorNHWC, TensorNC32HW32 and TensorC32RSK for int8}, math operation {MultiplyAdd, MultiplyAddFastF16, MultiplyAddFastBF16, MultiplyAddFastF32}, split-k mode {Parallel, Serial}, and epilogue {LinearCombination, LinearCombinationClamp}.

The tile sizes of the above operations can also be customized.
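
As a rough illustration of what customizing the tile involves, the sketch below works through the arithmetic implied by the `TileDescription([128, 128, 8], 4, [2, 4, 1], math_inst)` call in the example at the top. It assumes the first list is the threadblock tile (M, N, K) and the last list is the number of warps along each dimension. The helper `warp_tile` is hypothetical and uses no PyCUTLASS API.

```python
from typing import List

WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def warp_tile(threadblock_shape: List[int], warp_count: List[int]) -> List[int]:
    """Split a threadblock tile evenly among warps along M, N, and K."""
    for extent, warps in zip(threadblock_shape, warp_count):
        if extent % warps != 0:
            raise ValueError(f"{extent} is not divisible by {warps} warps")
    return [extent // warps for extent, warps in zip(threadblock_shape, warp_count)]


threadblock_shape = [128, 128, 8]
warp_count = [2, 4, 1]

tile = warp_tile(threadblock_shape, warp_count)
threads = WARP_SIZE * warp_count[0] * warp_count[1] * warp_count[2]

print(tile)     # per-warp tile along M, N, K
print(threads)  # threads per threadblock
```

This helper only rejects tiles that do not divide evenly among the warps. Whether a given combination actually compiles and fits in shared memory depends on the kernel and the target architecture.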

## Installation

### Using Docker
You can run PyCUTLASS in the NGC PyTorch container.
```shell
docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:22.09-py3
```

### Environment variables
PyCUTLASS requires two environment variables:
* `CUTLASS_PATH`: the root directory of CUTLASS. From the directory into which you cloned CUTLASS, you can set it via: `export CUTLASS_PATH=$(pwd)`.
* `CUDA_INSTALL_PATH`: the directory where the CUDA toolkit is installed. If running in bash with `nvcc` installed under a CUDA toolkit, you can set this to the location of your `nvcc` installation via: `export CUDA_INSTALL_PATH=$(which nvcc | awk -F'/bin/nvcc' '{print $1}')`

After setting these two environment variables, PyCUTLASS can be installed with
```shell
cd $CUTLASS_PATH/tools/library/scripts/pycutlass && bash build.sh
```
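
If the build cannot locate CUTLASS or the CUDA toolkit, it may help to sanity-check the two variables first. The following is a minimal sketch using only the Python standard library. The function `check_environment` is a hypothetical helper, not part of PyCUTLASS, and the `bin/nvcc` layout is an assumption inferred from the `awk` command above.

```python
import os
from pathlib import Path
from typing import List, Mapping

REQUIRED_VARS = ("CUTLASS_PATH", "CUDA_INSTALL_PATH")


def check_environment(env: Mapping[str, str]) -> List[str]:
    """Return a list of human-readable problems; empty if nothing looks wrong."""
    problems = []
    for name in REQUIRED_VARS:
        value = env.get(name)
        if not value:
            problems.append(f"{name} is not set")
        elif not Path(value).is_dir():
            problems.append(f"{name}={value} is not a directory")

    # Assumption: nvcc lives at $CUDA_INSTALL_PATH/bin/nvcc.
    cuda_root = env.get("CUDA_INSTALL_PATH")
    if cuda_root and Path(cuda_root).is_dir():
        if not (Path(cuda_root) / "bin" / "nvcc").exists():
            problems.append(f"no nvcc found under {cuda_root}/bin")
    return problems


for problem in check_environment(os.environ):
    print("warning:", problem)
```

The function takes the environment as a mapping rather than reading `os.environ` directly, so it can be exercised with a plain dictionary.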

## Examples
Examples can be found in [$CUTLASS_PATH/examples/40_cutlass_py](examples/40_cutlass_py)

## Test
The test cases are located in `$CUTLASS_PATH/tools/library/scripts/pycutlass/test`. The unit tests can be run with
```shell
# Each of these tests is only supported on devices with compute capability SM80. For other devices,
# see the basic examples in $CUTLASS_PATH/examples/40_cutlass_py
cd $CUTLASS_PATH/tools/library/scripts/pycutlass/test/unit && python test_sm80.py
cd $CUTLASS_PATH/tools/library/scripts/pycutlass/test/example && bash run_all_example.sh
```

## Build Documentation
Run
```shell
bash build_doc.sh
```


## Troubleshooting

### Issue 1: permission denied
Building PyCUTLASS requires installing dependencies into your Python environment. If you lack the required permissions, a conda environment could be an option.

### Issue 2: rmm: module not found
PyCUTLASS manages device memory with [RMM](https://github.com/rapidsai/rmm). Our `build.sh` automatically pulls the [rmm branch-22.08](https://github.com/rapidsai/rmm/tree/branch-22.08) from GitHub and builds it from source. RMM is located at `$CUTLASS_PATH/tools/library/scripts/pycutlass/rmm`. It requires `cmake > 3.20.1`. If the build fails, it can be fixed manually with the following steps:
```shell
cd $CUTLASS_PATH/tools/library/scripts/pycutlass/rmm && ./build.sh librmm rmm

cd $CUTLASS_PATH/tools/library/scripts/pycutlass/rmm/python
python setup.py build_ext --inplace
python setup.py install
```
To test whether rmm is successfully installed, try `import rmm`. For other issues related to rmm, please check https://github.com/rapidsai/rmm/issues.
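
Both checks mentioned in this section (the `cmake > 3.20.1` requirement and whether `rmm` can be imported) can be scripted. The sketch below is one possible way to do so with the standard library. The helper names are hypothetical, and the parser assumes `cmake --version` prints a line of the form `cmake version X.Y.Z`.

```python
import importlib.util
import re
import shutil
import subprocess
from typing import Optional, Tuple

MIN_CMAKE = (3, 20, 1)  # this README states `cmake > 3.20.1`


def parse_cmake_version(output: str) -> Optional[Tuple[int, ...]]:
    """Extract (major, minor, patch) from the text printed by `cmake --version`."""
    match = re.search(r"cmake version (\d+)\.(\d+)\.(\d+)", output)
    return tuple(int(part) for part in match.groups()) if match else None


def cmake_is_new_enough(output: str) -> bool:
    """True if the reported version is strictly greater than MIN_CMAKE."""
    version = parse_cmake_version(output)
    return version is not None and version > MIN_CMAKE


def rmm_is_importable() -> bool:
    """True if Python can locate the rmm package, without importing it."""
    return importlib.util.find_spec("rmm") is not None


cmake = shutil.which("cmake")
if cmake is None:
    print("cmake not found on PATH")
else:
    result = subprocess.run([cmake, "--version"], capture_output=True, text=True)
    print("cmake new enough:", cmake_is_new_enough(result.stdout))
print("rmm importable:", rmm_is_importable())
```

Using `find_spec` rather than `import rmm` only tells you that the package can be found. A real import can still fail, for example if a shared library is missing.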