"[](https://colab.research.google.com/github/NVIDIA/cutlass/blob/main/examples/python/03_basic_conv2d.ipynb)\n"
"This notebook requires an NVIDIA GPU. If `nvidia-smi` fails, go to Runtime -> Change runtime type -> Hardware accelerator and confirm a GPU is selected."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!#nvidia-smi"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If running on Colab, you will need to install the CUTLASS Python interface. To do so, uncomment the following line and run the cell:"
"We first show you how to run a Conv2d in the forward propagation. To get started, one only needs to provide the tensors declared above to the `cutlass.op.Conv2dFprop` call. This sets up a default Conv2d fprop operation for the given device on which you are running. \n",
"\n",
"Assuming that we are runing on SM80, the default is a Conv2d that leverages FP16 Tensor Core operations.\n",
"\n",
"Calling `plan.run()` will generate the CUTLASS C++ kernel in question, compile it, and run it on the tensors we previously passed in. By setting `print_module` to `true`, the C++ code that is emitted is printed."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Specifying `element_accumulator` is not required if it is the same as `element`\n",
"There are many other ways to construct a plan from `cutlass.op.Conv2dFprop` (e.g., by specifying the types of each operand, by providing representative tensors as input). For more details on these, see the documentation in the `cutlass.op.Conv2dFprop` constructor.\n",
"\n",
"We then compare the output to running the Conv2d using PyTorch. PyTorch use NCHW layout by default, so permutations are required."
"Note that one could use the same kernel just declared for tensors provided by other frameworks beyond PyTorch, such as NumPy."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Declaring and running Conv2d Dgrad and Wgrad\n",
"\n",
"The Python interface also supports declaring and running backward kernels of Conv2d. To begin with, we construct the tensors for the gradient of input, output, and weight."
"The previous examples showed how it is simple to get starting running a default Conv2d kernel in CUTLASS. But, what do you do if you want a bit more control over the parameters to the Conv2d? CUTLASS Python interface exposes mutable parameters that can be set after the `plan` initialization. We summarize these in the table below.\n",
"|`iterator_algorithm`|The iterator algorithm used to access the source operands|\n",
"|`swizzling_stride`|The stride of the threadblock swizzling functor|\n",
"|`split-K`|Partitions the reduction dimension to different threadblocks|"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Tile Description\n",
"\n",
"The `tile_description` defines the tiling size of each threadblock, the warp count along each dimension of the tile, the software pipeline stages, and the instruction size. Under the hood, CUTLASS enumerates the different Conv2d configuration parameters for this kernel from the CUTLASS profiler. The code below shows how one can access the tile descriptions for the kernel (e.g., threadblock and warp shape)."
"Besides tile descriptions enumerated by CUTLASS, the users can also explicitly set the `threadblockshape`, `warp_shape`, `stages`, `instruction_shape`, and `cluster_shape`. If the configuration is invalid, an exception will be raised at `plan.run()` and the detailed compilation error will be stored in `./cutlass_python_compilation_error.txt` for debugging."
"The iterator algorithm describes how sources are loaded from memory. There are some iterator algorithms optimized for specific alignments and input/output channels that have better performance. The table below illustrates the available iterator algorithms.\n",
"|Fprop | \"analytic\" | Functionally correct in all cases but lower performance |\n",
"| | \"optimized\" | Optimized for and requires `R <= 32`, `S<= 32`, and `C % alignment_input == 0`|\n",
"| | \"few_channels\" | optimized for small `C` and requires `C % alignment_input == 0`|\n",
"| | \"fixed_channels\" | optimized for small `C` and requires `C == alignment_input` |\n",
"|Dgrad | \"analytic\" | Functionally correct in all cases but lower performance |\n",
"| | \"optimized\" | Optimzed for and require `R <= 32`, `S<= 32`, `K % alignment_grad_output == 0`, and `C % alignment_weight == 0`|\n",
"|Wgrad | \"analytic\" | Functionally correct in all cases but lower performance |\n",
"| | \"optimized\" | Optimized for and require `K % alignment_grad_output == 0`, and `C % alignment_input == 0`|\n",
"\n",
"By default, the Python interface will automatically propose a suitable iterator algorithm based on the input tensors in `plan.run()`. However, the user can also specify the desired iterator algorithm as follows"
"If the iterator algorithm is invalid for the problem size in `plan.run()`, an exception will be raised."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Swizzling Stride\n",
"The swizzling changes how the tile are mapped to threadblocks to improve the L2 Locality. Given a swizzling stride `N`, the threadblock `(tb_x, tb_y)` computes tile `(tb_x / N, tb_y * N + (tb_x % N))`. Currently, stride values of `1`, `2`, `4`, and `8` are supported for `fprop`, `wgrad`, and `1`, and `4` for `dgrad`. The swizzling stride can be set with:"
"Split-K is usually applied when the Conv2d has small spatial dimensions and large reduction dimension to ensure good utilization. It further partitions the reduction dimension to different threadblocks. The CUTLASS Python interface supports two types of split-K strategies: `Parallel`, and `Serial`. \n",
"* `Parallel`: the partial results from different threadblocks are stored in a temporary buffer in the global memory. When the Conv2d is done, a separate reduction kernel is created and launched to reduce the partial results.\n",
"* `Serial`: A semaphore is used to coordinate the order of different threadblocks adding their partial results to a given output tile. A separate kernel does not need to be launched for prforming the reduction.\n",
"\n",
"While all `fprop`, `dgrad`, and `wgrad` support split-K, here we use `wgrad` as an example. "