This includes tiling and partitioning `Layout`s across other `Layout`s.
In this section, we explain some of these core operations in detail.
## How do I print CuTe objects on host or device?
CuTe comes with different ways to print CuTe objects.
You can print human-readable text,
or you can print LaTeX commands for generating
a beautifully formatted and colored table
describing the CuTe object.
Both of these can be helpful for reasoning about or debugging
layouts, copy atoms, or matrix multiply atoms
(don't worry, we'll explain all of these things in this tutorial).
CuTe's print functions work on either host or device.
Note that on device, printing is expensive.
Even just leaving print code in place on device,
even if it is never called
(e.g., printing in an `if` branch that is not taken at run time),
may generate slower code.
Thus, be sure to remove code that prints on device after debugging.
The following code examples assume that you have a
`using namespace cute;` statement in scope.
### Printing human-readable text
The `cute::print` function has overloads for almost all CuTe types, including Pointers, Layout, Shape, Stride, and Tensors. When in doubt, try calling `print` on it. You might also only want to print on thread 0 of each thread block, or block 0 of the grid. The `thread0()` function returns true only for global thread 0 of the kernel. A typical idiom is to print CuTe objects only on thread 0 of block 0.
```c++
if (thread0()) {
  print(some_cute_object);
}
```
Some algorithms do different things on different threads or blocks,
so you might sometimes need to print on threads or blocks other than zero.
An Underscore is a special type used for array slices. The underscore punctuation `_` is a constant instance of Underscore. It acts like `:` (the colon punctuation) in Python or Fortran array slices. See [`include/cute/underscore.hpp`](../../../include/cute/underscore.hpp).
### Tile
"A Tile is not a Layout, it's a tuple of Layouts or Tiles or Underscores."
See [`include/cute/tile.hpp`](../../../include/cute/tile.hpp).
The algebraic layout operations discussed below are defined on `Layout`s, but `Tile` allows these operations to recurse and to be applied to sublayouts or particular modes of a given Layout. These are referred to as by-mode operations.
See the section on "Logical Divide" to see an example of using `Tile` to extract portions of a row-mode and portions of a column-mode independently.
## Layout definitions and operations
### Layouts are functions from integers (logical 1-D coordinate) to integers (1-D index)
The `for` loop in the above print example shows how CuTe identifies 1-D coordinates with a column-major layout of logical 2-D coordinates. Iterating from `i = 0` to `size(layout)` (which is 6), and indexing into our layout with the single integer coordinate `i`, traverses the layout in column-major fashion, even though this is a row-major layout. You can see this from the output of the `for` loop (0, 3, 1, 4, 2, 5). CuTe calls this index `i` a "1-D coordinate," versus the "natural coordinate," which would be the logical 2-D coordinate.
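The traversal order described above can be reproduced with a small plain C++ sketch. It does not use CuTe: `Layout2D`, `layout_size`, `index_of`, and `traverse6` are hypothetical stand-ins, and the layout is assumed to be the row-major `(2,3):(3,1)`, which is consistent with the output quoted above.

```c++
#include <array>
#include <cassert>

// Plain C++ stand-in for a rank-2 layout. This is NOT the CuTe Layout type;
// all names in this sketch are hypothetical.
struct Layout2D {
  int rows, cols;              // shape  (M, N)
  int row_stride, col_stride;  // stride (dM, dN)
};

// Size of the layout's domain: the product of its extents.
inline int layout_size(Layout2D const& l) { return l.rows * l.cols; }

// Map a 1-D logical coordinate to an index. The 1-D coordinate is first
// unpacked into a 2-D natural coordinate in column-major order (the row
// coordinate varies fastest), and only then are the strides applied.
inline int index_of(Layout2D const& l, int i) {
  int m = i % l.rows;  // row coordinate
  int n = i / l.rows;  // column coordinate
  return m * l.row_stride + n * l.col_stride;
}

// Collect the indices visited by i = 0, ..., 5 for a 6-element layout.
inline std::array<int, 6> traverse6(Layout2D const& l) {
  assert(layout_size(l) == 6);
  std::array<int, 6> visited{};
  for (int i = 0; i < 6; ++i) {
    visited[i] = index_of(l, i);
  }
  return visited;
}
```

Calling `traverse6(Layout2D{2, 3, 3, 1})` visits the indices in the order 0, 3, 1, 4, 2, 5: the strides are row-major, but the 1-D coordinate still walks the logical 2 x 3 shape column by column.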
If you're familiar with the C++23 feature `mdspan`,
this is an important difference between
`mdspan` layout mappings and CuTe `Layout`s.
`mdspan` layout mappings are *one way*:
they always take a multidimensional logical coordinate,
and they return an integer offset.
Depending on the strides,
the offset may skip over elements of the physical 1-D array.
Thus, `mdspan`'s offset does NOT mean the same thing as
the 1-D logical coordinate `i` in the `for` loop above.
You can iterate correctly over any CuTe `Layout`
by using the 1-D logical coordinate.
`mdspan` doesn't have an idea of a 1-D logical coordinate.
### Rank, depth, size, cosize
*Rank*: the tuple size of the layout's shape.
*Depth*: the depth of the layout's shape. A single integer has depth 0. A tuple has depth 1 + the max depth of its components.
*Size*: Size of the shape; size of the domain of the function. This is the product of all extents in the layout's shape.
*Cosize*: Size of the function's codomain (not necessarily the range); for a layout A, A(size(A) - 1) + 1. (Here, we use size(A) - 1 as a 1-D logical coordinate input.)
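As a rough illustration of rank, size, and cosize, the following plain C++ sketch evaluates them for flat (non-nested) layouts. `FlatLayout` and the helper functions are hypothetical names, not CuTe's API; in CuTe itself these quantities come from `rank`, `size`, and `cosize`.

```c++
#include <cassert>
#include <cstddef>
#include <vector>

// Flat (non-nested) layout stored as parallel vectors of extents and strides.
// A hypothetical stand-in for illustration, not CuTe's Layout type.
struct FlatLayout {
  std::vector<int> shape;
  std::vector<int> stride;
};

// Rank: the number of modes in the shape.
inline int rank_of(FlatLayout const& l) {
  return static_cast<int>(l.shape.size());
}

// Size: the product of all extents, i.e. the size of the domain.
inline int size_of(FlatLayout const& l) {
  int s = 1;
  for (int extent : l.shape) s *= extent;
  return s;
}

// Apply the layout to a 1-D logical coordinate: unpack it mode by mode,
// leftmost mode fastest, and accumulate coordinate * stride.
inline int apply(FlatLayout const& l, int i) {
  int index = 0;
  for (std::size_t k = 0; k < l.shape.size(); ++k) {
    index += (i % l.shape[k]) * l.stride[k];
    i /= l.shape[k];
  }
  return index;
}

// Cosize: A(size(A) - 1) + 1, using size(A) - 1 as a 1-D coordinate.
inline int cosize_of(FlatLayout const& l) {
  return apply(l, size_of(l) - 1) + 1;
}
```

For example, `6:4` has size 6 but cosize 21, and `(4,5):(2,8)` has rank 2, size 20, and cosize 39, which shows how the codomain can be larger than the number of elements when strides skip over indices.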
### Layout compatibility
We say that layouts A and B are *compatible* if their shapes are compatible. Shape A is compatible with shape B if any natural coordinate of A is also a valid coordinate for B.
### Flatten
The `flatten` operation "un-nests" a potentially nested Layout. For example,
```c++
Layout layout = Layout<Shape<Shape<_4,_3>, _1>,
                       Stride<Stride<_3,_1>, _0>>{};
Layout flat_layout = flatten(layout);
```
results in `flat_layout` having the following type
```text
Layout<Shape<_4,_3,_1>, Stride<_3,_1,_0>>
```
and
```c++
Layout layout = Layout<Shape<_4,Shape<_4,_2>>,
                       Stride<_4,Stride<_1,_16>>>{};
Layout flat_layout = flatten(layout);
```
results in `flat_layout` having the following type
```text
Layout<Shape<_4,_4,_2>, Stride<_4,_1,_16>>
```
Hierarchical Layouts and flattening let us reinterpret tensors in place as matrices, matrices as vectors, vectors as matrices, etc. This lets us implement arbitrary tensor contractions as batched matrix multiply, by combining the contraction modes into a single mode, and combining the A, B, C, and "batch" modes as needed to reach the desired form.
### Coalesce
The `coalesce` operation first flattens the layout, then combines all the modes that are possible to combine, starting with mode 0 (the leftmost mode) and moving right. If all the modes can be combined, then this results in a 1-D layout expressing what array elements the original layout accesses.
For example,
```text
layout: (_2,(_1,_6)):(_1,(_6,_2))
coalesce(layout): _12:_1
```
What does it mean to "combine" modes? In the above example, the flattened layout is (2, 1, 6) : (1, 6, 2).
1. If we look at the leftmost two modes, this is just a vector of length 2 and stride 1. The middle mode has extent 1, so the corresponding stride 6 would not be observed anyway. This leaves us with (2, 6) : (1, 2).
2. The intermediate result (2, 6) : (1, 2) is just a 2 x 6 column-major matrix, which can be coalesced into a vector of length 12 and stride 1.
More formally, "combining all the modes" means a left fold, where the binary operation that combines two modes has three cases.
1. If the leftmost layout is s1:d1, and the next layout is 1:d0, then combine into s1:d1. This generalizes Step 1 above. If a mode has extent 1, we can't observe its stride, so we can skip the mode.
2. If the leftmost layout is 1:d1, and the next layout is s0:d0, then combine into s0:d0. Again, if a mode has extent 1, we can't observe its stride, so we can skip the mode.
3. If the leftmost layout is s1:d1, and the next layout is s0 : s1*d1, then combine into s0 * s1 : d1. This generalizes Step 2 above. One can call this "noticing a column-major layout sequence."
That's it! For example, the result of coalescing the row-major layout (2, 2) : (2, 1) is (2, 2) : (2, 1), the same layout, because none of the above three cases applies.
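The left fold can be sketched in plain C++ on an already flattened layout. This is only an illustration of the three cases, not CuTe's implementation: `Mode` and `coalesce_flat` are hypothetical names, extents and strides are run-time integers, and returning `1:0` when every mode has extent 1 is an assumption of this sketch.

```c++
#include <cassert>
#include <utility>
#include <vector>

// One mode of a flattened layout: (extent, stride). Hypothetical helper type.
using Mode = std::pair<int, int>;

// Left fold over the modes of a flattened layout, leftmost mode first.
inline std::vector<Mode> coalesce_flat(std::vector<Mode> const& modes) {
  std::vector<Mode> result;
  for (Mode const& next : modes) {
    // Cases 1 and 2: a mode of extent 1 has an unobservable stride, so skip it.
    if (next.first == 1) continue;
    if (result.empty()) {
      result.push_back(next);
      continue;
    }
    Mode& last = result.back();
    if (next.second == last.first * last.second) {
      // Case 3: s1:d1 followed by s0:(s1*d1) merges into (s0*s1):d1.
      last.first *= next.first;
    } else {
      // No case applies, so the mode is kept as-is.
      result.push_back(next);
    }
  }
  // Assumption of this sketch: a layout with only extent-1 modes becomes 1:0.
  if (result.empty()) result.push_back(Mode{1, 0});
  return result;
}
```

With the flattened input `(2, 1, 6) : (1, 6, 2)` this yields the single mode `12:1`, while the row-major `(2, 2) : (2, 1)` comes back unchanged.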
### Complement
#### Definition
The complement $B$ of a layout $A$ with respect to an integer $M$ satisfies the following properties.
1. $A$ and $B$ are *disjoint*: $A(x) \neq B(x)$ for all $x \neq 0$ in the domain of $A$.
2. $B$ is *ordered*: $B(x-1) < B(x)$ for all $x$ in $\{1, \dots, size(B) - 1\}$.
3. $B$ is *bounded* by $M$: $size(B) \geq M / size(A)$, and $cosize(B) \leq floor(M / cosize(A)) * cosize(A)$.
Regarding disjointness: we need to specify $x \neq 0$ because CuTe layouts are linear. That is, if the domain is nonempty, the range always contains zero.
Regarding the ordered property: CuTe layouts are hierarchically strided, so this implies that if size(B) is nonzero, then the strides of B are all positive.
#### Examples
complement(4:1, 24) is 6:4.
1. The result is disjoint from 4:1, so it must have a stride of at least 4 (since it includes 0, but must skip over 1, 2, 3).
2. The size of the result is $\geq 24 / 4 = 6$. (This plus Step (1) means that the cosize is at least $5 * 4 + 1 = 21$.)
3. The cosize of the result is $\leq floor(24 / 4) * 4 = 24$. (This rules out a larger stride or a larger size: 6:5 would have cosize 26, and 7:4 would have cosize 25.)
4. The only (1-D) layout with size at least 6, stride at least 4, and cosize at most 24 is 6:4, whose cosize is 21.
complement(6:4, 24) is 4:1.
1. 4:1 is disjoint from 6:4, but so is s:d
   for any s > 0 and d > 20.
2. The size of the result is $\geq 24 / 6 = 4$.
3. The cosize of the result is $\leq floor(24 / 21) * 21 = 21$ (the cosize of 6:4 is 21).
4. The stride cannot be greater than 20
(else (2) would contradict (3)),
so it must be less than 4.
5. This leaves 4:1 by elimination.
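The three properties can be checked by brute force for small 1-D layouts. The sketch below is plain C++ with hypothetical names (`Layout1D`, `is_complement`); it is not how CuTe computes `complement`. It also reads "disjoint" as "the images of $A$ and $B$ share no value other than 0", which is the interpretation the worked examples rely on.

```c++
#include <cassert>

// A 1-D layout size:stride. Hypothetical stand-in, not CuTe's Layout type.
struct Layout1D {
  int size;
  int stride;
};

inline int apply(Layout1D l, int x) { return x * l.stride; }

// Cosize as defined earlier: L(size(L) - 1) + 1.
inline int cosize(Layout1D l) { return apply(l, l.size - 1) + 1; }

// Check whether b satisfies the disjoint, ordered, and bounded properties
// with respect to a and m. This verifies the properties only; it does not
// claim that b is the unique layout satisfying them.
inline bool is_complement(Layout1D a, Layout1D b, int m) {
  // Disjoint (assumed reading): no nonzero input of a lands on a value of b.
  for (int x = 1; x < a.size; ++x) {
    for (int y = 0; y < b.size; ++y) {
      if (apply(a, x) == apply(b, y)) return false;
    }
  }
  // Ordered: b is strictly increasing.
  for (int y = 1; y < b.size; ++y) {
    if (!(apply(b, y - 1) < apply(b, y))) return false;
  }
  // Bounded: size(b) >= m / size(a), written without division.
  if (b.size * a.size < m) return false;
  // Bounded: cosize(b) <= floor(m / cosize(a)) * cosize(a).
  if (cosize(b) > (m / cosize(a)) * cosize(a)) return false;
  return true;
}
```

Both examples above pass this check, with `is_complement({4, 1}, {6, 4}, 24)` and `is_complement({6, 4}, {4, 1}, 24)` returning true, while candidates such as 6:2 (not disjoint from 4:1) or 6:5 (cosize 26) are rejected for 4:1.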
### Composition
Layouts are functions, so composition of layouts is just composition of functions. The composition $A \circ B$ means "apply the layout B first, then treat the result as a 1-D logical coordinate input to the layout A, and apply A to it." Very often, this composition can be represented as another Layout.
#### Rules for computing composition
Both humans and CuTe compute composition using the following rules.
1. $A \circ B$ has a shape that is compatible with B. In function composition, the rightmost function defines the domain. For `Layout`s this means that any valid coordinate for $B$ can also be used as a coordinate for $A \circ B$.
2. Concatenation: A layout can be expressed as the concatenation of its sublayouts. We denote concatenation with parentheses: $B = (B_0,B_1,...)$. The CuTe function `make_layout`, when given zero or more `Layout`s, concatenates them.
3. Composition is (left-)distributive with concatenation: $A \circ B = A \circ (B_0, B_1, ...) = (A \circ B_0, A \circ B_1, ...)$.
4. "Base case": For layouts $A = a : b$ and $B = c : d$ with integral shape and stride, $A \circ B = R = c : (b * d)$.
5. By-mode composition: Let $\langle B, C \rangle$ (angle brackets, not parentheses)
   denote a tuple of two layouts $B$ and $C$, not their concatenation. Let $A = (A_0, A_1)$.
   Then $A \circ \langle B, C \rangle = (A_0, A_1) \circ \langle B, C \rangle = (A_0 \circ B, A_1 \circ C)$.
   This allows the application of composition independently to sublayouts of $A$.
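Rules 3 and 4 can be exercised with a short plain C++ sketch. It is deliberately restricted: $A$ is a single mode `a:b`, $B$ is a flat list of modes, and each mode of $B$ is assumed to stay inside the domain of $A$ (checked by an assertion). `Mode`, `compose_base`, `compose`, and `apply` are hypothetical names, not CuTe functions; in CuTe the operation is `composition`.

```c++
#include <cassert>
#include <cstddef>
#include <vector>

// One mode extent:stride. Hypothetical stand-in, not CuTe's Layout type.
struct Mode {
  int extent;
  int stride;
};

// Rule 4 (base case): (a:b) o (c:d) = c:(b*d).
// Assumption of this sketch: the largest value produced by c:d is a valid
// 1-D coordinate of a:b.
inline Mode compose_base(Mode a, Mode b) {
  assert((b.extent - 1) * b.stride < a.extent);
  return Mode{b.extent, a.stride * b.stride};
}

// Rule 3: composition distributes over the concatenated modes of B.
inline std::vector<Mode> compose(Mode a, std::vector<Mode> const& b) {
  std::vector<Mode> result;
  for (Mode const& bk : b) {
    result.push_back(compose_base(a, bk));
  }
  return result;
}

// Apply a flat layout to a 1-D logical coordinate (leftmost mode fastest).
inline int apply(std::vector<Mode> const& layout, int i) {
  int index = 0;
  for (std::size_t k = 0; k < layout.size(); ++k) {
    index += (i % layout[k].extent) * layout[k].stride;
    i /= layout[k].extent;
  }
  return index;
}
```

For instance, composing `20:2` with `(4,5):(1,4)` this way gives `(4,5):(2,8)`, and applying the result to each `i` in `[0, 20)` agrees with applying `(4,5):(1,4)` first and `20:2` second.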
#### Examples: Reshape a vector into a matrix
This section gives two composition examples. Both start with a vector with layout $20:2$ (that is, the vector has 20 elements, and the stride between each is 2). They compose this vector with a 4 x 5 matrix layout. This effectively "reshapes" the vector in place into a matrix.
##### Example 1
$20:2 \circ (4,5) : (1,4)$.
This describes interpreting the vector $20:2$
as a 4 x 5 column-major matrix.
The resulting layout has shape $(4,5)$,
because in function composition,
the rightmost function defines the domain.
What are the strides?
1. A layout can be expressed as the concatenation of its sublayouts,
The row and column labels use the equivalence of 1-D logical coordinates and 2-D column-major coordinates. The left index in each pair is the row resp. column coordinate of the tile, while the right index in each pair is the row resp. column coordinate of the matrix-of-tiles. The resulting layout has Shape ((2, 3), (2, 4)), and Stride ((1, 16), (2, 4)), and the second mode can be coalesced. The Shape ((2, 3), (2, 4)) is hierarchical, but it is still rank-2 and can be drawn in 2D as above. Note how the row mode of the tile remains part of the row mode of the product, and the column mode of the tile remains a column mode of the product.
The above layout is what `blocked_product(tile, matrix_of_tiles)` produces.
A critical use case for blocked product is "tiling" an "atom"
(some tile that relates to a hardware feature) over a matrix.
The `tiled_divide` function works like `zipped_divide`,
except that it unpacks the second mode. This is useful when you have a `Tile` that describes all of the elements for a particular operation, for example, and want to gather those together but retain the logical shape of those tiles within the original layout. That is,
```text
Layout Shape : (M, N, L, ...)
Tile Shape   : <M', N'>
Tiled Result : ((M', N'), m, n, L, ...)
```
where `m` is `M / M'` and `n` is `N / N'`.
We can consider `m` as the "number of `Tile`s in `M`" and `n` as the "number of `Tile`s in `N`". This style of operation is common when applying MMA Atoms and Copy Atoms.
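The shape bookkeeping above can be sketched in plain C++. This computes only the extents of the tiled result, not its strides, and assumes the tile extents divide `M` and `N` evenly. `tiled_result_extents` is a hypothetical helper, not a CuTe function, and it returns the result flattened as `{M', N', m, n, L, ...}`, where the first two entries together form the tile mode `(M', N')`.

```c++
#include <cassert>
#include <cstddef>
#include <vector>

// Extents of the tiled result ((M', N'), m, n, L, ...), flattened into one
// vector as {M', N', m, n, L, ...}. Hypothetical helper for illustration.
inline std::vector<int> tiled_result_extents(std::vector<int> const& layout_shape,
                                             int tile_m, int tile_n) {
  assert(layout_shape.size() >= 2);
  int M = layout_shape[0];
  int N = layout_shape[1];
  // Assumption of this sketch: the tile divides the layout evenly.
  assert(M % tile_m == 0 && N % tile_n == 0);

  std::vector<int> result;
  result.push_back(tile_m);      // M'
  result.push_back(tile_n);      // N'
  result.push_back(M / tile_m);  // m: number of tiles in M
  result.push_back(N / tile_n);  // n: number of tiles in N
  // Remaining modes (L, ...) pass through unchanged.
  for (std::size_t k = 2; k < layout_shape.size(); ++k) {
    result.push_back(layout_shape[k]);
  }
  return result;
}
```

For a layout of shape `(8, 12, 3)` and a tile of shape `<2, 4>`, this gives the extents `((2, 4), 4, 3, 3)`.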