Fundamentally, a `Tensor` represents a multidimensional array. The `Tensor` abstracts away the details of how the array's elements are organized and how they are stored. This lets users write algorithms that access multidimensional arrays generically and, potentially, specialize those algorithms on a `Tensor`'s traits. For example, the rank of the `Tensor` can be dispatched against, the `Layout` of the data can be inspected, and the type of the data can be verified.
A `Tensor` is represented by two template parameters: `Engine` and `Layout`.
For a description of `Layout`, please refer to [the `Layout` section](./01_layout.md).
The `Tensor` presents the same shape and access operators as the `Layout` and uses the result of the `Layout` computation to
offset and dereference a random-access iterator held by the `Engine`.
That is, the layout of the data is provided by `Layout` and the actual data is provided by the iterator. Such data can live in any kind of memory -- global memory, shared memory, register memory -- or can even be transformed or generated on the fly.
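For example, a non-owning `Tensor` can be created by pairing an existing iterator with a `Layout` via `make_tensor`. A minimal sketch, assuming `gmem_ptr_f32` is a `float*` to global memory with at least 64 elements:
```cpp
// Tag the raw pointer with its memory space and pair it with a Layout;
// the resulting Tensor is a non-owning view of that data
Tensor gmem_8x8 = make_tensor(make_gmem_ptr(gmem_ptr_f32),
                              make_shape(Int<8>{}, Int<8>{}));       // (_8,_8), compact column-major by default

// The same data viewed through a dynamic row-major Layout
Tensor gmem_dyn = make_tensor(make_gmem_ptr(gmem_ptr_f32),
                              make_shape(8, 8), make_stride(8, 1));  // (8,8):(8,1)
```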
Calling `print` on such a `Tensor` displays the pointer type along with any memory space tags, the pointer's `value_type` width, the raw pointer address, and the associated `Layout`.
### Owning Tensors
A `Tensor` can also be an owning array of memory.
Owning `Tensor`s are created by calling `make_tensor<T>`,
where `T` is the type of each element of the array,
with a `Layout` or with arguments from which a `Layout` can be constructed.
The array is allocated analogously to `std::array<T,N>` and, therefore, owning `Tensor`s must be constructed with a `Layout` that has static shapes and static strides.
CuTe does not perform dynamic memory allocation in `Tensor`s as it is not a common or performant operation within CUDA kernels.
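For example, the following declarations (a sketch chosen to be consistent with the printed output shown below) create four `Tensor`s backed by register memory:
```cpp
Tensor rmem_4x8_col  = make_tensor<float>(Shape<_4,_8>{});                  // Compact column-major by default
Tensor rmem_4x8_row  = make_tensor<float>(Shape<_4,_8>{}, LayoutRight{});   // Compact row-major
Tensor rmem_4x8_pad  = make_tensor<float>(Shape <_4,_8>{},
                                          Stride<_32,_2>{});                // Padded row-major strides
Tensor rmem_4x8_like = make_tensor_like(rmem_4x8_pad);                      // Compact, same shape and stride order
```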
The `make_tensor_like` function makes an owning Tensor of register memory with the same value type and shape as its input `Tensor` argument and attempts to use the same order of strides as well.
Calling `print` on each of the above tensors produces output like the following:
```
rmem_4x8_col  : ptr[32b](0x7ff1c8fff820) o (_4,_8):(_1,_4)
rmem_4x8_row  : ptr[32b](0x7ff1c8fff8a0) o (_4,_8):(_8,_1)
rmem_4x8_pad  : ptr[32b](0x7ff1c8fff920) o (_4,_8):(_32,_2)
rmem_4x8_like : ptr[32b](0x7f4158fffc60) o (_4,_8):(_8,_1)
```
Each pointer address is unique, indicating that each `Tensor` is a distinct array-like allocation.
## Accessing a Tensor
Users can access the elements of a `Tensor` via `operator()` and `operator[]`,
which take `IntTuple`s of logical coordinates.
When users access a `Tensor`,
the `Tensor` uses its `Layout` to map the logical coordinate
to an offset that can be accessed by the iterator.
You can see this in `Tensor`'s implementation of `operator[]`.
```c++
template <class Coord>
decltype(auto) operator[](Coord const& coord) {
  return data()[layout()(coord)];
}
```
For example, we can read and write to `Tensor`s using natural coordinates, using the variadic `operator()`, or the container-like `operator[]`.
```c++
Tensor A = make_tensor<float>(Shape <Shape<_4,_5>,_13>{},
                              Stride<Stride<_12,_1>,_64>{});
Tensor B = make_tensor(b_ptr, make_shape(13, 20));

// Fill A via natural coordinates op[]
for (int m0 = 0; m0 < size<0,0>(A); ++m0)
  for (int m1 = 0; m1 < size<0,1>(A); ++m1)
    for (int n = 0; n < size<1>(A); ++n)
      A[make_coord(make_coord(m0,m1),n)] = n + 2 * m0;

// Transpose A into B using variadic op()
for (int m = 0; m < size<0>(A); ++m)
  for (int n = 0; n < size<1>(A); ++n)
    B(n,m) = A(m,n);

// Copy B to A as if they are arrays
for (int i = 0; i < A.size(); ++i)
  A[i] = B[i];
```
## Tiling a Tensor
Many of the [`Layout` algebra operations](https://github.com/NVIDIA/cutlass/blob/main/media/docs/cute/02_layout_algebra.md) can also be applied to `Tensor`s.
```cpp
composition(Tensor, Tiler)
logical_divide(Tensor, Tiler)
zipped_divide(Tensor, Tiler)
tiled_divide(Tensor, Tiler)
flat_divide(Tensor, Tiler)
```
The above operations allow arbitrary subtensors to be "factored out" of `Tensor`s. This is very commonly used in tiling for threadgroups, tiling for MMAs, and reordering tiles of data for threads.
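For example (a sketch, assuming `ptr` is an iterator over at least 24*36 elements):
```cpp
Tensor A   = make_tensor(ptr, make_shape(24, 36));   // (24,36)
auto tiler = Shape<_8,_4>{};                         // 8x4 tiles

Tensor logical_a = logical_divide(A, tiler);         // ((_8,3),(_4,9))
Tensor zipped_a  = zipped_divide (A, tiler);         // ((_8,_4),(3,9))
```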
Note that the `_product` operations are not implemented for `Tensor`s as those would
often produce layouts with increased codomain sizes, which means the `Tensor` would
require accessing elements unpredictably far outside its previous bounds. `Layout`s can be
used in products, but not `Tensor`s.
## Slicing a Tensor
Whereas accessing a `Tensor` with a coordinate will return an element of that tensor,
slicing a `Tensor` will return a subtensor of all the elements in the sliced mode(s).
Slices are performed through the same `operator()`
that is used for accessing an individual element.
Passing in `_` (the underscore character, an instance of the `cute::Underscore` type)
has the same effect as `:` (the colon character) in Fortran or Matlab:
retain that mode of the tensor as if no coordinate had been used.
Slicing a tensor performs two operations:
* the `Layout` is evaluated on the partial coordinate and the resulting offset is accumulated into the iterator -- the new iterator points to the start of the new tensor.
* the `Layout` modes corresponding to `_`-elements of the coordinate are used to construct a new layout.
Together, the new iterator and the new layout construct the new tensor.
```cpp
// ((_3,2),(2,_5,_2)):((4,1),(_2,13,100))
Tensor A = make_tensor(ptr, make_shape (make_shape (Int<3>{},2), make_shape (       2,Int<5>{},Int<2>{})),
                            make_stride(make_stride(       4,1), make_stride(Int<2>{},      13,     100)));
```
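Slicing `A` with partial coordinates then produces subtensors; a sketch, with the resulting layouts noted in the comments (these follow from the slicing rules above):
```cpp
// Keep all of mode 1, fix mode 0 at (1-D) coordinate 2
Tensor B = A(2, _);                  // ((2,_5,_2)):((_2,13,100))

// Keep all of mode 0, fix mode 1 at (1-D) coordinate 5
Tensor C = A(_, 5);                  // ((_3,2)):((4,1))

// Same elements as C, but rank-2 because the coordinate contains two Underscores
Tensor D = A(make_coord(_,_), 5);    // (_3,2):(4,1)
```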
Each of these slices produces a view of a subset of `A`'s elements. Note that tensors `C` and `D` contain the same elements, but have different ranks and shapes due to the use of `_` versus the use of `make_coord(_,_)`. In each case, the rank of the result is equal to the number of `Underscore`s in the slicing coordinate.
## Partitioning a Tensor
To implement generic partitioning of a `Tensor`, we apply composition or tiling followed by slicing. This can be performed in many ways, but we have found three patterns that are particularly useful: inner-partitioning, outer-partitioning, and TV-layout-partitioning.
### Inner and outer partitioning
Let's take a tiled example and look at how we can slice it in useful ways.
```cpp
Tensor A       = make_tensor(ptr, make_shape(8,24));   // (8,24)
Tensor tiled_a = zipped_divide(A, Shape<_4,_8>{});     // Divide into 4x8 tiles: ((_4,_8),(2,3))
```
Suppose that we want to give each threadgroup one of these 4x8 tiles of data. Then we can use our threadgroup coordinate to index into the second mode.
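For example (a sketch, assuming a 2x3 grid of threadgroups indexed by `blockIdx.x` and `blockIdx.y`):
```cpp
// Keep the whole (_4,_8) tile mode and select one tile with the threadgroup coordinate
Tensor cta_a = tiled_a(make_coord(_,_), make_coord(blockIdx.x, blockIdx.y));  // (_4,_8)
```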
We call this an *inner-partition* because it keeps the inner "tile" mode. This pattern of applying a tiler and then slicing out that tile by indexing into the remainder mode is common and has been wrapped into its own function `inner_partition(Tensor, Tiler, Coord)`. You'll often see `local_tile(Tensor, Tiler, Coord)` which is just another name for `inner_partition`. The `local_tile` partitioner is very often applied at the threadgroup level to partition tensors into tiles across threadgroups.
Alternatively, suppose that we have 32 threads and want to give each thread one element of these 4x8 tiles of data. Then we can use our thread to index into the first mode.
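For example (a sketch, assuming 32 threads indexed by `threadIdx.x`):
```cpp
// Fix the (_4,_8) tile mode with the thread index and keep the outer "rest" mode
Tensor thr_a = tiled_a(threadIdx.x, make_coord(_,_));   // (2,3)
```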
We call this an *outer-partition* because it keeps the outer "rest" mode. This pattern of applying a tiler and then slicing into that tile by indexing into the tile mode is common and has been wrapped into its own function `outer_partition(Tensor, Tiler, Coord)`. Sometimes you'll see `local_partition(Tensor, Layout, Idx)`, which is a rank-sensitive wrapper around `outer_partition` that transforms the `Idx` into a `Coord` using the inverse of the `Layout` and then constructs a `Tiler` with the same top-level shape as the `Layout`. This allows the user to specify a row-major, column-major, or arbitrary layout of threads with a given shape and use it to partition a tensor.
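For example, a similar outer-partitioning can be written with `local_partition` (a sketch; the 4x8 `Layout` of threads here is row-major, which changes which element each thread receives but not the shape of the result):
```cpp
// 32 threads arranged as a 4x8 row-major layout across each tile;
// each thread again receives a (2,3) "rest" mode of elements
auto   thr_layout = Layout<Shape<_4,_8>, Stride<_8,_1>>{};
Tensor thr_a_rm   = local_partition(A, thr_layout, threadIdx.x);   // (2,3)
```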
To see how these partitioning patterns are used, see the [introductory GEMM tutorial](./0x_gemm_tutorial.md).
### Thread-Value partitioning
Another common partitioning strategy is called thread-value partitioning. In this pattern, we construct a `Layout` that maps each thread (or any other parallel agent) and each value that thread will receive to a coordinate of the target data. With `composition`, the target data layout is transformed according to this TV-layout, and then we can simply slice into the thread mode of the result with our thread index.
```cpp
// Construct a TV-layout that maps 8 thread indices and 4 value indices
// to 1D coordinates within a 4x8 tensor
// (T8,V4) -> (M4,N8)
auto tv_layout = Layout<Shape <Shape <_2,_4>,Shape <_2,_2>>,
                        Stride<Stride<_8,_1>,Stride<_4,_16>>>{};  // (8,4)

// Construct a 4x8 tensor with any layout
Tensor A = make_tensor<float>(Shape<_4,_8>{}, LayoutRight{});     // (4,8)

// Compose A with the tv_layout to transform its shape and order
Tensor tv = composition(A, tv_layout);                            // (8,4)

// Slice so each thread has 4 values in the shape and order that the tv_layout prescribes
Tensor v  = tv(threadIdx.x, _);                                   // (4)
```
In this example, an arbitrary 4x8 layout of data is composed with a specific 8x4 TV-layout that represents a partitioning pattern. In the composed result, each thread's four values lie along that thread's row, so slicing with the thread index extracts exactly those values. Equivalently, the inverse of the TV-layout maps each 4x8 logical coordinate to the thread id and value id it is assigned to.
Extending this example using the tiling utilities detailed in [the `Layout` algebra section](./02_layout_algebra.md), we can copy an arbitrary subtile of a tensor using almost the same code as above.
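A sketch of how this might look, reusing `tv_layout` from above and assuming 8 threads indexed by `threadIdx.x`; `gmem_ptr`, the dynamic extents `M` and `N` (assumed divisible by 4 and 8, respectively), and `do_something` are placeholders:
```cpp
// A dynamically shaped tensor in global memory
Tensor gmem       = make_tensor(make_gmem_ptr(gmem_ptr), make_shape(M, N));  // (M,N)

// Apply a static 4x8 tiler
Tensor gmem_tiled = zipped_divide(gmem, Shape<_4,_8>{});                     // ((_4,_8),(M/4,N/8))

// This thread's 4 values of tile 0, used to allocate a compatible register fragment
Tensor thr_0 = composition(gmem_tiled(make_coord(_,_), 0), tv_layout)(threadIdx.x, _);
Tensor rmem  = make_fragment_like(thr_0);

// Loop over the tiles, copying this thread's values of each tile into registers
for (int i = 0; i < size<1>(gmem_tiled); ++i) {
  Tensor thr_i = composition(gmem_tiled(make_coord(_,_), i), tv_layout)(threadIdx.x, _);
  copy(thr_i, rmem);
  do_something(rmem);
}
```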
This applies a statically shaped `Tiler` to the global memory `Tensor`, creates a register `Tensor` compatible with this thread's slice of a tile, and then loops over the tiles, copying each slice into registers and calling `do_something`.
## Summary
* `Tensor` is defined as an `Engine` and a `Layout`.
* `Engine` is an iterator that can be offset and dereferenced.
* `Layout` defines the logical domain of the tensor and maps coordinates to offsets.
* Tile a `Tensor` using the same methods for tiling `Layout`s.