# Epilogue Visitor Tree

The Epilogue Visitor Tree is an experimental feature that directly generates epilogues from user-provided Python functions.

## Usage

The Epilogue Visitor Tree supports many different operations.

### Unary functions

The Epilogue Visitor Tree supports unary functions such as activation functions. For example,
```python
class UnaryEpilogue_(EpilogueVisitTree):
    def __call__(
            self, accum: 'tensor', c: 'tensor', alpha: 'scalar', beta: 'scalar'):
        #
        T = leaky_relu.numpy(accum, 0.2)
        Z = alpha * T + beta * c
        return Z

epilogue_functor = UnaryEpilogue_(
    epilogue_functor, tile_description, math_inst.element_accumulator,
    C.alignment, element_epilogue, C.element)
```

### Broadcast Operation

The Epilogue Visitor Tree supports broadcasting row and column vectors to the whole output matrix. To use broadcast, you only need to specify whether the source vector is a `row` vector or a `column` vector. Here is an example.
```python
class ColumnBroadcast_(EpilogueVisitTree):
    def __call__(
            self, accum: 'tensor', c: 'tensor', vector: 'column',
            alpha: 'scalar', beta: 'scalar'):
        #
        T = accum + vector
        scale_T = leaky_relu.numpy(alpha * T, 0.2)
        Z = scale_T + beta * c
        return Z, T

epilogue_functor = ColumnBroadcast_(
    epilogue_functor, tile_description, math_inst.element_accumulator,
    C.alignment, element_epilogue, C.element)
```

### Reduction Operation

The Epilogue Visitor Tree also supports row-wise and column-wise reduction within each threadblock tile. The syntax for reduction is
```python
{reduction_output} = reduction_op({input_tensor}, {row|column}, {Add}, {threadblock_shape.n|threadblock_shape.m})
```
`{row|column}` indicates whether the `row` vectors or the `column` vectors are reduced. `{Add}` specifies the reduction operation. `{threadblock_shape.n|threadblock_shape.m}` is the reduction length: `threadblock_shape.n` for a `row` reduction and `threadblock_shape.m` for a `column` reduction.

**Constraints**

* The `{input_tensor}` can only be the name of a source or an intermediate result.
  `reduction_op(A + B, ...)` will not work; please use `C = A + B` followed by `reduction_op(C, ...)` instead.
* The `{reduction_output}` cannot be used elsewhere in the epilogue. It is written directly to global memory after the reduction is done.

```python
class RowReduction_(EpilogueVisitTree):
    def __call__(
            self, accum: 'tensor', c: 'tensor', alpha: 'scalar', beta: 'scalar'):
        #
        D = alpha * accum + tanh.numpy(beta * c)
        reduction = reduction_op(D, "row", "Add", args.threadblock_shape[1])
        return D, reduction

epilogue_functor = RowReduction_(
    epilogue_functor, tile_description, math_inst.element_accumulator,
    C.alignment, element_epilogue, C.element)
epilogue_functor.initialize()
```

## Get output_op

As shown in the user guide, an `output_op` is required by the argument wrapper. We take `RowReduction_` as an example to show how to get the `output_op`.
```python
class RowReduction_(EpilogueVisitTree):
    def __call__(
            self, accum: 'tensor', c: 'tensor', alpha: 'scalar', beta: 'scalar'):
        #
        D = alpha * accum + tanh.numpy(beta * c)
        reduction = reduction_op(D, "row", "Add", args.threadblock_shape[1])
        return D, reduction

epilogue_functor = RowReduction_(
    epilogue_functor, tile_description, math_inst.element_accumulator,
    C.alignment, element_epilogue, C.element)
epilogue_functor.initialize()

cta_n = args.threadblock_shape[1]
num_cta_n = (problem_size.n() + cta_n - 1) // cta_n
reduction = np.zeros(
    shape=(args.batch * problem_size.m() * num_cta_n,),
    dtype=getattr(np, element_c))

# get output op
output_op = operation.epilogue_type(
    D=tensor_D, alpha=args.alpha, beta=args.beta, c=tensor_C,
    reduction=reduction,
    problem_size=[problem_size.m(), problem_size.n()]
)
```
Like other epilogue functors such as `LinearCombination`, the output op for the Epilogue Visitor Tree is also created with `operation.epilogue_type(*)`. However, there are two differences:
* The arguments need to be passed as keyword arguments. The keywords are the argument names in `def __call__`.
* An additional `problem_size=[problem_size.m(), problem_size.n()]` argument is required.

## Add new Unary Operation (e.g. Activation Function)

To add an additional unary operation to the Epilogue Visitor Tree, a new unary op should be created for `VisitorOpUnary`. We take `tanh` as an example.

### Step 1: define TanhVisitor

The visitor defines the parameters and computation required by the unary operation. The built-in unary operations are registered in [pycutlass/src/cpp/include/epilogue/epilogue_visitor_op/unary_ops.h](tools/library/scripts/pycutlass/src/cpp/include/epilogue/epilogue_visitor_op/unary_ops.h), but you can define yours in any header file and include that header in [pycutlass/src/cpp/include/epilogue/epilogue_visitor_op/visitor_op_unary.h](tools/library/scripts/pycutlass/src/cpp/include/epilogue/epilogue_visitor_op/visitor_op_unary.h).
* Two template arguments are required:
  * `T`: data type used to compute the unary operation
  * `N`: compute fragment length
* We also need to provide the `Arguments` and `Params` structures. The `Arguments` will be assembled by [ctypes](https://docs.python.org/3/library/ctypes.html); the `Params` will be generated from `Arguments` automatically. If the unary function takes no argument, an integer like `int tmp` can be provided to ensure the correctness of ctypes.
* The constructor can only take the `params` as its single argument.
* The operation is defined in `Array<T, N> operator()(Array<T, N> const &frag) const`. One common way to do this is to first define a scalar computation, and then use it for the fragment computation with an unrolled for-loop.
* A guard function is required. If it returns `false`, all the child nodes of the unary node are disabled and zeros are returned to the parent node. This is very helpful for multiplication by a scalar when the scalar is `0`, since the children's results can then be skipped. For general cases, you can just return `true`.
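The ctypes side of this contract can be illustrated with a standalone sketch. The `_Arguments` class below is hypothetical illustration code (not PyCUTLASS itself), mirroring the `int tmp` placeholder described above:

```python
import ctypes

# Python-side mirror of a C++ `Arguments` struct that holds a single `int tmp`.
# The placeholder field keeps the Python and C++ memory layouts in sync:
# a ctypes.Structure with no fields has size 0, while an empty C++ struct
# has size 1, so an argument-less struct would break the byte-for-byte mapping.
class _Arguments(ctypes.Structure):
    _fields_ = [
        ("tmp", ctypes.c_int),  # placeholder, always 0
    ]

args = _Arguments(0)
size = ctypes.sizeof(_Arguments)   # 4: one 32-bit int
raw = bytes(args)                  # the raw bytes passed down to the kernel params
```

Because `Params` is generated from `Arguments` automatically, these raw bytes are all the Python side needs to hand over.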
```c++
// T: data type used to compute the unary operation
// N: compute fragment length
template <typename T, int N>
struct TanhVisitor {
    /// Argument
    struct Arguments {
        // a placeholder argument to ensure the correctness of ctypes
        int tmp;

        CUTLASS_HOST_DEVICE
        Arguments(): tmp(0) { }

        CUTLASS_HOST_DEVICE
        Arguments(int tmp): tmp(tmp) { }
    };

    /// Param
    struct Params {
        CUTLASS_HOST_DEVICE
        Params() { }

        Params(Arguments const &args) { }
    };

    /// Constructor
    CUTLASS_HOST_DEVICE
    TanhVisitor(Params const &params) { }

    /// scalar operator
    CUTLASS_HOST_DEVICE
    T tanh_op(T const &scalar) const {
        return fast_tanh(scalar);
    }

    /// vector operator
    CUTLASS_HOST_DEVICE
    Array<T, N> operator()(Array<T, N> const &frag) const {
        Array<T, N> y;

        CUTLASS_PRAGMA_UNROLL
        for (int i = 0; i < N; ++i) {
            y[i] = tanh_op(frag[i]);
        }
        return y;
    }

    /// guard
    CUTLASS_HOST_DEVICE
    bool guard() {
        return true;
    }
};
```

### Step 2: register Tanh function

After defining the function in C++, we need to register it in Python. The class below gives an example.
* The init function takes the data type `element_compute`, which will be the `T` in the C++ template. In the init function, we also generate the `_Arguments` class as a `ctypes.Structure`. It includes all the data members of `TanhVisitor::Arguments`.
* The `_Arguments` class needs to be registered as `self.argument_type` of the `tanh` class.
* An `emit` function is required to emit the namespace and typename of `TanhVisitor`.
* A staticmethod serving as the NumPy reference is required so that the Python source can be parsed. The built-in functions are defined in [pycutlass/src/pycutlass/epilogue.py](tools/library/scripts/pycutlass/src/pycutlass/epilogue.py). You can define yours in any file as long as it can be found by [pycutlass/src/pycutlass/parser.py](tools/library/scripts/pycutlass/src/pycutlass/parser.py).
```python
class tanh(ActivationFunctor):
    def __init__(self, element_compute) -> None:
        super().__init__()

        class _Arguments(ctypes.Structure):
            _fields_ = [
                ("tmp", ctypes.c_int)
            ]
            def __init__(self, *args) -> None:
                self.tmp = 0

        self.argument_type = _Arguments

    def emit(self):
        return "cutlass::TanhVisitor"

    @staticmethod
    def numpy(x: np.ndarray):
        return np.tanh(x)
```

### Step 3: Run the function

Now the new unary op is ready to use. An epilogue visitor tree can be built with
```python
class RowReduction_(EpilogueVisitTree):
    def __call__(
            self, accum: NDArray['tensor', 'float32'], c: NDArray['tensor', 'float32'],
            alpha: 'float32', beta: 'float32'):
        #
        D = alpha * accum + tanh.numpy(beta * c)
        reduction = reduction_op(D, "row", "Add", args.threadblock_shape[1])
        return D, reduction

epilogue_functor = RowReduction_(
    epilogue_functor, tile_description, math_inst.element_accumulator,
    C.alignment, element_epilogue, C.element)
epilogue_functor.initialize()
```

## Limitations and Future work

Although the Epilogue Visitor Tree brings great flexibility to epilogue construction, because the epilogue is formulated as a single tree, there are several limitations.
* [Future Work] Serial and parallel split-K GEMM are not supported yet.
  * To support serial split-K, an additional tree transformation pass is required to inject a `binaryOpNode(Add)` + `TensorInputNode` before each `TensorOutputNode` to fetch the partial sum back. The `semaphore` also needs to be passed into the epilogue.
  * To support parallel split-K, a Reduction kernel with visitor is required.
* [Future Work] Convolution and GEMM Grouped are not supported yet.
  * To support Conv2d and GEMM Grouped, corresponding *_with_visitor kernels are required.
* [Limitation] If the same node is used by two operations (unless one of them is a reduction), the node and all its descendants will be executed twice.
* [Limitation] The result of a reduction can only be used as a return value, not as an input to other operations in the epilogue.
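To make the per-tile reduction semantics concrete, here is a pure-NumPy sketch of the math that `RowReduction_` describes. The function name `row_reduction_reference` and the parameter names are chosen for illustration, and the per-tile layout of the partial sums is inferred from the `args.batch * problem_size.m() * num_cta_n` reduction-buffer size shown earlier; the sketch models only the arithmetic, not the actual kernel:

```python
import numpy as np

def row_reduction_reference(accum, c, alpha, beta, cta_n):
    # Epilogue body: D = alpha * accum + tanh(beta * c)
    D = alpha * accum + np.tanh(beta * c)
    m, n = D.shape
    num_cta_n = (n + cta_n - 1) // cta_n          # number of tiles along n
    # Pad n up to a multiple of cta_n so every tile reduces the same width.
    padded = np.zeros((m, num_cta_n * cta_n), dtype=D.dtype)
    padded[:, :n] = D
    # One partial row-sum per (row, tile) pair: m * num_cta_n values in total,
    # matching the size of the reduction buffer allocated above.
    reduction = padded.reshape(m, num_cta_n, cta_n).sum(axis=-1).reshape(-1)
    return D, reduction

accum = np.ones((4, 6), dtype=np.float32)
c = np.zeros((4, 6), dtype=np.float32)
D, reduction = row_reduction_reference(accum, c, alpha=2.0, beta=0.5, cta_n=4)
# D is all 2.0; each row yields two partial sums, 4 * 2.0 and 2 * 2.0,
# because the second tile covers only two valid columns.
```

This also illustrates why the reduction output can only be returned: each value is a per-threadblock partial sum, not a finished row reduction, so it is written straight to global memory rather than fed back into the tree.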