	Epilogue Visitor Tree
The Epilogue Visitor Tree is an experimental feature that generates epilogues directly from user-provided Python functions.
Usage
The Epilogue Visitor Tree supports several kinds of operations.
Unary functions
Epilogue Visitor Tree supports unary functions like activation functions. For example,
class UnaryEpilogue_(EpilogueVisitTree):
    def __call__(
        self, accum: 'tensor', c: 'tensor', 
        alpha: 'scalar', beta: 'scalar'):
        #
        T = leaky_relu.numpy(accum, 0.2)
        Z = alpha * T + beta * c
        return Z
epilogue_functor = UnaryEpilogue_(
    epilogue_functor, tile_description, math_inst.element_accumulator, 
    C.alignment, element_epilogue, C.element)
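The arithmetic that this epilogue describes can be checked on the host with plain NumPy. The sketch below is only an illustration: leaky_relu_ref and unary_epilogue_ref are hypothetical helper names, not part of pycutlass, and the leaky ReLU is assumed to follow the usual definition (x where x >= 0, slope * x elsewhere).

```python
import numpy as np


def leaky_relu_ref(x: np.ndarray, slope: float) -> np.ndarray:
    """Assumed definition of leaky ReLU: x where x >= 0, slope * x elsewhere."""
    return np.where(x >= 0, x, slope * x)


def unary_epilogue_ref(accum, c, alpha, beta, slope=0.2):
    """Host-side reference for Z = alpha * leaky_relu(accum) + beta * c."""
    t = leaky_relu_ref(accum, slope)
    return alpha * t + beta * c


accum = np.array([[-1.0, 2.0], [3.0, -4.0]])
c = np.ones((2, 2))
z = unary_epilogue_ref(accum, c, alpha=2.0, beta=0.5)
```

A reference like this can be compared against the tensor written by the kernel when validating a new epilogue.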
Broadcast Operation
Epilogue Visitor Tree supports broadcasting row and column vectors to the whole output matrix. To use broadcast, you just need to specify whether the source vector is a row vector or a column vector. Here is an example.
class ColumnBroadcast_(EpilogueVisitTree):
    def __call__(
        self, accum: 'tensor',  c: 'tensor', 
        vector: 'column', alpha: 'scalar', beta: 'scalar'):
        #
        T = accum + vector
        scale_T = leaky_relu.numpy(alpha * T, 0.2)
        Z = scale_T + beta * c
        return Z, T
epilogue_functor = ColumnBroadcast_(
    epilogue_functor, tile_description, math_inst.element_accumulator, 
    C.alignment, element_epilogue, C.element)
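As a rough host-side illustration of the broadcast semantics, the same arithmetic can be written with NumPy broadcasting. This is a sketch under stated assumptions: column_broadcast_ref is a hypothetical helper, the column vector is assumed to hold one value per output row (shape (M, 1)), and the leaky ReLU is assumed to follow the usual definition.

```python
import numpy as np


def column_broadcast_ref(accum, c, vector, alpha, beta, slope=0.2):
    """Host-side reference for the ColumnBroadcast_ arithmetic.

    Assumption: `vector` has shape (M, 1), so NumPy broadcasting copies it
    across all N columns of the (M, N) accumulator.
    """
    t = accum + vector
    scaled = alpha * t
    scale_t = np.where(scaled >= 0, scaled, slope * scaled)  # assumed leaky ReLU
    z = scale_t + beta * c
    return z, t


accum = np.zeros((2, 3))
c = np.ones((2, 3))
vector = np.array([[1.0], [-1.0]])  # one value per output row
z, t = column_broadcast_ref(accum, c, vector, alpha=1.0, beta=1.0)
```

Under the same assumption, a row vector would have shape (1, N) and be copied across all M rows.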
Reduction Operation
Epilogue Visitor Tree also supports row and column-wise reduction in each threadblock tile. The syntax for reduction is
{reduction_output} = reduction_op({input_tensor}, {row|column}, {Add}, {threadblock_shape.n|threadblock_shape.m})
{row|column} indicates whether the row vectors or the column vectors are reduced. {Add} specifies the reduction operation. {threadblock_shape.n|threadblock_shape.m} is the reduction length.
Constraints
- The {input_tensor} can only be the name of a source or an intermediate result. reduction_op(A + B, ...) will not work; use C = A + B followed by reduction_op(C, ...) instead.
- The {reduction_output} cannot be used in the epilogue. It is written directly to global memory once the reduction is done.
class RowReduction_(EpilogueVisitTree):
    def __call__(
        self, accum: 'tensor',  c: 'tensor', 
        alpha: 'scalar', beta: 'scalar'):
        #
        D = alpha * accum + tanh.numpy(beta * c)
        reduction = reduction_op(D, "row", "Add", args.threadblock_shape[1])
        return D, reduction
epilogue_functor = RowReduction_(
    epilogue_functor, tile_description, math_inst.element_accumulator, 
    C.alignment, element_epilogue, C.element)
epilogue_functor.initialize()
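The per-tile reduction can be pictured with a small NumPy model. This is a sketch, not pycutlass code: tiled_row_reduce_ref is a hypothetical helper, and it assumes that a "row" reduction of length threadblock_shape.n sums each row over the columns of one threadblock tile, giving one partial sum per row and per tile along N. That assumption is consistent with the reduction buffer of size m * num_cta_n used later in this document.

```python
import numpy as np


def tiled_row_reduce_ref(d: np.ndarray, cta_n: int) -> np.ndarray:
    """Sum each row of `d` within every threadblock tile of width `cta_n`.

    Returns an array of shape (M, num_cta_n): one partial sum per row and tile.
    """
    m, n = d.shape
    num_cta_n = (n + cta_n - 1) // cta_n  # ceiling division
    # Zero-pad the last tile so every tile has exactly cta_n columns.
    padded = np.zeros((m, num_cta_n * cta_n), dtype=d.dtype)
    padded[:, :n] = d
    return padded.reshape(m, num_cta_n, cta_n).sum(axis=2)


d = np.arange(10, dtype=np.float32).reshape(2, 5)
partial = tiled_row_reduce_ref(d, cta_n=2)
```

Summing the partial results along the tile axis recovers the full row sums, which is one way to sanity-check the values a kernel writes out.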
Get output_op
As shown in the user guide, an output_op is required by the argument wrapper. We will take the RowReduction_ as an example to show how to get output_op.
class RowReduction_(EpilogueVisitTree):
    def __call__(
        self, accum: 'tensor',  c: 'tensor', 
        alpha: 'scalar', beta: 'scalar'):
        #
        D = alpha * accum + tanh.numpy(beta * c)
        reduction = reduction_op(D, "row", "Add", args.threadblock_shape[1])
        return D, reduction
epilogue_functor = RowReduction_(
    epilogue_functor, tile_description, math_inst.element_accumulator, 
    C.alignment, element_epilogue, C.element)
epilogue_functor.initialize()
cta_n = args.threadblock_shape[1]
num_cta_n = (problem_size.n() + cta_n - 1) // cta_n
reduction = np.zeros(shape=(args.batch * problem_size.m() * num_cta_n,), dtype=getattr(np, element_c))
# get output op
output_op = operation.epilogue_type(
    D=tensor_D, alpha=args.alpha, beta=args.beta, c=tensor_C, reduction=reduction, problem_size=[problem_size.m(), problem_size.n()]
)
Like other epilogue functors such as LinearCombination, the output op for the Epilogue Visitor Tree is also created with operation.epilogue_type(*). However, there are two differences:
- The arguments must be passed as keyword arguments. The keywords are the argument names in def __call__; as the example above shows, the returned tensors (D and reduction) are passed by name as well.
- An additional problem_size=[problem_size.m(), problem_size.n()] is required.
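The keyword set used in the example call can be derived mechanically from the functor's signature. The sketch below uses only the standard inspect module. RowReductionSignature and expected_keywords are hypothetical names introduced for illustration. The rule it encodes is an assumption read off the example above, not a documented pycutlass contract: accum is left out, and the returned tensors and problem_size are added.

```python
import inspect


class RowReductionSignature:
    """Stand-in with the same __call__ parameters as RowReduction_ above."""

    def __call__(self, accum, c, alpha, beta):
        raise NotImplementedError("signature only")


def expected_keywords(functor, returns, implicit=("accum",)):
    """Keywords the output op is assumed to need.

    Assumption: the accumulator is supplied by the kernel itself, while the
    returned tensors and problem_size are passed alongside the other inputs.
    """
    params = [name for name in inspect.signature(functor.__call__).parameters
              if name not in implicit]
    return params + list(returns) + ["problem_size"]


keywords = expected_keywords(RowReductionSignature(), returns=("D", "reduction"))
```

A check of this kind can catch a misspelled keyword before the arguments reach the kernel launch.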
Add new Unary Operation (e.g. Activation Function)
To add a new unary operation to the Epilogue Visitor Tree, a new unary op
should be created for VisitorOpUnary. We will take tanh as an example.
Step 1: define TanhVisitor
The visitor defines the parameters and computation required by the unary operation. The built-in unary operations are registered in pycutlass/src/cpp/include/epilogue/epilogue_visitor_op/unary_ops.h, but you can define yours in any header file and include that header in pycutlass/src/cpp/include/epilogue/epilogue_visitor_op/visitor_op_unary.h.
- Two template arguments are required:
  - T: data type used to compute the unary operation
  - N: compute fragment length
- We also need to provide the Arguments and Params structures. The Arguments will be assembled by ctypes, and the Params will be generated from Arguments automatically. If the unary function takes no argument, a placeholder integer like int tmp can be provided to ensure the correctness of ctypes.
- The constructor can only take the params as its single argument.
- The operation is defined in Array<T, N> operator()(Array<T, N> const &frag) const. One common way to do that is to first define a scalar computation, and then use it for the fragment computation with an unrolled for-loop.
- A guard function is required. If it returns false, it disables all the children nodes of the unary node and returns zeros to the parent node. This is very helpful for multiplication with a scalar when the scalar is 0. For general cases, you can just return true.
// T: data type used to compute the unary operation
// N: compute fragment length
template <typename T, int N>
struct TanhVisitor {
    /// Argument
    struct Arguments {
        // a placeholder argument to ensure correctness of ctypes
        int tmp;
        CUTLASS_HOST_DEVICE
        Arguments(): tmp(0) { };
        CUTLASS_HOST_DEVICE
        Arguments(int tmp): tmp(tmp) { };
    };
    /// Param
    struct Params {
        CUTLASS_HOST_DEVICE
        Params(){ };
        Params(Arguments const &args) { }
    };
    /// Constructor
    CUTLASS_HOST_DEVICE
    TanhVisitor(Params const &params) { }
    // scalar operator
    CUTLASS_HOST_DEVICE
    T tanh_op(T const &scalar) const {
        return fast_tanh(scalar);
    }
    /// vector operator
    CUTLASS_HOST_DEVICE
    Array<T, N> operator()(Array<T, N> const &frag) const {
        Array<T, N> y;
        CUTLASS_PRAGMA_UNROLL
        for (int i=0; i < N; ++i) {
            y[i] = tanh_op(frag[i]);
        }
        return y;
    }
    // Guard
    CUTLASS_HOST_DEVICE
    bool guard() {
        return true;
    }
};
Step 2: register Tanh function
After defining the function in C++, we need to register it in Python. The class below gives an example.
- The init function takes the data type element_compute, which will be the T in the C++ template. In the init function, we also generate the _Arguments class as a ctypes.Structure. It includes all the data members in TanhVisitor::Arguments.
- The _Arguments class needs to be registered as self.argument_type of the tanh class.
- An emit function is required to emit the namespace and typename of TanhVisitor.
- A staticmethod named numpy is required as the NumPy reference; it implements the Python code to be parsed.
The built-in functions are defined in pycutlass/src/pycutlass/epilogue.py. You can define yours in any file as long as it can be found by /pycutlass/src/pycutlass/parser.py.
class tanh(ActivationFunctor):
    def __init__(self, element_compute) -> None:
        super().__init__()
        class _Arguments(ctypes.Structure):
            _fields_ = [
                ("tmp", ctypes.c_int)
            ]
            def __init__(self, *args) -> None:
                self.tmp = 0
        self.argument_type = _Arguments
    
    def emit(self):
        return "cutlass::TanhVisitor"
    @staticmethod
    def numpy(x: np.ndarray):
        return np.tanh(x)
Step 3: Run the function
Now the new unary op is ready to use. An epilogue visitor tree can be built with it as follows:
class RowReduction_(EpilogueVisitTree):
    def __call__(
        self, accum: NDArray['tensor', 'float32'],  c: NDArray['tensor', 'float32'], 
        alpha: 'float32', beta: 'float32'):
        #
        D = alpha * accum + tanh.numpy(beta * c)
        reduction = reduction_op(D, "row", "Add", args.threadblock_shape[1])
        return D, reduction
epilogue_functor = RowReduction_(
    epilogue_functor, tile_description, math_inst.element_accumulator, 
    C.alignment, element_epilogue, C.element)
epilogue_functor.initialize()
Limitations and Future work
The Epilogue Visitor Tree brings great flexibility to epilogue construction, but because the epilogue is formulated as a single tree, it has several limitations.
- [Future Work] Serial and Parallel Split-K GEMM are not supported yet.
  - To support serial split-k, an additional tree transformation pass is required to inject a binaryOpNode(Add) + TensorInputNode before each TensorOutputNode to fetch the partial sum back. The semaphore also needs to be passed into the epilogue.
  - To support parallel split-k, a Reduction with visitor kernel is required.
- [Future Work] Convolution and GEMM Grouped are not supported yet.
  - To support Conv2d and GEMM Grouped, corresponding *_with_visitor kernels are required.
- [Limitation] If the same node is used by two operations (unless one of them is a reduction), the node and all its descendants will be executed twice.
- [Limitation] The result of a reduction can only be used as a return value.
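The cost of sharing a node can be seen in a toy model of tree evaluation. The Node class below is purely hypothetical and is not how pycutlass represents its tree. It only shows that when an expression is evaluated as a tree without caching, a node reached through two parents is computed once per parent, together with everything beneath it.

```python
class Node:
    """Toy expression-tree node that counts how often it is evaluated."""

    def __init__(self, name, fn, children=()):
        self.name = name
        self.fn = fn
        self.children = list(children)
        self.visits = 0

    def visit(self):
        self.visits += 1
        # No caching: each parent re-evaluates its children from scratch.
        return self.fn(*(child.visit() for child in self.children))


accum = Node("accum", lambda: 3.0)
shared = Node("T", lambda x: x + 1.0, [accum])         # T = accum + 1
left = Node("2T", lambda t: 2.0 * t, [shared])         # first use of T
root = Node("Z", lambda a, t: a + t, [left, shared])   # second use of T

result = root.visit()
```

Here both T and the accum node beneath it are visited twice, so keeping shared subexpressions small limits the redundant work.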
