# Epilogue Visitor Tree

The Epilogue Visitor Tree is an experimental feature that generates epilogues directly from user-provided Python functions.
## Usage

The Epilogue Visitor Tree supports many different operations.
### Unary functions

The Epilogue Visitor Tree supports unary functions such as activation functions. For example:
```python
class UnaryEpilogue_(EpilogueVisitTree):
    def __call__(
            self, accum: 'tensor', c: 'tensor',
            alpha: 'scalar', beta: 'scalar'):
        #
        T = leaky_relu.numpy(accum, 0.2)
        Z = alpha * T + beta * c
        return Z

epilogue_functor = UnaryEpilogue_(
    epilogue_functor, tile_description, math_inst.element_accumulator,
    C.alignment, element_epilogue, C.element)
```
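For reference, the epilogue above computes the following in plain NumPy (a sketch only; `leaky_relu_ref` is a hypothetical stand-in for `leaky_relu.numpy`, and the shapes are illustrative):

```python
import numpy as np

def leaky_relu_ref(x, slope):
    # hypothetical stand-in for leaky_relu.numpy: x if x >= 0, else slope * x
    return np.where(x >= 0, x, slope * x)

def unary_epilogue_ref(accum, c, alpha, beta):
    # mirrors the body of UnaryEpilogue_.__call__
    T = leaky_relu_ref(accum, 0.2)
    return alpha * T + beta * c

accum = np.array([[1.0, -2.0], [3.0, -4.0]])
c = np.ones((2, 2))
Z = unary_epilogue_ref(accum, c, alpha=2.0, beta=0.5)
# Z == [[2.5, -0.3], [6.5, -1.1]]
```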
### Broadcast Operation

The Epilogue Visitor Tree supports broadcasting row and column vectors to the whole output matrix. To use broadcast, you only need to annotate whether the source vector is a row vector or a column vector. Here is an example:
```python
class ColumnBroadcast_(EpilogueVisitTree):
    def __call__(
            self, accum: 'tensor', c: 'tensor',
            vector: 'column', alpha: 'scalar', beta: 'scalar'):
        #
        T = accum + vector
        scale_T = leaky_relu.numpy(alpha * T, 0.2)
        Z = scale_T + beta * c
        return Z, T

epilogue_functor = ColumnBroadcast_(
    epilogue_functor, tile_description, math_inst.element_accumulator,
    C.alignment, element_epilogue, C.element)
```
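The broadcast follows standard NumPy semantics: a column vector of shape `(m, 1)` is added to every column of the output tile. A self-contained sketch of the computation above (`leaky_relu_ref` is a hypothetical stand-in for `leaky_relu.numpy`; the values are illustrative):

```python
import numpy as np

def leaky_relu_ref(x, slope):
    # hypothetical stand-in for leaky_relu.numpy
    return np.where(x >= 0, x, slope * x)

accum = np.arange(6, dtype=np.float64).reshape(3, 2)
vector = np.array([[10.0], [20.0], [30.0]])  # column vector, shape (3, 1)
c = np.ones((3, 2))
alpha, beta = 1.0, 0.5

T = accum + vector                   # broadcast over columns
scale_T = leaky_relu_ref(alpha * T, 0.2)
Z = scale_T + beta * c
# Z == [[10.5, 11.5], [22.5, 23.5], [34.5, 35.5]]
```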
### Reduction Operation

The Epilogue Visitor Tree also supports row- and column-wise reduction within each threadblock tile. The syntax for reduction is:

```
{reduction_output} = reduction_op({input_tensor}, {row|column}, {Add}, {threadblock_shape.n|threadblock_shape.m})
```

The `{row|column}` indicates whether the row vectors or the column vectors are reduced. The `{Add}` specifies the reduction operation. The `{threadblock_shape.n|threadblock_shape.m}` is the reduction length.
#### Constraints

- The `{input_tensor}` can only be the name of a source or an intermediate result. `reduction_op(A + B, ...)` will not work; please use `C = A + B`, `reduction_op(C, ...)` instead.
- The `{reduction_output}` cannot be used elsewhere in the epilogue. It is written directly to global memory after the reduction is done.
```python
class RowReduction_(EpilogueVisitTree):
    def __call__(
            self, accum: 'tensor', c: 'tensor',
            alpha: 'scalar', beta: 'scalar'):
        #
        D = alpha * accum + tanh.numpy(beta * c)
        reduction = reduction_op(D, "row", "Add", args.threadblock_shape[1])
        return D, reduction

epilogue_functor = RowReduction_(
    epilogue_functor, tile_description, math_inst.element_accumulator,
    C.alignment, element_epilogue, C.element)
epilogue_functor.initialize()
```
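To make the per-threadblock semantics concrete, here is a NumPy sketch of what `reduction_op(D, "row", "Add", cta_n)` produces: each column tile of width `cta_n` sums its rows independently, yielding one partial sum per (row, column-tile) pair. `row_reduction_ref` is a hypothetical reference helper, and the flattened buffer layout shown is illustrative:

```python
import numpy as np

def row_reduction_ref(D, cta_n):
    # hypothetical reference for reduction_op(D, "row", "Add", cta_n):
    # each threadblock column tile of width cta_n sums its rows independently
    m, n = D.shape
    num_cta_n = (n + cta_n - 1) // cta_n
    out = np.zeros((m, num_cta_n), dtype=D.dtype)
    for j in range(num_cta_n):
        out[:, j] = D[:, j * cta_n:(j + 1) * cta_n].sum(axis=1)
    # flattened to match the 1-D host-side reduction buffer
    return out.reshape(-1)

D = np.arange(8, dtype=np.float64).reshape(2, 4)
partials = row_reduction_ref(D, cta_n=2)
# partials == [1., 5., 9., 13.]
```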
### Get output_op

As shown in the user guide, an `output_op` is required by the argument wrapper. We will take `RowReduction_` as an example to show how to get the `output_op`.
```python
class RowReduction_(EpilogueVisitTree):
    def __call__(
            self, accum: 'tensor', c: 'tensor',
            alpha: 'scalar', beta: 'scalar'):
        #
        D = alpha * accum + tanh.numpy(beta * c)
        reduction = reduction_op(D, "row", "Add", args.threadblock_shape[1])
        return D, reduction

epilogue_functor = RowReduction_(
    epilogue_functor, tile_description, math_inst.element_accumulator,
    C.alignment, element_epilogue, C.element)
epilogue_functor.initialize()

cta_n = args.threadblock_shape[1]
num_cta_n = (problem_size.n() + cta_n - 1) // cta_n
reduction = np.zeros(shape=(args.batch * problem_size.m() * num_cta_n,), dtype=getattr(np, element_c))

# get output op
output_op = operation.epilogue_type(
    D=tensor_D, alpha=args.alpha, beta=args.beta, c=tensor_C,
    reduction=reduction, problem_size=[problem_size.m(), problem_size.n()]
)
```
Like other epilogue functors such as `LinearCombination`, the output op for the Epilogue Visitor Tree is also created with `operation.epilogue_type(*)`. However, there are two differences:

- The arguments need to be passed as keyword arguments. The keywords are the argument names in `def __call__`.
- An additional `problem_size=[problem_size.m(), problem_size.n()]` is required.
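The reduction buffer sizing above is a ceiling division over the column tiles. With hypothetical values (a 512x384 problem, batch 1, `threadblock_shape = [128, 128, 32]`) it works out as follows:

```python
import numpy as np

# hypothetical problem: batch 1, 512x384 GEMM, threadblock_shape = [128, 128, 32]
batch, m, n = 1, 512, 384
cta_n = 128                            # threadblock_shape[1]
num_cta_n = (n + cta_n - 1) // cta_n   # ceil(384 / 128) = 3 column tiles
reduction = np.zeros(shape=(batch * m * num_cta_n,), dtype=np.float32)
# one partial sum per (batch, row, column tile): 1 * 512 * 3 = 1536 entries
```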
## Add new Unary Operation (e.g. Activation Function)

To add an additional unary operation to the Epilogue Visitor Tree, a new unary op should be created for `VisitorOpUnary`. We will take `tanh` as an example.
### Step 1: define TanhVisitor

The visitor defines the parameters and computation required by the unary operation. The built-in unary operations are registered in `pycutlass/src/cpp/include/epilogue/epilogue_visitor_op/unary_ops.h`, but you can define yours in any header file and include that header in `pycutlass/src/cpp/include/epilogue/epilogue_visitor_op/visitor_op_unary.h`.
- Two template arguments are required:
  - `T`: data type used to compute the unary operation
  - `N`: compute fragment length
- We also need to provide the `Arguments` and `Params` structures. The `Arguments` will be assembled by ctypes, and the `Params` will be generated from `Arguments` automatically. If the unary function takes no argument, an integer like `int tmp` can be provided to ensure the correctness of ctypes.
- The constructor can only take the `params` as its single argument.
- The operation is defined in `Array<T, N> operator()(Array<T, N> const &frag) const`. One common way to do this is to first define a scalar computation, and then use it for the fragment computation with an unrolled for-loop.
- A guard function is required. If it returns `false`, all the children nodes of the unary node are disabled and zeros are returned to the parent node. This is very helpful for multiplication by a scalar when the scalar is `0`. For general cases, you can just return `true`.
```cpp
// T: data type used to compute the unary operation
// N: compute fragment length
template <typename T, int N>
struct TanhVisitor {
    /// Argument
    struct Arguments {
        // a placeholder argument to ensure correctness of ctypes
        int tmp;

        CUTLASS_HOST_DEVICE
        Arguments(): tmp(0) { };

        CUTLASS_HOST_DEVICE
        Arguments(int tmp): tmp(tmp) { };
    };

    /// Param
    struct Params {
        CUTLASS_HOST_DEVICE
        Params(){ };
        Params(Arguments const &args) { }
    };

    /// Constructor
    CUTLASS_HOST_DEVICE
    TanhVisitor(Params const &params) { }

    /// scalar operator
    CUTLASS_HOST_DEVICE
    T tanh_op(T const &scalar) const {
        return fast_tanh(scalar);
    }

    /// vector operator
    CUTLASS_HOST_DEVICE
    Array<T, N> operator()(Array<T, N> const &frag) const {
        Array<T, N> y;

        CUTLASS_PRAGMA_UNROLL
        for (int i=0; i < N; ++i) {
            y[i] = tanh_op(frag[i]);
        }

        return y;
    }

    /// Guard
    CUTLASS_HOST_DEVICE
    bool guard() {
        return true;
    }
};
```
### Step 2: register Tanh function

After defining the function in C++, we need to register it in Python. The class below gives an example.
- The init function takes the data type `element_compute`, which will be the `T` in the C++ template. In the init function, we also generate the `_Arguments` class as a `ctypes.Structure`. It includes all the data members in `TanhVisitor::Arguments`.
- The `_Arguments` class needs to be registered as `self.argument_type` of the `tanh` class.
- An `emit` function is required to emit the namespace and typename of `TanhVisitor`.
- A staticmethod serving as the numpy reference is required so that the Python code can be parsed.

The built-in functions are defined in `pycutlass/src/pycutlass/epilogue.py`. You can define yours in any file as long as it can be found by `pycutlass/src/pycutlass/parser.py`.
```python
class tanh(ActivationFunctor):
    def __init__(self, element_compute) -> None:
        super().__init__()

        class _Arguments(ctypes.Structure):
            _fields_ = [
                ("tmp", ctypes.c_int)
            ]

            def __init__(self, *args) -> None:
                self.tmp = 0

        self.argument_type = _Arguments

    def emit(self):
        return "cutlass::TanhVisitor"

    @staticmethod
    def numpy(x: np.ndarray):
        return np.tanh(x)
```
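The key contract is that `_Arguments` has the same byte layout as `TanhVisitor::Arguments` on the C++ side. A standalone sketch of just the ctypes and numpy pieces (the `ActivationFunctor` base class and the pycutlass plumbing are omitted here):

```python
import ctypes
import numpy as np

class _Arguments(ctypes.Structure):
    # mirrors TanhVisitor::Arguments: a single placeholder int
    _fields_ = [("tmp", ctypes.c_int)]

args = _Arguments(0)
# the struct occupies exactly one C int, matching `int tmp` in C++
size_ok = ctypes.sizeof(args) == ctypes.sizeof(ctypes.c_int)

# the staticmethod numpy reference is what the parser evaluates for `tanh`
ref = np.tanh(np.array([0.0, 1.0]))
```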
### Step 3: Run the function

Now the new unary op is ready to use. An epilogue visitor tree can be built with:
```python
class RowReduction_(EpilogueVisitTree):
    def __call__(
            self, accum: NDArray['tensor', 'float32'], c: NDArray['tensor', 'float32'],
            alpha: 'float32', beta: 'float32'):
        #
        D = alpha * accum + tanh.numpy(beta * c)
        reduction = reduction_op(D, "row", "Add", args.threadblock_shape[1])
        return D, reduction

epilogue_functor = RowReduction_(
    epilogue_functor, tile_description, math_inst.element_accumulator,
    C.alignment, element_epilogue, C.element)
epilogue_functor.initialize()
```
## Limitations and Future work

Although the Epilogue Visitor Tree brings great flexibility to epilogue construction, the epilogue is formulated as a single tree, which leads to several limitations.
- [Future Work] Serial and Parallel Split-K GEMM are not supported yet.
  - To support serial split-k, an additional tree transformation pass is required to inject a `binaryOpNode(Add)` + `TensorInputNode` before each `TensorOutputNode` to fetch the partial sum back. The `semaphore` also needs to be passed into the epilogue.
  - To support parallel split-k, a Reduction-with-visitor kernel is required.
- [Future Work] Convolution and GEMM Grouped are not supported yet.
  - To support Conv2d and GEMM Grouped, corresponding `*_with_visitor` kernels are required.
- [Limitation] If the same node is used by two operations (unless one of them is a reduction), the node and all its descendants will be executed twice.
- [Limitation] The result of a reduction can only be used as a return value.