Added examples to enable the unity build (#102)
* Updated documentation of fused GEMM example and removed UNITY BUILD batch size. The default batch size when unity build is enabled tends to be favorable.
parent 1ab1027954
commit fd7e058d0c
@@ -22,8 +22,32 @@
 * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 *
 **************************************************************************************************/
+/*
+
+This example shows fusing two GEMM mainloops into one kernel. The first GEMM computes relu(alpha*A*B) and
+the second GEMM computes relu(alpha*A*B+beta*C). The performance measuring environment compares against
+two unfused GEMM operations, demonstrating a speedup of the fused kernel on the
+NVIDIA Turing GPU architecture.
+
+Problem size:
+
+GEMM1 (M,N,K): 128*1600, 64, 576
+GEMM2 (M,N,K): 128*1600, 128, 64
+
+Note that GEMM1_N = GEMM2_K
+
+The example requires the number of threadblocks be the same across 2 GEMMs and
+thread_block_tile_N = problem_N so the data required by each layer is threadblock-resident. It
+also requires warp_tile_N = thread_block_tile_N so the data required by each warp is
+register-file-resident.
+
+Performance:
+
+- fp16 on Tesla T4 @ 1590MHz (non-fused vs. fused): 1.39011 ms vs. 1.26035 ms
+- int8 on Tesla T4 @ 1590MHz (non-fused vs. fused): 0.751759 ms vs. 0.62971 ms
+- fp16 on Quadro RTX 8000 @ 1890MHz (non-fused vs. fused): 0.721144 ms vs. 0.629864 ms
+- int8 on Quadro RTX 8000 @ 1890MHz (non-fused vs. fused): 0.379049 ms vs. 0.324764 ms

-/**
 */

 #include "b2b_gemm_f16t_f16n_f16t_tensor_op_f16_sm75.h"
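To make the data dependence described in the comment above concrete, here is a minimal CPU sketch of the two GEMMs run back to back without fusion. It is not the CUTLASS kernel. The names `Matrix`, `gemm_relu`, and `b2b_gemm_reference` are hypothetical, matrices are row-major `float`, and the sketch assumes the second GEMM consumes the first GEMM's output as its A operand, which is what the GEMM1_N = GEMM2_K note suggests.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical row-major matrix type, used only for this sketch.
struct Matrix {
  std::size_t rows, cols;
  std::vector<float> data;
  Matrix(std::size_t r, std::size_t c, float fill = 0.0f)
      : rows(r), cols(c), data(r * c, fill) {}
  float& at(std::size_t i, std::size_t j) { return data[i * cols + j]; }
  float at(std::size_t i, std::size_t j) const { return data[i * cols + j]; }
};

// D = relu(alpha * A * B + beta * C). Pass beta = 0 to ignore C.
Matrix gemm_relu(float alpha, const Matrix& A, const Matrix& B,
                 float beta, const Matrix& C) {
  assert(A.cols == B.rows);                      // inner dimensions must agree
  assert(C.rows == A.rows && C.cols == B.cols);  // C matches the output shape
  Matrix D(A.rows, B.cols);
  for (std::size_t i = 0; i < A.rows; ++i) {
    for (std::size_t j = 0; j < B.cols; ++j) {
      float acc = 0.0f;
      for (std::size_t k = 0; k < A.cols; ++k) {
        acc += A.at(i, k) * B.at(k, j);
      }
      D.at(i, j) = std::max(0.0f, alpha * acc + beta * C.at(i, j));
    }
  }
  return D;
}

// Unfused reference: the whole intermediate D0 is materialized between the
// two GEMMs. A fused kernel would instead keep each tile of D0 on chip.
Matrix b2b_gemm_reference(float alpha0, const Matrix& A0, const Matrix& B0,
                          float alpha1, const Matrix& B1,
                          float beta1, const Matrix& C1) {
  assert(B0.cols == B1.rows);  // GEMM1_N == GEMM2_K
  Matrix zero(A0.rows, B0.cols);
  Matrix D0 = gemm_relu(alpha0, A0, B0, 0.0f, zero);  // relu(alpha*A*B)
  return gemm_relu(alpha1, D0, B1, beta1, C1);        // relu(alpha*D0*B+beta*C)
}
```

With the problem sizes listed in the comment, `A0` would be (128*1600) x 576, `B0` 576 x 64, and `B1` 64 x 128, so the intermediate has 64 columns and the second GEMM has K = 64.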
@@ -15,10 +15,12 @@ $ make cutlass_profiler -j
 To limit compilation time, only one tile size (128x128) is instantiated for each data type, math instruction, and layout.
 To instantiate all sizes, set the following environment variable when running CMake from an empty `build/` directory.
 ```bash
-$ cmake .. -DCUTLASS_NVCC_ARCHS="70;75;80" -DCUTLASS_LIBRARY_KERNELS=all
+$ cmake .. -DCUTLASS_NVCC_ARCHS="70;75;80" -DCUTLASS_LIBRARY_KERNELS=all -DCUTLASS_UNITY_BUILD_ENABLED=ON
 ...
 $ make cutlass_profiler -j
 ```
+Enabling the unity build places multiple kernel instances in one compilation unit, thereby reducing the size of the compiled
+binary and avoiding linker limitations on some platforms.

 The CUTLASS Profiler sources are stored in
 ```bash
@@ -403,7 +403,7 @@ $ cmake .. -DCUTLASS_NVCC_ARCHS=75 -DCUTLASS_LIBRARY_KERNELS=sgemm
 Compiling only the kernels desired reduces compilation time.

 To instantiate kernels of all tile sizes, data types, and alignment constraints, specify
-`-DCUTLASS_LIBRARY_KERNELS=all` when running `cmake`.
+`-DCUTLASS_LIBRARY_KERNELS=all` when running `cmake`.

 Several recipes are defined below for convenience. They may be combined as a comma-delimited list.

@@ -412,9 +412,12 @@ Several recipes are defined below for convenience. They may be combined as a com
 $ cmake .. -DCUTLASS_NVCC_ARCHS=80 -DCUTLASS_LIBRARY_KERNELS=tensorop*gemm
 ```

-**Example.** All kernels for NVIDIA Volta, Turing, and Ampere architectures.
+**Example.** All kernels for NVIDIA Volta, Turing, and Ampere architectures. Enabling
+the "unity build" instantiates multiple kernel instances in each compilation unit, thereby
+reducing binary size and avoiding linker limitations on some platforms.
 ```bash
-$ cmake .. -DCUTLASS_NVCC_ARCHS="70;75;80" -DCUTLASS_LIBRARY_KERNELS=all
+$ cmake .. -DCUTLASS_NVCC_ARCHS="70;75;80" -DCUTLASS_LIBRARY_KERNELS=all \
+  -DCUTLASS_UNITY_BUILD_ENABLED=ON
 ```

 **Example.** All GEMM kernels targeting Turing Tensor Cores.
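As an illustration of what "multiple kernel instances in each compilation unit" means for the unity build described in this commit, the sketch below groups source files into batches and produces one unity source per batch that simply `#include`s its members. This is not CUTLASS's CMake logic. The function `make_unity_sources`, the file names, and the batch size are all hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Returns the text of each generated unity source file. Every unity source
// includes up to `batch_size` of the original sources, so the compiler sees
// one compilation unit per batch instead of one per kernel instance.
std::vector<std::string> make_unity_sources(
    const std::vector<std::string>& sources, std::size_t batch_size) {
  assert(batch_size > 0);  // a batch must hold at least one source
  std::vector<std::string> unity;
  for (std::size_t begin = 0; begin < sources.size(); begin += batch_size) {
    const std::size_t end = std::min(begin + batch_size, sources.size());
    std::string text;
    for (std::size_t i = begin; i < end; ++i) {
      text += "#include \"" + sources[i] + "\"\n";
    }
    unity.push_back(text);
  }
  return unity;
}
```

For example, five hypothetical kernel sources with a batch size of two would yield three compilation units: two holding two kernels each and one holding the remaining kernel.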