Added examples to enable the unity build (#102)
* Updated documentation of fused GEMM example and removed UNITY BUILD batch size. The default batch size when unity build is enabled tends to be favorable.
parent 1ab1027954
commit fd7e058d0c
@@ -22,8 +22,32 @@
 * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 *
 **************************************************************************************************/
-/**
-
+/*
+
+This example shows fusing two GEMM mainloops into one kernel. The first GEMM computes relu(alpha*A*B) and
+the second GEMM computes relu(alpha*A*B+beta*C). The performance measuring environment compares against
+two unfused GEMM operations, demonstrating a speedup of the fused kernel on the
+NVIDIA Turing GPU architecture.
+
+Problem size:
+
+GEMM1 (M,N,K): 128*1600, 64, 576
+GEMM2 (M,N,K): 128*1600, 128, 64
+
+Note that GEMM1_N = GEMM2_K
+
+The example requires the number of threadblocks be the same across 2 GEMMs and
+thread_block_tile_N = problem_N so the data required by each layer is threadblock-resident. It
+also requires warp_tile_N = thread_block_tile_N so the data required by each warp is
+register-file-resident.
+
+Performance:
+
+- fp16 on Tesla T4 @ 1590MHz (non-fused vs. fused): 1.39011 ms vs. 1.26035 ms
+- int8 on Tesla T4 @ 1590MHz (non-fused vs. fused): 0.751759 ms vs. 0.62971 ms
+- fp16 on Quadro RTX 8000 @ 1890MHz (non-fused vs. fused): 0.721144 ms vs. 0.629864 ms
+- int8 on Quadro RTX 8000 @ 1890MHz (non-fused vs. fused): 0.379049 ms vs. 0.324764 ms
+
 */
 
 #include "b2b_gemm_f16t_f16n_f16t_tensor_op_f16_sm75.h"

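The fusion described in the comment above can be illustrated with a small CPU reference. This is only a sketch under stated assumptions, not the CUTLASS kernel: it assumes the second GEMM's A operand is the output of the first GEMM (consistent with GEMM1_N = GEMM2_K), and the names `Matrix`, `gemm_relu`, `fused_b2b_gemm_relu`, and `tile_m` are hypothetical. The fused version computes a full-width tile of the intermediate result and consumes it immediately, which mirrors why the example requires thread_block_tile_N = problem_N.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major matrix in a flat vector (illustrative helper, not a CUTLASS type).
struct Matrix {
  std::size_t rows, cols;
  std::vector<float> data;
  Matrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0f) {}
  float& at(std::size_t r, std::size_t c) { return data[r * cols + c]; }
  float at(std::size_t r, std::size_t c) const { return data[r * cols + c]; }
};

inline float relu(float x) { return std::max(x, 0.0f); }

// Unfused reference: D = relu(alpha * A * B + beta * C); C may be null.
Matrix gemm_relu(const Matrix& A, const Matrix& B, float alpha,
                 const Matrix* C = nullptr, float beta = 0.0f) {
  assert(A.cols == B.rows);
  Matrix D(A.rows, B.cols);
  for (std::size_t m = 0; m < A.rows; ++m) {
    for (std::size_t n = 0; n < B.cols; ++n) {
      float acc = 0.0f;
      for (std::size_t k = 0; k < A.cols; ++k) acc += A.at(m, k) * B.at(k, n);
      float v = alpha * acc;
      if (C != nullptr) v += beta * C->at(m, n);
      D.at(m, n) = relu(v);
    }
  }
  return D;
}

// Fused sketch: for each block of tile_m rows, run the first mainloop into a
// local buffer spanning all GEMM1_N columns, then run the second mainloop on
// that buffer without writing the intermediate matrix out.
Matrix fused_b2b_gemm_relu(const Matrix& A0, const Matrix& B0, float alpha0,
                           const Matrix& B1, const Matrix& C1,
                           float alpha1, float beta1, std::size_t tile_m) {
  assert(A0.cols == B0.rows && B0.cols == B1.rows);  // GEMM1_N == GEMM2_K
  assert(C1.rows == A0.rows && C1.cols == B1.cols && tile_m > 0);
  Matrix D1(A0.rows, B1.cols);
  std::vector<float> tile(tile_m * B0.cols);  // tile_m x GEMM1_N intermediate
  for (std::size_t m0 = 0; m0 < A0.rows; m0 += tile_m) {
    const std::size_t mt = std::min(tile_m, A0.rows - m0);
    for (std::size_t i = 0; i < mt; ++i) {      // first mainloop
      for (std::size_t n = 0; n < B0.cols; ++n) {
        float acc = 0.0f;
        for (std::size_t k = 0; k < A0.cols; ++k) acc += A0.at(m0 + i, k) * B0.at(k, n);
        tile[i * B0.cols + n] = relu(alpha0 * acc);
      }
    }
    for (std::size_t i = 0; i < mt; ++i) {      // second mainloop, K = GEMM1_N
      for (std::size_t n = 0; n < B1.cols; ++n) {
        float acc = 0.0f;
        for (std::size_t k = 0; k < B0.cols; ++k) acc += tile[i * B0.cols + k] * B1.at(k, n);
        float v = alpha1 * acc;
        v += beta1 * C1.at(m0 + i, n);
        D1.at(m0 + i, n) = relu(v);
      }
    }
  }
  return D1;
}
```

Calling `gemm_relu` twice and calling `fused_b2b_gemm_relu` once should produce the same output for the same inputs; the fused form simply never materializes the full intermediate matrix.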
@@ -15,10 +15,12 @@ $ make cutlass_profiler -j
 To limit compilation time, only one tile size (128x128) is instantiated for each data type, math instruction, and layout.
 To instantiate all sizes, set the following environment variable when running CMake from an empty `build/` directory.
 ```bash
-$ cmake .. -DCUTLASS_NVCC_ARCHS="70;75;80" -DCUTLASS_LIBRARY_KERNELS=all
+$ cmake .. -DCUTLASS_NVCC_ARCHS="70;75;80" -DCUTLASS_LIBRARY_KERNELS=all -DCUTLASS_UNITY_BUILD_ENABLED=ON
 ...
 $ make cutlass_profiler -j
 ```
+Enabling the unity build places multiple kernel instances in one compilation unit, thereby reducing size of the compiled
+binary and avoiding linker limitations on some platforms.
 
 The CUTLASS Profiler sources are stored in
 ```bash

@@ -412,9 +412,12 @@ Several recipes are defined below for convenience. They may be combined as a com
 $ cmake .. -DCUTLASS_NVCC_ARCHS=80 -DCUTLASS_LIBRARY_KERNELS=tensorop*gemm
 ```
 
-**Example.** All kernels for NVIDIA Volta, Turing, and Ampere architectures.
+**Example.** All kernels for NVIDIA Volta, Turing, and Ampere architectures. Enabling
+the "unity build" instantiates multiple kernel instances in each compilation unit, thereby
+reducing binary size and avoiding linker limitations on some platforms.
 ```bash
-$ cmake .. -DCUTLASS_NVCC_ARCHS="70;75;80" -DCUTLASS_LIBRARY_KERNELS=all
+$ cmake .. -DCUTLASS_NVCC_ARCHS="70;75;80" -DCUTLASS_LIBRARY_KERNELS=all \
+-DCUTLASS_UNITY_BUILD_ENABLED=ON
 ```
 
 **Example.** All GEMM kernels targeting Turing Tensor Cores.

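As a conceptual illustration of what "multiple kernel instances in one compilation unit" means, the sketch below generates the text of unity sources that each `#include` a batch of kernel files. It is not how CUTLASS or CMake implements the feature; the function names `make_unity_source` and `make_unity_batches` and any file names passed to them are hypothetical, and the batch size parameter only stands in for the unity build batch size mentioned in the commit message.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Builds the text of one unity translation unit: a file that does nothing
// but #include each source in the batch, so they compile as a single unit.
std::string make_unity_source(const std::vector<std::string>& sources) {
  std::string out;
  for (const std::string& s : sources) {
    out += "#include \"" + s + "\"\n";
  }
  return out;
}

// Splits the source list into consecutive batches of at most batch_size
// files and returns one unity source text per batch.
std::vector<std::string> make_unity_batches(const std::vector<std::string>& sources,
                                            std::size_t batch_size) {
  assert(batch_size > 0);
  std::vector<std::string> units;
  for (std::size_t begin = 0; begin < sources.size(); begin += batch_size) {
    const std::size_t end = std::min(begin + batch_size, sources.size());
    std::vector<std::string> batch(sources.begin() + begin, sources.begin() + end);
    units.push_back(make_unity_source(batch));
  }
  return units;
}
```

With five hypothetical kernel files and a batch size of two, this yields three unity sources instead of five separate compilation units, which is the effect the documentation change attributes to the unity build.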