From fd7e058d0cb3e4bf743edc530c7778a210cb168b Mon Sep 17 00:00:00 2001
From: Andrew Kerr <akerr@nvidia.com>
Date: Wed, 17 Jun 2020 07:09:18 -0700
Subject: [PATCH] Added examples to enable the unity build (#102)

* Updated documentation of fused GEMM example and removed UNITY BUILD batch size. The default batch size when unity build is enabled tends to be favorable.
---
 examples/13_fused_two_gemms/fused_gemm.cu | 26 ++++++++++++++++++++++-
 media/docs/profiler.md                    |  4 +++-
 media/docs/quickstart.md                  |  9 +++++---
 3 files changed, 34 insertions(+), 5 deletions(-)

diff --git a/examples/13_fused_two_gemms/fused_gemm.cu b/examples/13_fused_two_gemms/fused_gemm.cu
index 8f5d4f2c..a7856abe 100644
--- a/examples/13_fused_two_gemms/fused_gemm.cu
+++ b/examples/13_fused_two_gemms/fused_gemm.cu
@@ -22,8 +22,32 @@
  * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
  *
  **************************************************************************************************/
+/*
+
+This example fuses two GEMM mainloops into one kernel. The first GEMM computes relu(alpha*A*B) and
+the second GEMM computes relu(alpha*A*B+beta*C). The benchmark compares the fused kernel against
+two unfused GEMM operations, demonstrating a speedup of the fused kernel on the
+NVIDIA Turing GPU architecture.
+
+Problem size:
+
+  GEMM1 (M,N,K): 128*1600, 64, 576
+  GEMM2 (M,N,K): 128*1600, 128, 64
+
+Note that GEMM1_N = GEMM2_K, since the output of GEMM1 is the A operand of GEMM2.
+
+The example requires the number of threadblocks to be the same across the two GEMMs and
+thread_block_tile_N = problem_N, so the data required by each layer is threadblock-resident. It
+also requires warp_tile_N = thread_block_tile_N, so the data required by each warp is
+register-file-resident.
+
+Performance:
+
+  - fp16 on Tesla T4 @ 1590MHz (non-fused vs. fused): 1.39011 ms vs. 1.26035 ms
+  - int8 on Tesla T4 @ 1590MHz (non-fused vs. fused): 0.751759 ms vs. 0.62971 ms
+  - fp16 on Quadro RTX 8000 @ 1890MHz (non-fused vs. fused): 0.721144 ms vs. 0.629864 ms
+  - int8 on Quadro RTX 8000 @ 1890MHz (non-fused vs. fused): 0.379049 ms vs. 0.324764 ms
 
-/**
 */
 
 #include "b2b_gemm_f16t_f16n_f16t_tensor_op_f16_sm75.h"
diff --git a/media/docs/profiler.md b/media/docs/profiler.md
index ad4c58ab..7d2356c5 100644
--- a/media/docs/profiler.md
+++ b/media/docs/profiler.md
@@ -15,10 +15,12 @@ $ make cutlass_profiler -j
 To limit compilation time, only one tile size (128x128) is instantiated for each data type, math instruction, and layout.
 To instantiate all sizes, set the following environment variable when running CMake from an empty `build/` directory.
 ```bash
-$ cmake .. -DCUTLASS_NVCC_ARCHS="70;75;80" -DCUTLASS_LIBRARY_KERNELS=all
+$ cmake .. -DCUTLASS_NVCC_ARCHS="70;75;80" -DCUTLASS_LIBRARY_KERNELS=all -DCUTLASS_UNITY_BUILD_ENABLED=ON
 ...
 $ make cutlass_profiler -j
 ```
+Enabling the unity build places multiple kernel instances in one compilation unit, thereby reducing the
+size of the compiled binary and avoiding linker limitations on some platforms.
 
 The CUTLASS Profiler sources are stored in 
 ```bash
diff --git a/media/docs/quickstart.md b/media/docs/quickstart.md
index f40c41ec..082b4c10 100644
--- a/media/docs/quickstart.md
+++ b/media/docs/quickstart.md
@@ -403,7 +403,7 @@ $ cmake .. -DCUTLASS_NVCC_ARCHS=75 -DCUTLASS_LIBRARY_KERNELS=sgemm
 Compling only the kernels desired reduces compilation time.
 
 To instantiate kernels of all tile sizes, data types, and alignment constraints, specify 
-`-DCUTLASS_LIBRARY_KERNELS=all` when running `cmake`.
+`-DCUTLASS_LIBRARY_KERNELS=all` when running `cmake`. 
 
 Several recipes are defined below for convenience. They may be combined as a comma-delimited list.
 
@@ -412,9 +412,12 @@ Several recipes are defined below for convenience. They may be combined as a com
 $ cmake .. -DCUTLASS_NVCC_ARCHS=80 -DCUTLASS_LIBRARY_KERNELS=tensorop*gemm
 ```
 
-**Example.** All kernels for NVIDIA Volta, Turing, and Ampere architectures.
+**Example.** All kernels for NVIDIA Volta, Turing, and Ampere architectures. Enabling
+the "unity build" places multiple kernel instances in each compilation unit, thereby
+reducing binary size and avoiding linker limitations on some platforms.
 ```bash
-$ cmake .. -DCUTLASS_NVCC_ARCHS="70;75;80" -DCUTLASS_LIBRARY_KERNELS=all
+$ cmake .. -DCUTLASS_NVCC_ARCHS="70;75;80" -DCUTLASS_LIBRARY_KERNELS=all \
+   -DCUTLASS_UNITY_BUILD_ENABLED=ON
 ```
 
 **Example.** All GEMM kernels targeting Turing Tensor Cores.