From 537a4bcedfdb8df63459880cc09e84c7c8be70d1 Mon Sep 17 00:00:00 2001
From: Duane Merrill
Date: Tue, 5 Dec 2017 22:54:49 -0500
Subject: [PATCH] Update README.md

---
 README.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 4e5f38cc..11fe5a1c 100644
--- a/README.md
+++ b/README.md
@@ -27,7 +27,8 @@ in CUDA C++"](https://devblogs.nvidia.com/parallelforall/cutlass-linear-algebra-
 
 ![ALT](/media/cutlass-performance-plot.png "Relative performance of CUTLASS and cuBLAS for large matrices")
 
-CUTLASS is very efficient, with performance comparable to cuBLAS for scalar GEMM
+CUTLASS primitives are very efficient. When used to construct device-wide GEMM kernels,
+they exhibit performance comparable to cuBLAS for scalar GEMM
 computations. The above figure shows CUTLASS performance relative to cuBLAS for
 large matrix dimensions (M=10240, N=K=4096) running on an NVIDIA Tesla V100 GPU
 when compiled with CUDA 9.0.