Updating README with relative perf chart
This commit is contained in:
parent e2bf51c3fe
commit 0428c89fd5
@@ -5,7 +5,7 @@
 CUTLASS is a collection of CUDA C++ template abstractions for implementing
 high-performance matrix-multiplication (GEMM) at all levels and scales within CUDA.
 It incorporates strategies for hierarchical decomposition and data movement similar
-to those used to implement cuBLAS. CUTLASS decomposes these “moving parts” into
+to those used to implement cuBLAS. CUTLASS decomposes these "moving parts" into
 reusable, modular software components abstracted by C++ template classes. These
 thread-wide, warp-wide, block-wide, and device-wide primitives can be specialized
 and tuned via custom tiling sizes, data types, and other algorithmic policy. The
@@ -20,6 +20,13 @@ point (FP64) types. Furthermore, CUTLASS demonstrates CUDA's WMMA API for targeting
 the programmable, high-throughput _Tensor Cores_ provided by NVIDIA's Volta architecture
 and beyond.
+
+![CUTLASS performance relative to cuBLAS](media/cutlass-performance-plot.png)
+
+CUTLASS is very efficient, with performance comparable to cuBLAS for scalar GEMM
+computations. The above figure shows CUTLASS performance relative to cuBLAS
+compiled with CUDA 9.0 running on an NVIDIA Tesla V100 GPU for large matrix
+dimensions (M=10240, N=K=4096).
 
 For more exposition, see our Parallel Forall blog post ["CUTLASS: Fast Linear Algebra
 in CUDA C++"](https://devblogs.nvidia.com/parallelforall/cutlass-linear-algebra-cuda).
 
BIN media/cutlass-performance-plot.png (new file; binary file not shown; 39 KiB)
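The diff also mentions CUDA's WMMA API for the Volta Tensor Cores. A minimal sketch of that API (the real `nvcuda::wmma` entry points from CUDA 9; requires `nvcc -arch=sm_70` or newer, and is shown here untested):

```cuda
#include <mma.h>
using namespace nvcuda;

// One warp computes a single 16x16x16 half-precision matrix product with
// float accumulation via CUDA's WMMA API (Volta Tensor Cores and beyond).
// a is row-major 16x16, b is col-major 16x16, c is row-major 16x16.
__global__ void wmma_16x16x16(const half* a, const half* b, float* c) {
  wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
  wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
  wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

  wmma::fill_fragment(c_frag, 0.0f);               // C = 0
  wmma::load_matrix_sync(a_frag, a, 16);           // leading dimension 16
  wmma::load_matrix_sync(b_frag, b, 16);
  wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // C += A * B
  wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}
```

CUTLASS wraps fragments like these in its warp-wide primitives so the same block- and device-level tiling machinery can target Tensor Cores.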