diff --git a/README.md b/README.md
index ad859f04..b1d4e2d8 100644
--- a/README.md
+++ b/README.md
@@ -5,7 +5,7 @@
 CUTLASS is a collection of CUDA C++ template abstractions for implementing
 high-performance matrix-multiplication (GEMM) at all levels and scales within CUDA.
 It incorporates strategies for hierarchical decomposition and data movement similar
-to those used to implement cuBLAS. CUTLASS decomposes these “moving parts” into
+to those used to implement cuBLAS. CUTLASS decomposes these "moving parts" into
 reusable, modular software components abstracted by C++ template classes. These
 thread-wide, warp-wide, block-wide, and device-wide primitives can be specialized
 and tuned via custom tiling sizes, data types, and other algorithmic policy. The
@@ -20,6 +20,13 @@ point (FP64) types. Furthermore, CUTLASS demonstrates CUDA's WMMA API for targe
 the programmable, high-throughput _Tensor Cores_ provided by NVIDIA's Volta
 architecture and beyond.
 
+![Relative performance of CUTLASS and cuBLAS](/media/cutlass-performance-plot.png "Relative performance of CUTLASS and cuBLAS for large matrices")
+
+CUTLASS is very efficient, with performance comparable to cuBLAS for scalar GEMM
+computations. The above figure shows CUTLASS performance relative to cuBLAS
+compiled with CUDA 9.0 running on an NVIDIA Tesla V100 GPU for large matrix
+dimensions (M=10240, N=K=4096).
+
 For more exposition, see our Parallel Forall blog post
 ["CUTLASS: Fast Linear Algebra in CUDA C++"](https://devblogs.nvidia.com/parallelforall/cutlass-linear-algebra-cuda).
 
diff --git a/media/cutlass-performance-plot.png b/media/cutlass-performance-plot.png
new file mode 100644
index 00000000..96171ed0
Binary files /dev/null and b/media/cutlass-performance-plot.png differ
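
The README text in this patch describes CUTLASS decomposing GEMM into a hierarchy of tiles (device-wide, block-wide, warp-wide, thread-wide) with tunable tile sizes. As a rough illustration of that idea only (this is a plain-C++ CPU sketch, not the CUTLASS API; the tile sizes and names below are made up for the example), the decomposition amounts to nesting tile loops, with each inner tile accumulating its own fragment of C = A * B:

```cpp
#include <cassert>

// Illustrative GEMM dimensions and tile sizes (hypothetical, chosen so the
// tiles divide the problem evenly; CUTLASS selects these via template policies).
constexpr int M = 8, N = 8, K = 8;
constexpr int kBlockTile  = 4;  // stand-in for a "threadblock-wide" tile
constexpr int kThreadTile = 2;  // stand-in for a "thread-wide" tile

// Row-major C (MxN) = A (MxK) * B (KxN), computed tile-by-tile.
void gemm_tiled(const float* A, const float* B, float* C) {
  // Outer loops: block-level tiles of C.
  for (int bm = 0; bm < M; bm += kBlockTile)
    for (int bn = 0; bn < N; bn += kBlockTile)
      // Each block-level tile is decomposed into thread-level tiles.
      for (int tm = bm; tm < bm + kBlockTile; tm += kThreadTile)
        for (int tn = bn; tn < bn + kBlockTile; tn += kThreadTile)
          // Each thread-level tile accumulates its fragment of C.
          for (int i = tm; i < tm + kThreadTile; ++i)
            for (int j = tn; j < tn + kThreadTile; ++j) {
              float acc = 0.0f;
              for (int k = 0; k < K; ++k)  // reduction along K
                acc += A[i * K + k] * B[k * N + j];
              C[i * N + j] = acc;
            }
}
```

On a GPU the two inner tile levels map to threadblocks and threads (with warp-level tiles in between), and the data for each tile is staged through shared memory and registers; the loop nest above only shows the index decomposition, not the data movement.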