Updating README with relative perf chart
This commit is contained in:
parent e2bf51c3fe
commit 0428c89fd5
@@ -5,7 +5,7 @@
 CUTLASS is a collection of CUDA C++ template abstractions for implementing
 high-performance matrix-multiplication (GEMM) at all levels and scales within CUDA.
 It incorporates strategies for hierarchical decomposition and data movement similar
-to those used to implement cuBLAS. CUTLASS decomposes these “moving parts” into
+to those used to implement cuBLAS. CUTLASS decomposes these "moving parts" into
 reusable, modular software components abstracted by C++ template classes. These
 thread-wide, warp-wide, block-wide, and device-wide primitives can be specialized
 and tuned via custom tiling sizes, data types, and other algorithmic policy. The
@@ -20,6 +20,13 @@ point (FP64) types. Furthermore, CUTLASS demonstrates CUDA's WMMA API for targeting
 the programmable, high-throughput _Tensor Cores_ provided by NVIDIA's Volta architecture
 and beyond.
+
+![CUTLASS performance relative to cuBLAS](media/cutlass-performance-plot.png)
+
+CUTLASS is very efficient, with performance comparable to cuBLAS for scalar GEMM
+computations. The above figure shows CUTLASS performance relative to cuBLAS
+compiled with CUDA 9.0 running on an NVIDIA Tesla V100 GPU for large matrix
+dimensions (M=10240, N=K=4096).
 
 For more exposition, see our Parallel Forall blog post ["CUTLASS: Fast Linear Algebra
 in CUDA C++"](https://devblogs.nvidia.com/parallelforall/cutlass-linear-algebra-cuda).
 
BIN media/cutlass-performance-plot.png (new file; binary file not shown; 39 KiB)
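The diff also mentions CUDA's WMMA API for the Volta Tensor Cores. A minimal sketch of that API (the real `nvcuda::wmma` entry points from CUDA 9; requires `nvcc -arch=sm_70` or newer, and is shown here untested):

```cuda
#include <mma.h>
using namespace nvcuda;

// One warp computes a single 16x16x16 half-precision matrix product with
// float accumulation via CUDA's WMMA API (Volta Tensor Cores and beyond).
// a is row-major 16x16, b is col-major 16x16, c is row-major 16x16.
__global__ void wmma_16x16x16(const half* a, const half* b, float* c) {
  wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
  wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
  wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

  wmma::fill_fragment(c_frag, 0.0f);               // C = 0
  wmma::load_matrix_sync(a_frag, a, 16);           // leading dimension 16
  wmma::load_matrix_sync(b_frag, b, 16);
  wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // C += A * B
  wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}
```

CUTLASS wraps fragments like these in its warp-wide primitives so the same block- and device-level tiling machinery can target Tensor Cores.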