Update README.md
parent 6f091f5620
commit 5bd3f09312
README.md | 12
@@ -20,15 +20,17 @@ point (FP64) types. Furthermore, CUTLASS demonstrates CUDA's WMMA API for targe
 the programmable, high-throughput _Tensor Cores_ provided by NVIDIA's Volta architecture
 and beyond.
 
+For more exposition, see our Parallel Forall blog post ["CUTLASS: Fast Linear Algebra
+in CUDA C++"](https://devblogs.nvidia.com/parallelforall/cutlass-linear-algebra-cuda).
+
+# Performance
+
 [figure: CUTLASS performance relative to cuBLAS]
 
 CUTLASS is very efficient, with performance comparable to cuBLAS for scalar GEMM
 computations. The above figure shows CUTLASS performance relative to cuBLAS
-compiled with CUDA 9.0 running on an NVIDIA Tesla V100 GPU for large matrix
-dimensions (M=10240, N=K=4096).
+for large matrix dimensions (M=10240, N=K=4096) running on an NVIDIA Tesla V100 GPU
+when compiled with CUDA 9.0.
 
-For more exposition, see our Parallel Forall blog post ["CUTLASS: Fast Linear Algebra
-in CUDA C++"](https://devblogs.nvidia.com/parallelforall/cutlass-linear-algebra-cuda).
-
 # Project Structure
 