From e2bf51c3fe1e4575c3c4937d5faa709e2c93d959 Mon Sep 17 00:00:00 2001
From: Duane Merrill <duane.merrill@gmail.com>
Date: Tue, 5 Dec 2017 22:25:42 -0500
Subject: [PATCH] Update README.md

---
 README.md | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 87fd9a6e..ad859f04 100644
--- a/README.md
+++ b/README.md
@@ -20,6 +20,9 @@ point (FP64) types.  Furthermore, CUTLASS demonstrates CUDA's WMMA API for targe
 the programmable, high-throughput _Tensor Cores_ provided by NVIDIA's Volta architecture 
 and beyond.
 
+For a more detailed discussion, see our Parallel Forall blog post ["CUTLASS: Fast Linear Algebra 
+in CUDA C++"](https://devblogs.nvidia.com/parallelforall/cutlass-linear-algebra-cuda). 
+
 # Project Structure
 
 CUTLASS is arranged as a header-only library with several example test programs
@@ -56,7 +59,7 @@ transpositions.  Be sure to specify your target architecture.
 
      <s|d|h|i|w>gemm_<nn|nt|tn|tt>
            [--help]
-           [--schmoo || --m=<height> --n=<width> --k=<depth>]
+           [--schmoo=<#schmoo-samples> || --m=<height> --n=<width> --k=<depth>]
            [--i=<timing iterations>]
            [--device=<device-id>]
            [--alpha=<alpha> --beta=<beta>]