90 Commits

Author SHA1 Message Date
Manish Gupta
ccb697bac7 cutlass 2.4 documentation only update 2020-11-23 06:59:45 -06:00
Yang Wang
e6bcdc60cf
fix broken links (#148) 2020-11-19 21:46:54 -08:00
Manish Gupta
6615010cd0
CUTLASS 2.4 (Implicit GEMM convolution) (#147)
Co-authored-by: Manish Gupta <manigupta@nvidia.com>, Haicheng Wu <haichengw@nvidia.com>, Dustyn Blasig <dblasig@nvidia.com>, Andrew Kerr <akerr@nvidia.com>
2020-11-19 21:25:25 -08:00
Dustyn Blasig
c2b80ad4e4
Merge pull request #135 from NVIDIA/cutlass_2.3_final
CUTLASS 2.3.0
2020-09-25 13:25:26 -05:00
akerr
37a8f9e598 CUTLASS 2.3.0 final. 2020-09-25 10:34:46 -07:00
Andrew Kerr
c53f3339bb
CUTLASS 2.3 initial commit (#134)
CUTLASS 2.3 adds GEMMs targeting Sparse Tensor Cores on the NVIDIA Ampere Architecture, fast SGEMM, small matrix classes, bug fixes, and performance enhancements.
2020-09-23 14:00:58 -07:00
hwu36
4dac7490e6
Typos (#107)
* Update splitk_gemm.cu

* Update gemm_bias_relu.cu

* Update mma_sm75.h
2020-07-13 14:25:52 -07:00
Andrew Kerr
fd7e058d0c
Added examples to enable the unity build (#102)
* Updated documentation of fused GEMM example and removed UNITY BUILD batch size. The default batch size when unity build is enabled tends to be favorable.
2020-06-17 07:09:18 -07:00
Andrew Kerr
1ab1027954
Updated mma_sm80.h to avoid perf penalty due to reinterpret_cast<>. (#100)
- Updated mma_sm80.h to avoid perf penalty due to reinterpret_cast<>.
- Enhancement to CUTLASS Utility Library's HostTensorPlanarComplex template to support copy-in and copy-out
- Added test_examples target to build and test all CUTLASS examples
- Minor edits to documentation to point to GTC 2020 webinar
2020-06-15 10:47:01 -07:00
Andrew Kerr
86931fef85
CUTLASS 2.2 (#96)
Adds support for NVIDIA Ampere Architecture features. CUDA 11 Toolkit recommended.
2020-06-08 16:17:35 -07:00
Vijay Thakkar
e33d90b361
update tools/library/CMakeLists to require python 3.6 according to #70 (#82)
#70 only updates the documentation. This commit reflects this bump in python version to the CMake configuration as well.
2020-04-08 10:54:36 -07:00
Andrew Kerr
96dab34ad9
CUTLASS 2.1 (#83)
CUTLASS 2.1 contributes:
- BLAS-style host-side API added to CUTLASS Library
- Planar Complex GEMM kernels targeting Volta and Turing Tensor Cores
- Minor enhancements and bug fixes
2020-04-07 13:51:25 -07:00
Andrew Kerr
7c0cd26d13
Need Python 3.6 to use enum.auto() (#70) 2019-11-22 09:39:12 -08:00
Andrew Kerr
45ecbc885b
Removed redundant conjugation operations from matrix_traits. (#65) 2019-11-20 11:27:13 -08:00
Andrew Kerr
8aca98f9a7
Improved formatting, clarity, and content of several documents. (#64)
2019-11-20 10:42:15 -08:00
Dustyn Blasig
f4d9c8f755 Clang GPU compilation requires explicit CUDACC version flags (#63) 2019-11-20 09:52:11 -08:00
Andrew Kerr
fb335f6a5f
CUTLASS 2.0 (#62)
CUTLASS 2.0

Substantially refactored for

- Better performance, particularly for native Turing Tensor Cores
- Robust and durable templates spanning the design space
- Encapsulated functionality embodying modern C++11 programming techniques
- Optimized containers and data types for efficient, generic, portable device code

Updates to:
- Quick start guide
- Documentation
- Utilities
- CUTLASS Profiler

Native Turing Tensor Cores
- Efficient GEMM kernels targeting Turing Tensor Cores
- Mixed-precision floating point, 8-bit integer, 4-bit integer, and binarized operands

Coverage of existing CUTLASS functionality:
- GEMM kernels targeting CUDA and Tensor Cores in NVIDIA GPUs
- Volta Tensor Cores through native mma.sync and through WMMA API
- Optimizations such as parallel reductions, threadblock rasterization, and intra-threadblock reductions
- Batched GEMM operations
- Complex-valued GEMMs

Note: this commit and all that follow require a host compiler supporting C++11 or greater.
2019-11-19 16:55:34 -08:00
Andrew Kerr
b5cab177a9
Performance enhancement for Volta Tensor Cores TN layout (#53)
* Fixed performance defect with indirect access to pointer array for Volta TensorCores TN arrangement.

* Updated patch version and changelog.

* Added link to changelog in readme.

* Fixed markdown link
2019-07-10 10:54:12 -07:00
Timmy
eb41735933
Merge pull request #47 from Artem-B/cutlass-1.3-clang
Make CUTLASS compilable with Clang.
2019-05-13 10:52:45 -07:00
Artem Belevich
fb8b3a98b7 Addressed code review comments. 2019-05-10 10:24:52 -07:00
gthomascollignon
d9d357877f Added missing file (#48) 2019-05-09 14:07:52 -07:00
Artem Belevich
e18292db46 Make CUTLASS compilable with Clang.
Requires a recent clang build (r359248 or newer).

Enable compilation with clang with these options:
cmake -DCUDA_COMPILER=clang -DCMAKE_CXX_COMPILER=/path/to/clang++
2019-05-02 11:00:22 -07:00
Timmy
fe3438a3c1 cutlass 1.3.1 (#46)
CUTLASS 1.3.1 patch resolves failing test with NVRTC.
2019-04-19 16:54:52 -07:00
Andrew Kerr
877bdcace6
Cutlass 1.3 Release (#42)
CUTLASS 1.3 Release
- Efficient GEMM kernel targeting Volta Tensor Cores via mma.sync instruction added in CUDA 10.1.
2019-03-20 10:49:17 -07:00
Andrew Kerr
19a9d64e3c
Removed patch version from README.
2018-12-19 15:20:43 -08:00
Andrew Kerr
80e6f7c860
Merge pull request #38 from NVIDIA/resolve_maxwell
Resolved issue for incorrect SGEMM on Maxwell architecture.
2018-12-19 15:17:41 -08:00
akerr
822b0952cd Resolved issue for incorrect SGEMM on Maxwell architecture. 2018-12-19 15:07:16 -08:00
Andrew Kerr
ed2ed4d667
Merge pull request #33 from NVIDIA/cutlass_1.2
CUTLASS 1.2
2018-10-26 14:59:50 -07:00
Andrew Kerr
4db423c40f
Minor edit to CHANGELOG. 2018-10-26 14:58:31 -07:00
Andrew Kerr
b2bc0d3b79 Updating Doxygen docs 2018-10-26 14:54:58 -07:00
akerr
74df0331f2 CUTLASS 1.2 2018-10-26 14:38:46 -07:00
Andrew Kerr
2332df492e
Merge pull request #30 from NVIDIA/fix_utilities_example
Fixed cutlass_utilities example.
2018-09-29 15:09:18 -07:00
akerr
cfe4b933ef CUDA 9 lacks host-side conversions from float=>half. Instead, we must reinterpret_cast<> from cutlass::half_t => half. 2018-09-29 15:04:20 -07:00
Andrew Kerr
6877595a5e
Merge pull request #28 from NVIDIA/cutlass_1.1
Fixed typo
2018-09-28 12:59:49 -07:00
Andrew Kerr
69e3709da4
Fixed typo
2018-09-28 12:59:20 -07:00
Andrew Kerr
d419094c28
Merge pull request #26 from NVIDIA/cutlass_1.1
Clarification to README
2018-09-21 11:44:47 -07:00
akerr
1a7ac522f8 Clarification to README 2018-09-20 11:04:03 -07:00
Andrew Kerr
bf6eec53eb
Merge pull request #25 from NVIDIA/cutlass_1.1
Updated CUTLASS.md
2018-09-19 21:33:04 -07:00
akerr
206e38dac5 Updated copyright of CUTLASS.md 2018-09-19 21:31:12 -07:00
Andrew Kerr
d85f6a1cec
Merge pull request #24 from NVIDIA/cutlass_1.1
Cutlass 1.1
2018-09-19 21:16:53 -07:00
akerr
0826572c4c Reduced range of random values to avoid bit-level inconsistencies for large matrices. 2018-09-19 21:11:48 -07:00
akerr
77d1e0ca81 Updated README and CHANGELOG. 2018-09-19 20:42:51 -07:00
akerr
d7137f9c0a Updated doxygen 2018-09-19 14:02:08 -07:00
akerr
461f417b9d Checkpointing CUTLASS 1.1 release. 2018-09-18 16:58:03 -07:00
Andrew Kerr
cf0301e00f
Merge pull request #15 from NVIDIA/release_1.0.1_edits
Minor edits to README and changelog pursuant to CUTLASS 1.0.1 patch.
2018-06-26 13:59:01 -07:00
akerr
b9bb0d1a49 Edits to README and changelog pursuant to CUTLASS 1.0.1 patch. 2018-06-26 13:57:39 -07:00
Andrew Kerr
e1c4ba501b
Merge pull request #13 from NVIDIA/cutlass_v1.0.1
Cutlass v1.0.1
2018-06-12 08:25:56 -07:00
akerr
c566e83e6d Updated changelog. 2018-06-11 14:54:07 -07:00
akerr
374882be53 Replaced GoogleTest copy with submodule. Added updates to support intra-threadblock reductions. Added tests for same. 2018-06-11 11:47:15 -07:00
akerr
2c496c3e9e Replaced GoogleTest copy with Git submodule. 2018-06-11 11:32:41 -07:00