cutlass

Author	SHA1	Message	Date
KeDengMS	0b74c8f473	Address CR	2021-04-19 23:36:06 +00:00
KeDengMS	83036ed646	More clean up	2021-04-18 04:29:20 +00:00
KeDengMS	b7e43f5eb9	Clean up	2021-04-18 04:24:25 +00:00
KeDengMS	5c62d892fa	Add test	2021-04-18 04:09:34 +00:00
KeDengMS	41a31b404b	Fixes to Gelu for half and fusion	2021-04-17 22:10:19 +00:00
Peter Han	7320aee17d	Typo fix issue#236 Signed-off-by: Peter Han <fujun.han@iluvatar.ai>	2021-04-15 15:08:35 +08:00
Peter Han	2142a05d9d	tranpose.h update based on issue#233 1. Add 'pragma once' preprocess directive 2. Replace prmt PTX with __byte_perm intrinsic Signed-off-by: Peter Han <fujun.han@iluvatar.ai>	2021-04-14 19:58:00 +08:00
Haicheng Wu	c77a524459	Merge pull request #230 from mani-ananth/master Fix for issue #221	2021-04-09 14:45:55 -04:00
Manikandan Ananth	fac6680f31	Merge branch 'master' of github.com:NVIDIA/cutlass	2021-04-09 11:36:31 -07:00
Manikandan Ananth	08993707da	fixing functional bug in fused epilogue	2021-04-09 11:36:03 -07:00
Haicheng Wu	c805593ebe	Merge pull request #228 from mani-ananth/master Fix for issue#224 and issue#225	2021-04-08 10:08:13 -04:00
Manikandan Ananth	26556d7206	fix a broken sparse gemm example. found by the community.	2021-04-07 13:32:55 -07:00
Manikandan Ananth	4839b6cb61	add 2stage fprop 3d into default file	2021-04-07 13:29:32 -07:00
Haicheng Wu	d97214987a	Merge pull request #220 from Peter9606/wrong-stride-array-definition Bugfix: typo, make reduction device cases passed	2021-04-02 08:43:52 -04:00
Haicheng Wu	b0bbc6d548	Merge pull request #219 from mani-ananth/master Fix for issue #211	2021-04-02 08:42:09 -04:00
Peter Han	7074047a54	Bugfix: typo, make reduction device cases passed Signed-off-by: Peter Han <fujun.han@iluvatar.ai>	2021-04-02 09:35:23 +08:00
Manikandan Ananth	75a4737cfe	Fix for public issue #211 - Add a slice-K tile size to the profiler - fix num warps calculations in implicit gemm header	2021-04-01 14:42:00 -07:00
Haicheng Wu	8a3e4b8d02	Merge pull request #214 from Peter9606/separate-stream-error Bugfix: memsetAsync uses wrong default stream	2021-03-24 12:09:01 -04:00
Peter Han	6a6b4028bd	Revert wrong fix of params.update in GemmUniversalBase Signed-off-by: Peter Han <fujun.han@iluvatar.ai>	2021-03-23 23:20:40 +08:00
Peter Han	92393b2676	Bugfix: memsetAsync uses wrong default stream Signed-off-by: Peter Han <fujun.han@iluvatar.ai>	2021-03-23 21:11:42 +08:00
Haicheng Wu	50bf00e5f2	Merge pull request #193 from Peter9606/public_shape_type_from_Mma_HFMA2 HFMA2 Convolutions for SM60 onwards	2021-03-05 21:38:59 -05:00
Manish Gupta	4cd004ead1	fix test name to optimized and instance large tile sizes to speed unit tests	2021-03-05 13:32:36 -08:00
Peter Han	6c4539e372	Make arch tag of test cases more precisely to SM60 Signed-off-by: Peter Han <fujun.han@iluvatar.ai>	2021-03-05 10:53:26 +08:00
Peter Han	a3639ab1a0	Append fp16 test case to verify Mma_HFMA2 Signed-off-by: Peter Han <fujun.han@iluvatar.ai>	2021-03-04 18:17:57 +08:00
Peter Han	169181f30f	Make Shape public from Mma_HFMA2. Signed-off-by: Peter Han <fujun.han@iluvatar.ai>	2021-03-04 11:05:16 +08:00
Haicheng Wu	0f1056390d	Create PUBLICATIONS.md (#189)	2021-03-03 11:17:40 -08:00
Haicheng Wu	34a42e5620	Update generator.py (#192)	2021-03-02 12:21:48 -08:00
Dustyn Blasig	8f09b82b12	Merge pull request #187 from NVIDIA/cutlass_2.5 CUTLASS 2.5.0	2021-02-26 23:56:04 -06:00
Andrew Kerr	200a5a5146	Enabled reduction unit tests.	2021-02-26 15:46:57 -05:00
Andrew Kerr	746b7b3247	Enabled tensor reduction kernels.	2021-02-26 15:32:19 -05:00
Andrew Kerr	abdf16a4d9	Updated release notes.	2021-02-26 13:55:04 -05:00
Andrew Kerr	0e13748649	CUTLASS 2.5	2021-02-26 09:58:26 -05:00
Manish Gupta	ccb697bac7	cutlass 2.4 documentation only update	2020-11-23 06:59:45 -06:00
Yang Wang	e6bcdc60cf	fix broken links (#148)	2020-11-19 21:46:54 -08:00
Manish Gupta	6615010cd0	CUTLASS 2.4 (Implicit GEMM convolution) (#147) CUTLASS 2.4 (Implicit GEMM Convolution) Co-authored-by: Manish Gupta <manigupta@nvidia.com>, Haicheng Wu <haichengw@nvidia.com>, Dustyn Blasig <dblasig@nvidia.com>, Andrew Kerr <akerr@nvidia.com>	2020-11-19 21:25:25 -08:00
Dustyn Blasig	c2b80ad4e4	Merge pull request #135 from NVIDIA/cutlass_2.3_final CUTLASS 2.3.0	2020-09-25 13:25:26 -05:00
akerr	37a8f9e598	CUTLASS 2.3.0 final.	2020-09-25 10:34:46 -07:00
Andrew Kerr	c53f3339bb	CUTLASS 2.3 initial commit (#134) CUTLASS 2.3 adds GEMMs targeting Sparse Tensor Cores on the NVIDIA Ampere Architecture, fast SGEMM, and small matrix classes, bug fixes, and performance enhancements.	2020-09-23 14:00:58 -07:00
hwu36	4dac7490e6	Typoes (#107) * Update splitk_gemm.cu * Update gemm_bias_relu.cu * Update mma_sm75.h	2020-07-13 14:25:52 -07:00
Andrew Kerr	fd7e058d0c	Added examples to enable the unity build (#102) * Updated documentation of fused GEMM example and removed UNITY BUILD batch size. The default batch size when unity build is enabled tends to be favorable.	2020-06-17 07:09:18 -07:00
Andrew Kerr	1ab1027954	Updated mma_sm80.h to avoid perf penalty due to reinterpret_cast<>. (#100) - Updated mma_sm80.h to avoid perf penalty due to reinterpret_cast<>. - Enhancement to CUTLASS Utility Library's HostTensorPlanarComplex template to support copy-in and copy-out - Added test_examples target to build and test all CUTLASS examples - Minor edits to documentation to point to GTC 2020 webinar	2020-06-15 10:47:01 -07:00
Andrew Kerr	86931fef85	CUTLASS 2.2 (#96) Adds support for NVIDIA Ampere Architecture features. CUDA 11 Toolkit recommended.	2020-06-08 16:17:35 -07:00
Vijay Thakkar	e33d90b361	update tools/library/CMakeLists to require python 3.6 according to #70 (#82) #70 only updates the documentation. This commit reflects this bump in python version to the CMake configuration as well.	2020-04-08 10:54:36 -07:00
Andrew Kerr	96dab34ad9	CUTLASS 2.1 (#83) CUTLASS 2.1 contributes: - BLAS-style host-side API added to CUTLASS Library - Planar Complex GEMM kernels targeting Volta and Turing Tensor Cores - Minor enhancements and bug fixes	2020-04-07 13:51:25 -07:00
Andrew Kerr	7c0cd26d13	Need Python 3.6 to use enum.auto() (#70)	2019-11-22 09:39:12 -08:00
Andrew Kerr	45ecbc885b	Removed redundant conjugation operations from matrix_traits. (#65)	2019-11-20 11:27:13 -08:00
Andrew Kerr	8aca98f9a7	Improved formatting, clarity, and content of several documents. (#64) * Improved formatting, clarity, and content of several documents.	2019-11-20 10:42:15 -08:00
Dustyn Blasig	f4d9c8f755	Clang GPU compilation requires explicit CUDACC version flags (#63)	2019-11-20 09:52:11 -08:00
Andrew Kerr	fb335f6a5f	CUTLASS 2.0 (#62) CUTLASS 2.0 Substantially refactored for - Better performance, particularly for native Turing Tensor Cores - Robust and durable templates spanning the design space - Encapsulated functionality embodying modern C++11 programming techniques - Optimized containers and data types for efficient, generic, portable device code Updates to: - Quick start guide - Documentation - Utilities - CUTLASS Profiler Native Turing Tensor Cores - Efficient GEMM kernels targeting Turing Tensor Cores - Mixed-precision floating point, 8-bit integer, 4-bit integer, and binarized operands Coverage of existing CUTLASS functionality: - GEMM kernels targeting CUDA and Tensor Cores in NVIDIA GPUs - Volta Tensor Cores through native mma.sync and through WMMA API - Optimizations such as parallel reductions, threadblock rasterization, and intra-threadblock reductions - Batched GEMM operations - Complex-valued GEMMs Note: this commit and all that follow require a host compiler supporting C++11 or greater.	2019-11-19 16:55:34 -08:00
Andrew Kerr	b5cab177a9	Performance enhancement for Volta Tensor Cores TN layout (#53) * Fixed performance defect with indirect access to pointer array for Volta TensorCores TN arrangement. * Updated patch version and changelog. * Updated patch version and changelog. * Added link to changelog in readme. * Fixed markdown link	2019-07-10 10:54:12 -07:00

322 Commits