cutlass

Author	SHA1	Message	Date
Shuai Shao	ce8597dc14	Fix type bug in conv2d/gemm with broadcast (#796 ) add ElementVector --------- Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2023-02-09 20:53:25 -05:00
Vijay Thakkar	277bd6e537	CUTLASS 3.0.0 (#786 ) * CUTLASS 3.0.0	2023-01-23 20:55:28 -05:00
ANIKET SHIVAM	66d9cddc83	New updates for 2.11 (#775 ) * New updates. * Minor profiler updates Co-authored-by: Aniket Shivam <ashivam@nvidia.com>	2023-01-20 16:32:57 -05:00
Aditya Atluri	c975e2ccbb	releaase 2.11 (#703 )	2022-11-19 09:02:15 -05:00
Andrew Kerr	fc9ebc645b	CUTLASS 2.10 bug fixes and minor updates. (#626 )	2022-09-15 16:20:33 -04:00
ANIKET SHIVAM	e773429f7e	CUTLASS 2.10 updates (#622 ) Co-authored-by: Aniket Shivam <ashivam@nvidia.com>	2022-09-12 21:26:30 -04:00
ANIKET SHIVAM	b72cbf957d	CUTLASS 2.10 (#615 ) Co-authored-by: Aniket Shivam <ashivam@nvidia.com>	2022-09-03 18:48:46 -04:00
Ivan Komarov	0b8cacd6f1	Remove redundant <fstream> includes (#563 ) * Remove redundant <fstream> includes * Fix fstream in examples/ * Fix <fstream> in test/ * Use consistent order for <fstream> (always after <iostream>) * Remove an unneeded include in a file where std::ofstream usage is commented out Co-authored-by: Ivan Komarov <dfyz@yandex-team.ru>	2022-07-19 15:23:54 -04:00
Jacob He	8a766804ad	Fix doc in testbed_gemm_with_broadcast (#559 )	2022-07-07 09:56:16 -04:00
Jack Kosaian	fa56763c25	Fix occupancy calculation for grouped GEMM (#532 )	2022-06-18 19:53:59 -04:00
Haicheng Wu	6023038bae	add verification of the reduction tensor (#489 ) Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2022-05-06 10:24:51 -07:00
Andrew Kerr	12f4108ac2	CUTLASS 2.9 (#468 )	2022-04-23 15:02:38 -04:00
Jongsoo Park	3cfa5db2a2	Actually use float accumulation in gemm_f16t_f16t_f16t_wmma_tensor_op… (#407 ) * Actually use float accumulation in gemm_f16t_f16t_f16t_wmma_tensor_op_f32_sm70.cu As title * Update gemm_f16t_f16t_f16t_wmma_tensor_op_f32_sm70.cu change the missing one Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>	2022-02-16 09:53:21 -05:00
Andrew Kerr	8a951b2940	Enable convolution with fused epilogue for Volta Tensor Cores (#402 ) * Enabled convolution with epilogue fusion for Volta Tensor Cores. * Compilation fixes * Disabled testing Volta on Ampere architectures.	2022-01-30 23:24:50 -05:00
masahi	c2ee13a0fe	Add epilogue functor for residual block fusion (#391 ) * Add epilogue functor for residual block fusion * Do not run split-k tests when ActivationOp is not Identity * explain TestSplitK param * return early	2021-12-29 22:53:40 -05:00
Andrew Kerr	ec4f7e5194	Updates to fused epilogue (#383 ) * Enhancements and fixes to fused GEMM and Convolution epilogue. * Need to explicitly list cudart as unit test library dependency.	2021-12-17 16:04:43 -05:00
Manish Gupta	808c25337a	CUTLASS 2.8 (#363 ) CUTLASS 2.8	2021-11-19 13:26:35 -08:00
Manish Gupta	2e07c4cc2f	CUTLASS 2.7 (#318 ) CUTLASS 2.7 * Mainloop fusion for GEMM: summation over A or B * Strided DGRAD (optimized iterators) * Half-precision GELU_taylor activation functions; use these when accumulation and epilogue compute types are all cutlass::half_t * Tuning and bug fixes to fused GEMM + GEMM example * Support for smaller than 128b aligned Convolutions: see examples * Caching of results to accelerate Convolution unit tests; can be enabled or disabled by running cmake .. -DCUTLASS_TEST_ENABLE_CACHED_RESULTS=OFF * Corrections and bug fixes reported by the CUTLASS community. Thank you for filing these issues! authored-by: Haicheng Wu <haichengw@nvidia.com>, Manish Gupta <manigupta@nvidia.com>, Dustyn Blasig <dblasig@nvidia.com>, Andrew Kerr <akerr@nvidia.com>	2021-09-20 11:02:22 -07:00
Haicheng Wu	59e2aa505a	refine the implementation	2021-09-08 13:14:08 +00:00
Haicheng Wu	4e8af93da1	Merge remote-tracking branch 'origin/master' into small_alignment	2021-09-07 20:39:38 +00:00
Manish Gupta	6c2f8f2fb8	CUTLASS 2.6.1 - functional and performance enhancements to strided DGRAD, fixes, and tuning * cutlass 2.6 update * remove debug prints * cutlass 2.6.1 (minor update) * Updated CHANGELOG. * Minor edit to readme to indicate patch version. * Minor edit to readme. Co-authored-by: Haicheng Wu <haichengw@nvidia.com>, Andrew Kerr <akerr@nvidia.com>	2021-09-03 10:26:15 -07:00
Haicheng Wu	598e35401c	Merge remote-tracking branch 'origin/master' into small_alignment	2021-08-16 07:49:08 -07:00
Manish Gupta	1ac4559d12	Cutlass 2.6 Update 1 (#301 ) * cutlass 2.6 update * remove debug prints	2021-07-27 17:58:30 -07:00
Manish Gupta	e5d51840e8	CUTLASS 2.6 (#298 ) CUTLASS 2.6	2021-07-23 00:40:53 -04:00
mengchi.hmc	f4b0a33633	add unit test for non int4 load	2021-04-23 14:33:46 +08:00
KeDengMS	83036ed646	More clean up	2021-04-18 04:29:20 +00:00
KeDengMS	b7e43f5eb9	Clean up	2021-04-18 04:24:25 +00:00
KeDengMS	5c62d892fa	Add test	2021-04-18 04:09:34 +00:00
Manish Gupta	4cd004ead1	fix test name to optimized and instance large tile sizes to speed unit tests	2021-03-05 13:32:36 -08:00
Peter Han	6c4539e372	Make arch tag of test cases more precisely to SM60 Signed-off-by: Peter Han <fujun.han@iluvatar.ai>	2021-03-05 10:53:26 +08:00
Peter Han	a3639ab1a0	Append fp16 test case to verify Mma_HFMA2 Signed-off-by: Peter Han <fujun.han@iluvatar.ai>	2021-03-04 18:17:57 +08:00
Andrew Kerr	200a5a5146	Enabled reduction unit tests.	2021-02-26 15:46:57 -05:00
Andrew Kerr	746b7b3247	Enabled tensor reduction kernels.	2021-02-26 15:32:19 -05:00
Andrew Kerr	0e13748649	CUTLASS 2.5	2021-02-26 09:58:26 -05:00
Manish Gupta	6615010cd0	CUTLASS 2.4 (Implicit GEMM convolution) (#147 ) CUTLASS 2.4 (Implicit GEMM Convolution) Co-authored-by: Manish Gupta <manigupta@nvidia.com>, Haicheng Wu <haichengw@nvidia.com>, Dustyn Blasig <dblasig@nvidia.com>, Andrew Kerr <akerr@nvidia.com>	2020-11-19 21:25:25 -08:00
Andrew Kerr	c53f3339bb	CUTLASS 2.3 initial commit (#134 ) CUTLASS 2.3 adds GEMMs targeting Sparse Tensor Cores on the NVIDIA Ampere Architecture, fast SGEMM, and small matrix classes, bug fixes, and performance enhancements.	2020-09-23 14:00:58 -07:00
Andrew Kerr	86931fef85	CUTLASS 2.2 (#96 ) Adds support for NVIDIA Ampere Architecture features. CUDA 11 Toolkit recommended.	2020-06-08 16:17:35 -07:00
Andrew Kerr	96dab34ad9	CUTLASS 2.1 (#83 ) CUTLASS 2.1 contributes: - BLAS-style host-side API added to CUTLASS Library - Planar Complex GEMM kernels targeting Volta and Turing Tensor Cores - Minor enhancements and bug fixes	2020-04-07 13:51:25 -07:00
Andrew Kerr	fb335f6a5f	CUTLASS 2.0 (#62 ) CUTLASS 2.0 Substantially refactored for - Better performance, particularly for native Turing Tensor Cores - Robust and durable templates spanning the design space - Encapsulated functionality embodying modern C++11 programming techniques - Optimized containers and data types for efficient, generic, portable device code Updates to: - Quick start guide - Documentation - Utilities - CUTLASS Profiler Native Turing Tensor Cores - Efficient GEMM kernels targeting Turing Tensor Cores - Mixed-precision floating point, 8-bit integer, 4-bit integer, and binarized operands Coverage of existing CUTLASS functionality: - GEMM kernels targeting CUDA and Tensor Cores in NVIDIA GPUs - Volta Tensor Cores through native mma.sync and through WMMA API - Optimizations such as parallel reductions, threadblock rasterization, and intra-threadblock reductions - Batched GEMM operations - Complex-valued GEMMs Note: this commit and all that follow require a host compiler supporting C++11 or greater.	2019-11-19 16:55:34 -08:00

39 Commits