cutlass

Author	SHA1	Message	Date
Manish Gupta	660a05f581	fix split_k_mode and add reduction kernel for f16 input/accum/output (#896 )	2023-03-30 15:31:08 -04:00
Vijay Thakkar	15d9d31f1f	CUTLASS 3.0 Hopper GEMMs are GETTs in disguise (#897 )	2023-03-29 10:42:40 -04:00
Alexander Pivovarov	7e370c9637	Fix typos 2 (#842 ) Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>	2023-03-09 23:22:56 -05:00
ANIKET SHIVAM	c4f6b8c6bc	Updates for 3.0 (#857 ) Co-authored-by: Aniket Shivam <ashivam@nvidia.com>	2023-03-09 15:27:40 -05:00
Yinghai Lu	a68e2f95f0	Reduce verbosity in manifest.py (#845 )	2023-03-07 11:53:01 -05:00
Haicheng Wu	65688c2a87	streamk fix (#836 ) Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2023-02-23 16:35:08 -05:00
Shuai Shao	9cdbe33570	Add fixed_channel and few_channel mode to int8 in generator (#829 )	2023-02-21 21:15:39 -05:00
Vijay Thakkar	277bd6e537	CUTLASS 3.0.0 (#786 ) * CUTLASS 3.0.0	2023-01-23 20:55:28 -05:00
ANIKET SHIVAM	66d9cddc83	New updates for 2.11 (#775 ) * New updates. * Minor profiler updates Co-authored-by: Aniket Shivam <ashivam@nvidia.com>	2023-01-20 16:32:57 -05:00
Haicheng Wu	764b840d6f	streamk example and performance tuning (#760 ) * streamk example and performance tuning * one missing file Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2023-01-10 16:10:02 -05:00
Jack Kosaian	df81d847d7	Make Python interface work for non-SM80 targets (#726 ) * Make Python interface work for non-SM80 targets * Remove line in README	2022-12-07 21:53:33 -05:00
Haicheng Wu	9f1f37aa21	misc (#719 ) * misc * minor Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2022-12-05 12:07:20 -05:00
Aditya Atluri	c975e2ccbb	release 2.11 (#703 )	2022-11-19 09:02:15 -05:00
seventh	168ea8b0e1	ensure singleton::get thread safe construct instance (#658 ) * ensure singleton::get thread safe construct instance * fix singleton return reference Co-authored-by: xuweiqi <xuweiqi117@gmail.com>	2022-11-08 21:44:32 -05:00
Jack Kosaian	8c1bf9b784	Bump CUTLASS Python container version (#672 ) * Update example 40 README * Update CUTLASS Python README	2022-10-22 21:09:39 -04:00
Alexander Freudenberg	cb539dab78	Correct typos in comments (#639 ) * Correct typos in comments Correct comments in code on type of generated distribution. Improve Gaussian RNG to take advantage of Box Muller method * Inline Box Muller Added inline function for the Box Muller algorithm and updated code comments to be more concise * Update tensor_fill.h * Update tensor_fill.h * small changes to pass tests Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2022-09-30 22:51:30 -04:00
Haicheng Wu	97bff52e8c	add two missing files (#636 ) Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2022-09-21 15:42:42 -04:00
ANIKET SHIVAM	e773429f7e	CUTLASS 2.10 updates (#622 ) Co-authored-by: Aniket Shivam <ashivam@nvidia.com>	2022-09-12 21:26:30 -04:00
ANIKET SHIVAM	b72cbf957d	CUTLASS 2.10 (#615 ) Co-authored-by: Aniket Shivam <ashivam@nvidia.com>	2022-09-03 18:48:46 -04:00
Ivan Komarov	0b8cacd6f1	Remove redundant <fstream> includes (#563 ) * Remove redundant <fstream> includes * Fix fstream in examples/ * Fix <fstream> in test/ * Use consistent order for <fstream> (always after <iostream>) * Remove an unneeded include in a file where std::ofstream usage is commented out Co-authored-by: Ivan Komarov <dfyz@yandex-team.ru>	2022-07-19 15:23:54 -04:00
Mike Iovine	c4cf0dad82	Fix init-self compiler warnings (#493 ) Fix a few errors caused by trying to initialize a class member with itself. These warnings can turn into errors if you compile with `-Winit-self`.	2022-05-11 00:35:28 -04:00
Haicheng Wu	1604ebaf10	Update generator.py stop generating analytical conv kernels to reduce kernel number	2022-05-08 21:47:15 -04:00
Haicheng Wu	ec2b4fd85d	b2b bias vector support (#482 ) * b2b bias vector support * add files Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2022-04-30 04:16:15 -07:00
Andrei Alexandrescu	d7b499deff	Fix CUDA_PERROR_EXIT and print failing expression (#446 ) `CUDA_PERROR_EXIT` can lead to incorrect usage (see e.g. [this description](https://www.cs.technion.ac.il/users/yechiel/c++-faq/macros-with-if.html)) because it contains an incomplete `if` expression. Consider: ``` if (condition) CUDA_PERROR_EXIT(cudaFree(x)) else free(x); ``` The author of the code forgot to add a semicolon after the macro. In that case, the `else` will bind to the `if` inside the macro definition, leading to code that the author did not intend or expect. If the author does use a semicolon, the code will not compile, which is awkward. The change adds a `do while` around the `if`, which always requires a semicolon. This PR also adds the text of the failing expression to the printed error message.	2022-04-24 16:29:43 -04:00
Fujun Han	4c0d6e1eb4	[BUGFIX]: Force unroll a loop that doesn't have compilation constant (#441 ) loop times is dangerous. Signed-off-by: Peter Han <fujun.han@iluvatar.ai>	2022-04-24 16:28:32 -04:00
Andrew Kerr	12f4108ac2	CUTLASS 2.9 (#468 )	2022-04-23 15:02:38 -04:00
Minmin Sun (孙敏敏)	eb0d4c9213	[library] pass pointer of arguments to get_host_workspace_size() in gemm_universal() (#412 ) Otherwise GemmUniversalOperation::get_host_workspace_size() will fail on SegmentFault.	2022-03-22 12:36:34 -04:00
Yuanqiang Liu	3ab1eacf09	Fix typo in profiler examples (#437 )	2022-03-21 12:00:13 -04:00
Fujun Han	1e4703cbab	Support parallel split K mode for profiling (#277 ) * Support parallel split K mode for profiling Signed-off-by: Peter Han <fujun.han@iluvatar.ai> * Parallel Split K support 1. find gemm kernel by preference key 2. switch m n for reduction kernel Signed-off-by: Peter Han <fujun.han@iluvatar.ai> * parallel splitk for fp16 gemm * add one missing file Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2022-01-27 10:37:37 -05:00
Masahiro Masuda	d7c9cbf0b9	Fix typo in scripts/library.py (wrong data size for u8) (#393 )	2022-01-07 13:29:56 -05:00
Haicheng Wu	f78994bb40	add the missing pieces (#392 ) Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2021-12-25 04:29:54 -08:00
Andrew Kerr	ec4f7e5194	Updates to fused epilogue (#383 ) * Enhancements and fixes to fused GEMM and Convolution epilogue. * Need to explicitly list cudart as unit test library dependency.	2021-12-17 16:04:43 -05:00
Manish Gupta	808c25337a	CUTLASS 2.8 (#363 ) CUTLASS 2.8	2021-11-19 13:26:35 -08:00
Manish Gupta	2e07c4cc2f	CUTLASS 2.7 (#318 ) CUTLASS 2.7 Mainloop fusion for GEMM: summation over A or B Strided DGRAD (optimized iterators) Half-precision GELU_taylor activation functions Use these when accumulation and epilogue compute types are all cutlass::half_t Tuning and bug fixes to fused GEMM + GEMM example Support for smaller than 128b aligned Convolutions: see examples Caching of results to accelerate Convolution unit tests Can be enabled or disabled by running cmake .. -DCUTLASS_TEST_ENABLE_CACHED_RESULTS=OFF Corrections and bug fixes reported by the CUTLASS community Thank you for filing these issues! authored-by: Haicheng Wu haichengw@nvidia.com, Manish Gupta manigupta@nvidia.com, Dustyn Blasig dblasig@nvidia.com, Andrew Kerr akerr@nvidia.com	2021-09-20 11:02:22 -07:00
Haicheng Wu	59e2aa505a	refine the implementation	2021-09-08 13:14:08 +00:00
Manish Gupta	6c2f8f2fb8	CUTLASS 2.6.1 - functional and performance enhancements to strided DGRAD, fixes, and tuning * cutlass 2.6 update * remove debug prints * cutlass 2.6.1 (minor update) * Updated CHANGELOG. * Minor edit to readme to indicate patch version. * Minor edit to readme. Co-authored-by: Haicheng Wu <haichengw@nvidia.com>, Andrew Kerr <akerr@nvidia.com>	2021-09-03 10:26:15 -07:00
Haicheng Wu	68a078fbbf	cleanup	2021-07-30 11:27:21 -07:00
Haicheng Wu	10709dbb64	clean profiler cmd and doc	2021-07-30 11:02:17 -07:00
Manish Gupta	1ac4559d12	Cutlass 2.6 Update 1 (#301 ) * cutlass 2.6 update * remove debug prints	2021-07-27 17:58:30 -07:00
Manish Gupta	e5d51840e8	CUTLASS 2.6 (#298 ) CUTLASS 2.6	2021-07-23 00:40:53 -04:00
Bernardo Covas	1d8372a8e2	fix typo in reference conv3d	2021-05-28 21:06:59 +01:00
Zheng Zeng	a8f6f8eb07	add a missing 'device_memory::' before a function	2021-04-25 20:05:39 +08:00
Manikandan Ananth	75a4737cfe	Fix for public issue #211 - Add a slice-K tile size to the profiler - fix num warps calculations in implicit gemm header	2021-04-01 14:42:00 -07:00
Haicheng Wu	34a42e5620	Update generator.py (#192 )	2021-03-02 12:21:48 -08:00
Andrew Kerr	0e13748649	CUTLASS 2.5	2021-02-26 09:58:26 -05:00
Manish Gupta	6615010cd0	CUTLASS 2.4 (Implicit GEMM convolution) (#147 ) CUTLASS 2.4 (Implicit GEMM Convolution) Co-authored-by: Manish Gupta <manigupta@nvidia.com>, Haicheng Wu <haichengw@nvidia.com>, Dustyn Blasig <dblasig@nvidia.com>, Andrew Kerr <akerr@nvidia.com>	2020-11-19 21:25:25 -08:00
Andrew Kerr	c53f3339bb	CUTLASS 2.3 initial commit (#134 ) CUTLASS 2.3 adds GEMMs targeting Sparse Tensor Cores on the NVIDIA Ampere Architecture, fast SGEMM, and small matrix classes, bug fixes, and performance enhancements.	2020-09-23 14:00:58 -07:00
Andrew Kerr	1ab1027954	Updated mma_sm80.h to avoid perf penalty due to reinterpret_cast<>. (#100 ) - Updated mma_sm80.h to avoid perf penalty due to reinterpret_cast<>. - Enhancement to CUTLASS Utility Library's HostTensorPlanarComplex template to support copy-in and copy-out - Added test_examples target to build and test all CUTLASS examples - Minor edits to documentation to point to GTC 2020 webinar	2020-06-15 10:47:01 -07:00
Andrew Kerr	86931fef85	CUTLASS 2.2 (#96 ) Adds support for NVIDIA Ampere Architecture features. CUDA 11 Toolkit recommended.	2020-06-08 16:17:35 -07:00
Vijay Thakkar	e33d90b361	update tools/library/CMakeLists to require python 3.6 according to #70 (#82 ) #70 only updates the documentation. This commit reflects this bump in python version to the CMake configuration as well.	2020-04-08 10:54:36 -07:00


66 Commits