cutlass

Author	SHA1	Message	Date
Yujia Zhai	04a9777b87	Softmax (#546 ) * add test layernorm g-mem version * Delete include/configure directory * Delete examples/test_layernorm directory * Update gemm_with_softmax.h * Update gemm_softmax.cu * Update linear_combination.h * Update fast_math.h * remove redundant vars Co-authored-by: yujia.zhai <yujia.zhai@bytedance.com> Co-authored-by: yuzhai <yuzhai@nvidia.com>	2022-07-02 01:19:18 -04:00
Haicheng Wu	e45e773436	Update linear_combination_generic.h (#472 ) add `skip_elementwise_` to support serial splitk in `linear_combination_generic.h`	2022-06-28 07:29:38 -04:00
Haicheng Wu	9ab9110168	add leaky relu (#542 ) Authored-by: Haicheng Wu <haichengw@nvidia.com>	2022-06-26 10:07:50 -04:00
Jack Kosaian	fa56763c25	Fix occupancy calculation for grouped GEMM (#532 )	2022-06-18 19:53:59 -04:00
LiuWei	25e26a6e51	fix bugs in linear_combination_generic.h missing include cutlass/epilogue/thread/scale_type.h (#531 )	2022-06-17 23:35:14 -04:00
Pei Sun	dceefe4f64	Increment stride correctly in warp iterator. (#516 ) Co-authored-by: peisun1115 <peis@google.com>	2022-06-06 12:33:36 -04:00
Pei Sun	c3881d097e	Fix a comment about LDSM layout. (#514 ) Co-authored-by: peisun1115 <peis@google.com>	2022-06-04 23:04:00 -04:00
Pei Sun	a29dfb1c63	Fix a bug to increment stride tile correctly (#503 ) * Fix a bug to increment stride tile correctly * Update regular_tile_access_iterator_tensor_op.h Co-authored-by: peisun1115 <peis@google.com> Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>	2022-06-03 22:54:52 -04:00
Mike Iovine	c4cf0dad82	Fix init-self compiler warnings (#493 ) Fix a few warnings caused by trying to initialize a class member with itself. These warnings can turn into errors if you compile with `-Winit-self`.	2022-05-11 00:35:28 -04:00
TonyZhao	ddd8f9cf41	update float < int32_t * 4 (#488 ) Co-authored-by: 赵俊涛 <zhaojuntao@zhaojuntaos-MacBook-Pro.local>	2022-05-04 13:36:05 -04:00
Haicheng Wu	ec2b4fd85d	b2b bias vector support (#482 ) * b2b bias vector support * add files Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2022-04-30 04:16:15 -07:00
Stepan Tezyunichev	86ce09aed1	2.9 fixes for nvrtc (#480 ) * Use platform::is_same instead of std::is_same * Don't hide cuComplex include from nvrtc * Typo fixed * Remove comment rename	2022-04-29 09:06:52 -04:00
Janusz Lisiecki	8c339ac039	Fix compilation in clang (#478 ) - adds missing commas - adjusts misaligned usage of CUTLASS_DEVICE between template declaration and specializations Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>	2022-04-28 14:22:06 -04:00
Haicheng Wu	e49f690fd7	Update linear_combination_generic.h	2022-04-28 14:04:53 -04:00
Stepan Tezyunichev	71def2f084	Use platform:: instead of std::abs and std::conditional (#452 ) * Fixed template struct/class mismatch * Use platform implementation instead of std::abs and std::conditional during nvrtc compilation * Use platform implementation instead of std::abs and std::conditional during nvrtc compilation * Revert absolute_value() usage	2022-04-25 14:40:22 -04:00
Fujun Han	dd77fadc70	Remove redundant offset def and init in shared_load_iterator.h (#456 ) Signed-off-by: Fujun Han <fujun.han@iluvatar.ai>	2022-04-24 16:31:00 -04:00
Stepan Tezyunichev	be4578d517	Fixed template struct/class mismatch (#453 )	2022-04-24 16:30:21 -04:00
Andrew Kerr	12f4108ac2	CUTLASS 2.9 (#468 )	2022-04-23 15:02:38 -04:00
Feng Shijie	dd571f0edb	[style] fix code indentation (#449 ) * [docs] fix typo in media/docs/layout.md * [docs] fix comment error * fix typo in include/cutlass/arch/simd_61.h * fix stride comment errors in TensorLayout * fix indentation	2022-04-03 21:13:17 -04:00
Haojin Yang	bc45e2c023	fixed datatype error of numeric_limit for uint1b_t (#419 ) Co-authored-by: Haojin Yang <haojin.yang@.hpi.uni-potsdam.de>	2022-03-22 12:30:30 -04:00
Janusz Lisiecki	8f1fe7a132	Fix separate compilation `-dc` (#433 ) * Fix separate compilation `-dc` - when cutlass is included in multiple compilation units compiled with `-dc` OOB_NAN_F16x8 device constant is instantiated multiple times causing Multiple definition of '_ZN7cutlass4arch13OOB_NAN_F16x8E' error This PR makes this variable a local constant as it is not modified during runtime Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com> * Fix Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com> * Test GH Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com> * Revert test GH Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>	2022-03-22 12:21:18 -04:00
Feng Shijie	cd39c75e25	Fix typo in docs, code comments (#429 ) * [docs] fix typo in media/docs/layout.md * [docs] fix comment error * fix typo in include/cutlass/arch/simd_61.h * fix stride comment errors in TensorLayout	2022-03-15 21:54:36 -04:00
HouQiming	96a11a1ef3	Removed trivial copy constructors on parameter classes to enable devi… (#366 ) * Removed trivial copy constructors on parameter classes to enable device-side launch of CUTLASS kernels * Added SFINAE to the `TensorRef(NonConstTensorRef const&)` constructor to avoid making it a copy-constructor for device code * std => platform * fix affine2 * really fix affine2 Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2022-02-28 21:34:02 -05:00
Ivan Komarov	e96f00586c	Make cutlass::gemm::device::GemmArray usable (#295 ) * Fix the build of cutlass/gemm/device/gemm_array.h and add a demo for GemmArray * Add a reference to GemmArray to the docs Co-authored-by: Ivan Komarov <dfyz@yandex-team.ru>	2022-02-17 20:01:05 -05:00
Jongsoo Park	1db6971a8d	Remove unused gemm_k_iterations in GemmKernel::Params (#406 ) Otherwise we get gemm_k_iterations is uninitialized warnings.	2022-02-16 09:52:45 -05:00
Bing Xu	d0d941efc7	[hardswish] correct implmentation (#403 ) * [hardswish] correct implmentation * seems working * hardswish fp32/fp16x2 optimization * [relu] half2 support * add relu0; add multiply_add_relu0; * cleanup Co-authored-by: Bing Xu <bingxu@fb.com> Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2022-02-09 14:28:53 -05:00
Andrew Kerr	8a951b2940	Enable convolution with fused epilogue for Volta Tensor Cores (#402 ) * Enabled convolution with epilogue fusion for Volta Tensor Cores. * Compilation fixes * Disabled testing Volta on Ampere architectures.	2022-01-30 23:24:50 -05:00
masahi	c2ee13a0fe	Add epilogue functor for residual block fusion (#391 ) * Add epilogue functor for residual block fusion * Do not run split-k tests when ActivationOp is not Identity * explain TestSplitK param * return early	2021-12-29 22:53:40 -05:00
masahi	dceabd4c5a	Support half precision sigmoid activation (#378 ) * Support half precision sigmoid activation * introduce a vectorized variant using fast_tanh * move the math to fast_math.h * fixed compile * .raw() -> .to_half() Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2021-12-22 14:45:06 -05:00
Andrew Kerr	288af365db	Added missing synchronization to avoid WAR hazards between tiles. (#386 )	2021-12-20 08:34:08 -08:00
masahi	0dc3ba60b3	Refactor GELU and Sigmoid epilogue to use a common template (and add SiLu, Hardswish epilogue) (#379 ) * Support half precision sigmoid activation * introduce a vectorized variant using fast_tanh * refactored sigmoid using the new interface * refactored gelu * add silu activation * add hardswish * remove sigmoid for now * add description to silu and hardswish, and other doc update * Do not ignore Round * use constant N * Set isHeavy = true in sigmoid and silu epilogue	2021-12-18 14:58:15 -05:00
Andrew Kerr	ec4f7e5194	Updates to fused epilogue (#383 ) * Enhancements and fixes to fused GEMM and Convolution epilogue. * Need to explicitly list cudart as unit test library dependency.	2021-12-17 16:04:43 -05:00
Manish Gupta	808c25337a	CUTLASS 2.8 (#363 ) CUTLASS 2.8	2021-11-19 13:26:35 -08:00
reed-lau	3b28642801	fix wmma shape typo	2021-09-28 19:04:09 +08:00
Manish Gupta	2e07c4cc2f	CUTLASS 2.7 (#318 ) CUTLASS 2.7 * Mainloop fusion for GEMM: summation over A or B * Strided DGRAD (optimized iterators) * Half-precision GELU_taylor activation functions (use these when accumulation and epilogue compute types are all cutlass::half_t) * Tuning and bug fixes to fused GEMM + GEMM example * Support for smaller than 128b aligned Convolutions: see examples * Caching of results to accelerate Convolution unit tests (can be enabled or disabled by running cmake .. -DCUTLASS_TEST_ENABLE_CACHED_RESULTS=OFF) * Corrections and bug fixes reported by the CUTLASS community (thank you for filing these issues!) Authored-by: Haicheng Wu <haichengw@nvidia.com>, Manish Gupta <manigupta@nvidia.com>, Dustyn Blasig <dblasig@nvidia.com>, Andrew Kerr <akerr@nvidia.com>	2021-09-20 11:02:22 -07:00
Haicheng Wu	59e2aa505a	refine the implementation	2021-09-08 13:14:08 +00:00
Haicheng Wu	4e8af93da1	Merge remote-tracking branch 'origin/master' into small_alignment	2021-09-07 20:39:38 +00:00
Manish Gupta	6c2f8f2fb8	CUTLASS 2.6.1 - functional and performance enhancements to strided DGRAD, fixes, and tuning * cutlass 2.6 update * remove debug prints * cutlass 2.6.1 (minor update) * Updated CHANGELOG. * Minor edit to readme to indicate patch version. * Minor edit to readme. Co-authored-by: Haicheng Wu <haichengw@nvidia.com>, Andrew Kerr <akerr@nvidia.com>	2021-09-03 10:26:15 -07:00
Haicheng Wu	598e35401c	Merge remote-tracking branch 'origin/master' into small_alignment	2021-08-16 07:49:08 -07:00
Haicheng Wu	a77c658439	fix epilogue register spill	2021-07-29 14:25:48 -07:00
Manish Gupta	1ac4559d12	Cutlass 2.6 Update 1 (#301 ) * cutlass 2.6 update * remove debug prints	2021-07-27 17:58:30 -07:00
Manish Gupta	e5d51840e8	CUTLASS 2.6 (#298 ) CUTLASS 2.6	2021-07-23 00:40:53 -04:00
Tian Jin	e3c56b0d6b	Update predicated_tile_iterator.h	2021-07-05 12:11:53 -04:00
Tian Jin	4647c57243	Update predicated_tile_iterator.h	2021-07-05 12:06:41 -04:00
Manikandan Ananth	47ebfccbec	bug fixes	2021-06-02 10:08:25 -07:00
Manikandan Ananth	da2f110906	Fixes for public issue #265	2021-05-19 10:16:52 -07:00
Zheng Zeng	a68d7cd6f1	Adds `NoBetaScaling` for `LinearCombination`	2021-05-12 22:23:55 +08:00
Haicheng Wu	f58b843951	Merge pull request #239 from KeDengMS/kedeng/gelu Fixes to Gelu for half and fusion	2021-05-08 12:51:42 -04:00
Haicheng Wu	5fc142296f	Merge pull request #237 from Peter9606/issue_236_typo Typo fix issue#236	2021-05-08 07:51:19 -04:00
Haicheng Wu	233d69aa6d	Merge pull request #235 from Peter9606/issue_233_tranpose_update tranpose.h update based on issue#233	2021-05-07 07:14:30 -04:00
mengchi.hmc	f4b0a33633	add unit test for non int4 load	2021-04-23 14:33:46 +08:00
mengchi.hmc	bb35a3ba6f	support setting load granularity for conv2d fprop	2021-04-22 15:20:57 +08:00
mengchi.hmc	7ec3a87f22	support unalignment input for conv2d fprop stage=2 Fix for issue #242	2021-04-21 14:40:05 +08:00
KeDengMS	0b74c8f473	Address CR	2021-04-19 23:36:06 +00:00
KeDengMS	83036ed646	More clean up	2021-04-18 04:29:20 +00:00
KeDengMS	41a31b404b	Fixes to Gelu for half and fusion	2021-04-17 22:10:19 +00:00
Peter Han	7320aee17d	Typo fix issue#236 Signed-off-by: Peter Han <fujun.han@iluvatar.ai>	2021-04-15 15:08:35 +08:00
Peter Han	2142a05d9d	tranpose.h update based on issue#233 1. Add 'pragma once' preprocess directive 2. Replace prmt PTX with __byte_perm intrinsic Signed-off-by: Peter Han <fujun.han@iluvatar.ai>	2021-04-14 19:58:00 +08:00
Manikandan Ananth	08993707da	fixing functional bug in fused epilogue	2021-04-09 11:36:03 -07:00
Manikandan Ananth	4839b6cb61	add 2stage fprop 3d into default file	2021-04-07 13:29:32 -07:00
Haicheng Wu	d97214987a	Merge pull request #220 from Peter9606/wrong-stride-array-definition Bugfix: typo, make reduction device cases passed	2021-04-02 08:43:52 -04:00
Peter Han	7074047a54	Bugfix: typo, make reduction device cases passed Signed-off-by: Peter Han <fujun.han@iluvatar.ai>	2021-04-02 09:35:23 +08:00
Manikandan Ananth	75a4737cfe	Fix for public issue #211 - Add a slice-K tile size to the profiler - fix num warps calculations in implicit gemm header	2021-04-01 14:42:00 -07:00
Peter Han	6a6b4028bd	Revert wrong fix of params.update in GemmUniversalBase Signed-off-by: Peter Han <fujun.han@iluvatar.ai>	2021-03-23 23:20:40 +08:00
Peter Han	92393b2676	Bugfix: memsetAsync uses wrong default stream Signed-off-by: Peter Han <fujun.han@iluvatar.ai>	2021-03-23 21:11:42 +08:00
Peter Han	169181f30f	Make Shape public from Mma_HFMA2. Signed-off-by: Peter Han <fujun.han@iluvatar.ai>	2021-03-04 11:05:16 +08:00
Andrew Kerr	746b7b3247	Enabled tensor reduction kernels.	2021-02-26 15:32:19 -05:00
Andrew Kerr	0e13748649	CUTLASS 2.5	2021-02-26 09:58:26 -05:00
Manish Gupta	6615010cd0	CUTLASS 2.4 (Implicit GEMM convolution) (#147 ) CUTLASS 2.4 (Implicit GEMM Convolution) Co-authored-by: Manish Gupta <manigupta@nvidia.com>, Haicheng Wu <haichengw@nvidia.com>, Dustyn Blasig <dblasig@nvidia.com>, Andrew Kerr <akerr@nvidia.com>	2020-11-19 21:25:25 -08:00
akerr	37a8f9e598	CUTLASS 2.3.0 final.	2020-09-25 10:34:46 -07:00
Andrew Kerr	c53f3339bb	CUTLASS 2.3 initial commit (#134 ) CUTLASS 2.3 adds GEMMs targeting Sparse Tensor Cores on the NVIDIA Ampere Architecture, fast SGEMM, and small matrix classes, bug fixes, and performance enhancements.	2020-09-23 14:00:58 -07:00
hwu36	4dac7490e6	Typoes (#107 ) * Update splitk_gemm.cu * Update gemm_bias_relu.cu * Update mma_sm75.h	2020-07-13 14:25:52 -07:00
Andrew Kerr	1ab1027954	Updated mma_sm80.h to avoid perf penalty due to reinterpret_cast<>. (#100 ) - Updated mma_sm80.h to avoid perf penalty due to reinterpret_cast<>. - Enhancement to CUTLASS Utility Library's HostTensorPlanarComplex template to support copy-in and copy-out - Added test_examples target to build and test all CUTLASS examples - Minor edits to documentation to point to GTC 2020 webinar	2020-06-15 10:47:01 -07:00
Andrew Kerr	86931fef85	CUTLASS 2.2 (#96 ) Adds support for NVIDIA Ampere Architecture features. CUDA 11 Toolkit recommended.	2020-06-08 16:17:35 -07:00
Andrew Kerr	96dab34ad9	CUTLASS 2.1 (#83 ) CUTLASS 2.1 contributes: - BLAS-style host-side API added to CUTLASS Library - Planar Complex GEMM kernels targeting Volta and Turing Tensor Cores - Minor enhancements and bug fixes	2020-04-07 13:51:25 -07:00
Andrew Kerr	45ecbc885b	Removed redundant conjugation operations from matrix_traits. (#65 )	2019-11-20 11:27:13 -08:00
Andrew Kerr	fb335f6a5f	CUTLASS 2.0 (#62 ) CUTLASS 2.0 Substantially refactored for - Better performance, particularly for native Turing Tensor Cores - Robust and durable templates spanning the design space - Encapsulated functionality embodying modern C++11 programming techniques - Optimized containers and data types for efficient, generic, portable device code Updates to: - Quick start guide - Documentation - Utilities - CUTLASS Profiler Native Turing Tensor Cores - Efficient GEMM kernels targeting Turing Tensor Cores - Mixed-precision floating point, 8-bit integer, 4-bit integer, and binarized operands Coverage of existing CUTLASS functionality: - GEMM kernels targeting CUDA and Tensor Cores in NVIDIA GPUs - Volta Tensor Cores through native mma.sync and through WMMA API - Optimizations such as parallel reductions, threadblock rasterization, and intra-threadblock reductions - Batched GEMM operations - Complex-valued GEMMs Note: this commit and all that follow require a host compiler supporting C++11 or greater.	2019-11-19 16:55:34 -08:00

177 Commits