cutlass

Author	SHA1	Message	Date
Yang Chen	095cbba57c	Example 23 - Passing correct alpha and beta values with --parallel-split-k (#424 ) When split-k is enabled, we should set alpha to 1 and beta to 0 for the split-k gemm kernel. The fix was from hwu36. I only fixed some minor typos along with his fix.	2022-03-22 12:27:34 -04:00
Janusz Lisiecki	8f1fe7a132	Fix separate compilation `-dc` (#433 ) * Fix separate compilation `-dc` - when cutlass is included in multiple compilation units compiled with `-dc` OOB_NAN_F16x8 device constant is instantiated multiple times causing Multiple definition of '_ZN7cutlass4arch13OOB_NAN_F16x8E' error This PR makes this variable a local constant as it is not modified during runtime Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com> * Fix Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com> * Test GH Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com> * Revert test GH Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>	2022-03-22 12:21:18 -04:00
Yuanqiang Liu	3ab1eacf09	Fix typo in profiler examples (#437 )	2022-03-21 12:00:13 -04:00
Feng Shijie	cd39c75e25	Fix typo in docs, code comments (#429 ) * [docs] fix typo in media/docs/layout.md * [docs] fix comment error * fix typo in include/cutlass/arch/simd_61.h * fix stride comment errors in TensorLayout	2022-03-15 21:54:36 -04:00
Haicheng Wu	b2e1e97cb1	Update PUBLICATIONS.md ACM Trans on Graphics from nv research.	2022-03-01 22:37:18 -05:00
HouQiming	96a11a1ef3	Removed trivial copy constructors on parameter classes to enable devi… (#366 ) * Removed trivial copy constructors on parameter classes to enable device-side launch of CUTLASS kernels * Added SFINAE to the `TensorRef(NonConstTensorRef const&)` constructor to avoid making it a copy-constructor for device code * std => platform * fix affine2 * really fix affine2 Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2022-02-28 21:34:02 -05:00
Ivan Komarov	e96f00586c	Make cutlass::gemm::device::GemmArray usable (#295 ) * Fix the build of cutlass/gemm/device/gemm_array.h and add a demo for GemmArray * Add a reference to GemmArray to the docs Co-authored-by: Ivan Komarov <dfyz@yandex-team.ru>	2022-02-17 20:01:05 -05:00
Jongsoo Park	3cfa5db2a2	Actually use float accumulation in gemm_f16t_f16t_f16t_wmma_tensor_op… (#407 ) * Actually use float accumulation in gemm_f16t_f16t_f16t_wmma_tensor_op_f32_sm70.cu As title * Update gemm_f16t_f16t_f16t_wmma_tensor_op_f32_sm70.cu change the missing one Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>	2022-02-16 09:53:21 -05:00
Jongsoo Park	1db6971a8d	Remove unused gemm_k_iterations in GemmKernel::Params (#406 ) Otherwise we get gemm_k_iterations is uninitialized warnings.	2022-02-16 09:52:45 -05:00
Haicheng Wu	b954127297	Update PUBLICATIONS.md @jackkosaian	2022-02-14 16:54:32 -05:00
Bing Xu	d0d941efc7	[hardswish] correct implementation (#403 ) * [hardswish] correct implementation * seems working * hardswish fp32/fp16x2 optimization * [relu] half2 support * add relu0; add multiply_add_relu0; * cleanup Co-authored-by: Bing Xu <bingxu@fb.com> Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2022-02-09 14:28:53 -05:00
Andrew Kerr	8a951b2940	Enable convolution with fused epilogue for Volta Tensor Cores (#402 ) * Enabled convolution with epilogue fusion for Volta Tensor Cores. * Compilation fixes * Disabled testing Volta on Ampere architectures.	2022-01-30 23:24:50 -05:00
Fujun Han	1e4703cbab	Support parallel split K mode for profiling (#277 ) * Support parallel split K mode for profiling Signed-off-by: Peter Han <fujun.han@iluvatar.ai> * Parallel Split K support 1. find gemm kernel by preference key 2. switch m n for reduction kernel Signed-off-by: Peter Han <fujun.han@iluvatar.ai> * parallel splitk for fp16 gemm * add one missing file Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2022-01-27 10:37:37 -05:00
Dustyn Blasig	c3353add63	Merge pull request #388 from depaulmillz/fix/headersonly Fix utils include not being installed in header only	2022-01-26 14:22:51 -06:00
dePaul Miller	ac8825b941	Minor fix to change from LIBRARY_INIT to LIBRARY	2022-01-26 15:17:46 -05:00
Haicheng Wu	8fd94806e5	Update PUBLICATIONS.md add mlsys 2022 paper.	2022-01-17 00:08:18 -05:00
Masahiro Masuda	d7c9cbf0b9	Fix typo in scripts/library.py (wrong data size for u8) (#393 )	2022-01-07 13:29:56 -05:00
masahi	c2ee13a0fe	Add epilogue functor for residual block fusion (#391 ) * Add epilogue functor for residual block fusion * Do not run split-k tests when ActivationOp is not Identity * explain TestSplitK param * return early	2021-12-29 22:53:40 -05:00
Haicheng Wu	f78994bb40	add the missing pieces (#392 ) Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2021-12-25 04:29:54 -08:00
masahi	dceabd4c5a	Support half precision sigmoid activation (#378 ) * Support half precision sigmoid activation * introduce a vectorized variant using fast_tanh * move the math to fast_math.h * fixed compile * .raw() -> .to_half() Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2021-12-22 14:45:06 -05:00
dePaul Miller	86fa1dc30b	Fix utils include not being installed in header only	2021-12-21 12:10:26 -05:00
Andrew Kerr	288af365db	Added missing synchronization to avoid WAR hazards between tiles. (#386 )	2021-12-20 08:34:08 -08:00
masahi	0dc3ba60b3	Refactor GELU and Sigmoid epilogue to use a common template (and add SiLu, Hardswish epilogue) (#379 ) * Support half precision sigmoid activation * introduce a vectorized variant using fast_tanh * refactored sigmoid using the new interface * refactored gelu * add silu activation * add hardswish * remove sigmoid for now * add description to silu and hardswish, and other doc update * Do not ignore Round * use constant N * Set isHeavy = true in sigmoid and silu epilogue	2021-12-18 14:58:15 -05:00
Andrew Kerr	ec4f7e5194	Updates to fused epilogue (#383 ) * Enhancements and fixes to fused GEMM and Convolution epilogue. * Need to explicitly list cudart as unit test library dependency.	2021-12-17 16:04:43 -05:00
Andrew Kerr	4e666e1dfd	Updated README and added issue templates. (#382 )	2021-12-17 09:26:20 -05:00
Haicheng Wu	3799e12f25	Merge pull request #381 from Peter9606/update-makefile-version Update project version to 2.8.0 in CMakeLists.txt	2021-12-16 21:54:57 -05:00
Peter Han	fc3bc85db8	Update project version to 2.8.0 in CMakeLists.txt Signed-off-by: Peter Han <fujun.han@iluvatar.ai>	2021-12-17 02:23:31 +00:00
Matthew Nicely	49c0a58d50	Set theme jekyll-theme-minimal	2021-12-15 14:51:24 -05:00
Andrew Kerr	5fe09c2d67	Updated GEMM performance plot with CUTLASS 2.8 compiled with CUDA 11.5 Toolkit (#375 ) Updated GEMM performance plot with CUTLASS 2.8 compiled using CUDA 11.5 Toolkit. GPUs under test: NVIDIA A100 NVIDIA A2 NVIDIA TitanV NVIDIA GeForce 2080 Ti	2021-12-06 14:21:33 -05:00
Andrew Kerr	6b69c79ac3	Fixed contributor formatting. (#365 )	2021-11-22 11:30:53 -08:00
Andrew Kerr	62e438f450	Listed Matthew Nicely as the CUTLASS product manager. (#364 )	2021-11-19 17:51:21 -08:00
Manish Gupta	808c25337a	CUTLASS 2.8 (#363 ) CUTLASS 2.8	2021-11-19 13:26:35 -08:00
Haicheng Wu	6fc5008803	Update quickstart.md fix a broken link	2021-11-11 09:53:46 -05:00
Haicheng Wu	a3bcc6981d	Merge pull request #331 from reed-lau/feature/fix-wmma-shape-typo fix wmma shape typo	2021-09-28 10:20:29 -04:00
reed-lau	3b28642801	fix wmma shape typo	2021-09-28 19:04:09 +08:00
Manish Gupta	538592dea4	example 23 gemm operand reduction fusion (#325 )	2021-09-20 13:34:47 -07:00
Manish Gupta	2e07c4cc2f	CUTLASS 2.7 (#318 ) CUTLASS 2.7 Mainloop fusion for GEMM: summation over A or B Strided DGRAD (optimized iterators) Half-precision GELU_taylor activation functions Use these when accumulation and epilogue compute types are all cutlass::half_t Tuning and bug fixes to fused GEMM + GEMM example Support for smaller than 128b aligned Convolutions: see examples Caching of results to accelerate Convolution unit tests Can be enabled or disabled by running cmake .. -DCUTLASS_TEST_ENABLE_CACHED_RESULTS=OFF Corrections and bug fixes reported by the CUTLASS community Thank you for filing these issues! authored-by: Haicheng Wu haichengw@nvidia.com, Manish Gupta manigupta@nvidia.com, Dustyn Blasig dblasig@nvidia.com, Andrew Kerr akerr@nvidia.com	2021-09-20 11:02:22 -07:00
Haicheng Wu	9ac255863f	Merge pull request #246 from mengchihe/master support unalignment input for conv2d fprop stage=2 Fix for issue #242	2021-09-08 11:40:53 -04:00
Haicheng Wu	59e2aa505a	refine the implementation	2021-09-08 13:14:08 +00:00
Haicheng Wu	4e8af93da1	Merge remote-tracking branch 'origin/master' into small_alignment	2021-09-07 20:39:38 +00:00
Manish Gupta	6c2f8f2fb8	CUTLASS 2.6.1 - functional and performance enhancements to strided DGRAD, fixes, and tuning * cutlass 2.6 update * remove debug prints * cutlass 2.6.1 (minor update) * Updated CHANGELOG. * Minor edit to readme to indicate patch version. * Minor edit to readme. Co-authored-by: Haicheng Wu <haichengw@nvidia.com>, Andrew Kerr <akerr@nvidia.com>	2021-09-03 10:26:15 -07:00
Haicheng Wu	598e35401c	Merge remote-tracking branch 'origin/master' into small_alignment	2021-08-16 07:49:08 -07:00
Manish Gupta	a01feb93d9	Merge pull request #308 from dongxiao92/patch-1 fix typo in doc	2021-08-08 11:54:42 -07:00
dongxiao	d36f331b44	fix typo in doc fix typo	2021-08-08 16:44:22 +08:00
Haicheng Wu	69abafb85a	Merge pull request #306 from NVIDIA/fix-profiler-cmd-doc Fix profiler cmd doc	2021-07-30 14:36:54 -04:00
Haicheng Wu	68a078fbbf	cleanup	2021-07-30 11:27:21 -07:00
Haicheng Wu	10709dbb64	clean profiler cmd and doc	2021-07-30 11:02:17 -07:00
Manish Gupta	1227351079	Merge pull request #305 from NVIDIA/fix_epilogue_spill fix epilogue register spill	2021-07-29 14:30:11 -07:00
Haicheng Wu	a77c658439	fix epilogue register spill	2021-07-29 14:25:48 -07:00
Haicheng Wu	4516b833ce	Merge pull request #303 from Peter9606/doc_typo Doc typo	2021-07-28 20:49:06 -04:00

203 Commits