cutlass

Author	SHA1	Message	Date
Haicheng Wu	1604ebaf10	Update generator.py stop generating analytical conv kernels to reduce kernel number	2022-05-08 21:47:15 -04:00
Haicheng Wu	6023038bae	add verification of the reduction tensor (#489 ) Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2022-05-06 10:24:51 -07:00
TonyZhao	ddd8f9cf41	update float < int32_t * 4 (#488 ) Co-authored-by: 赵俊涛 <zhaojuntao@zhaojuntaos-MacBook-Pro.local>	2022-05-04 13:36:05 -04:00
Haicheng Wu	ec2b4fd85d	b2b bias vector support (#482 ) * b2b bias vector support * add files Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2022-04-30 04:16:15 -07:00
Stepan Tezyunichev	86ce09aed1	2.9 fixes for nvrtc (#480 ) * Use platform::is_same instead of std::is_same * Don't hide cuComplex include from nvrtc * Typo fixed * Remove comment rename	2022-04-29 09:06:52 -04:00
Haicheng Wu	21c1fa3849	add .github (#479 ) Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2022-04-28 12:36:59 -07:00
Janusz Lisiecki	8c339ac039	Fix compilation in clang (#478 ) - adds missing commas - adjusts misaligned usage of CUTLASS_DEVICE between template declaration and specializations Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>	2022-04-28 14:22:06 -04:00
Haicheng Wu	e49f690fd7	Update linear_combination_generic.h	2022-04-28 14:04:53 -04:00
Haicheng Wu	96dad61a75	Update CHANGELOG.md	2022-04-28 10:52:10 -04:00
Haicheng Wu	cc2ea4c3fc	Update README.md	2022-04-28 10:50:11 -04:00
Andrew Kerr	a0de301283	Used relative paths for includes (#477 )	2022-04-27 12:04:23 -07:00
Haicheng Wu	319a389f42	Update CMakeLists.txt (#473 ) * Update CMakeLists.txt Add 128bit int support if using nvc++ to solve #310 @jeffhammond, would you please give it a try? * Update CMakeLists.txt correct copy paste error	2022-04-27 07:02:26 -07:00
Stepan Tezyunichev	71def2f084	Use platform:: instead of std::abs and std::conditional (#452 ) * Fixed template struct/class mismatch * Use platform implementation instead of std::abs and std::conditional during nvrtc compilation * Use platform implementation instead of std::abs and std::conditional during nvrtc compilation * Revert absolute_value() usage	2022-04-25 14:40:22 -04:00
Masahiro Masuda	70f3ba57f5	Fix typo in shared memory layout description (#471 )	2022-04-24 18:32:13 -04:00
Fujun Han	dd77fadc70	Remove redundant offset def and init in shared_load_iterator.h (#456 ) Signed-off-by: Fujun Han <fujun.han@iluvatar.ai>	2022-04-24 16:31:00 -04:00
Stepan Tezyunichev	be4578d517	Fixed template struct/class mismatch (#453 )	2022-04-24 16:30:21 -04:00
Andrei Alexandrescu	d7b499deff	Fix CUDA_PERROR_EXIT and print failing expression (#446 ) `CUDA_PERROR_EXIT ` can lead to incorrect usage (see e.g. [this description](https://www.cs.technion.ac.il/users/yechiel/c++-faq/macros-with-if.html)) because it contains an incomplete `if` expression. Consider: ``` if (condition) CUDA_PERROR_EXIT(cudaFree(x)) else free(x); ``` The author of the code forgot to add a semicolon after the macro. In that case, the `else` will bind to the `if` inside the macro definition, leading to code that the author did not intend or expect. It the author does use a semicolon, the code will not compile, which is awkward. The change adds a `do while` around the `if`, which always requires a semicolon. This PR also adds the text of the failing expression to the printed error message.	2022-04-24 16:29:43 -04:00
Exusial	310ed81ac3	fix description in example 12. (#444 ) Co-authored-by: Exusial <Exusial>	2022-04-24 16:29:06 -04:00
Fujun Han	4c0d6e1eb4	[BUGFIX]: Force unroll a loop that doesn't have compilation constant (#441 ) loop times is dangerous. Signed-off-by: Peter Han <fujun.han@iluvatar.ai>	2022-04-24 16:28:32 -04:00
Jack Kosaian	167ac54c65	Fix link to Python example (#469 )	2022-04-23 15:37:38 -04:00
Andrew Kerr	12f4108ac2	CUTLASS 2.9 (#468 )	2022-04-23 15:02:38 -04:00
Feng Shijie	dd571f0edb	[style] fix code indentation (#449 ) * [docs] fix typo in media/docs/layout.md * [docs] fix comment error * fix typo in include/cutlass/arch/simd_61.h * fix stride comment errors in TensorLayout * fix indentation	2022-04-03 21:13:17 -04:00
Jianyu Huang	6d0d265047	Update PUBLICATIONS.md (#447 )	2022-04-03 21:03:28 -04:00
Haicheng Wu	f11fa975a5	Update PUBLICATIONS.md @tsuki	2022-03-23 21:04:43 -04:00
Masahiro Masuda	0e71d9b450	Transposed conv2d and wgrad split k examples (#413 ) * add split k wgrad example * wgrad done * begin transposed conv2d example * update transposed conv2d example and add ref check * update doc for conv2d transpose example * add license * add wgrad doc * more clarification on GEMM output type * typo fix * clean up indent * address comments * rename example numbers to 34 and 35 * GEMM -> Implicit GEMM * Revert "rename example numbers to 34 and 35" This reverts commit 551a808c227216e9e38d4472ba8ff020557b8500. * transposed_conv2d is 34 * add compiler and device version check to exit gracefully Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2022-03-23 14:52:54 -04:00
Minmin Sun (孙敏敏)	eb0d4c9213	[library] pass pointer of arguments to get_host_workspace_size() in gemm_universal() (#412 ) Otherwise GemmUniversalOperation::get_host_workspace_size() will fail on SegmentFault.	2022-03-22 12:36:34 -04:00
Haojin Yang	bc45e2c023	fixed datatype error of numeric_limit for uint1b_t (#419 ) Co-authored-by: Haojin Yang <haojin.yang@.hpi.uni-potsdam.de>	2022-03-22 12:30:30 -04:00
Yang Chen	095cbba57c	Example 23 - Passing correct alpha and beta values with --parallel-split-k (#424 ) When split-k is enabled, we should set alpha to 1 and beta to 0 for the split-k gemm kernel. The fix was from hwu36. I only did fixed some minor typos along with his fix.	2022-03-22 12:27:34 -04:00
Janusz Lisiecki	8f1fe7a132	Fix separate compilation `-dc` (#433 ) * Fix separate compilation `-dc` - when cutlass is included in multiple compilation units compiled with `-dc` OOB_NAN_F16x8 device constant is instantiated multiple times causing Multiple definition of '_ZN7cutlass4arch13OOB_NAN_F16x8E' error This PR makes this variable a local constant as it is not modified during runtime Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com> * Fix Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com> * Test GH Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com> * Revert test GH Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>	2022-03-22 12:21:18 -04:00
Yuanqiang Liu	3ab1eacf09	Fix typo in profiler examples (#437 )	2022-03-21 12:00:13 -04:00
Feng Shijie	cd39c75e25	Fix typo in docs, code comments (#429 ) * [docs] fix typo in media/docs/layout.md * [docs] fix comment error * fix typo in include/cutlass/arch/simd_61.h * fix stride comment errors in TensorLayout	2022-03-15 21:54:36 -04:00
Haicheng Wu	b2e1e97cb1	Update PUBLICATIONS.md ACM Trans on Graphics from nv research.	2022-03-01 22:37:18 -05:00
HouQiming	96a11a1ef3	Removed trivial copy constructors on parameter classes to enable devi… (#366 ) * Removed trivial copy constructors on parameter classes to enable device-side launch of CUTLASS kernels * Added SFINAE to the `TensorRef(NonConstTensorRef const&)` constructor to avoid making it a copy-constructor for device code * std => platform * fix affine2 * really fix affine2 Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2022-02-28 21:34:02 -05:00
Ivan Komarov	e96f00586c	Make cutlass::gemm::device::GemmArray usable (#295 ) * Fix the build of cutlass/gemm/device/gemm_array.h and add a demo for GemmArray * Add a reference to GemmArray to the docs Co-authored-by: Ivan Komarov <dfyz@yandex-team.ru>	2022-02-17 20:01:05 -05:00
Jongsoo Park	3cfa5db2a2	Actually use float accumulation in gemm_f16t_f16t_f16t_wmma_tensor_op… (#407 ) * Actually use float accumulation in gemm_f16t_f16t_f16t_wmma_tensor_op_f32_sm70.cu As title * Update gemm_f16t_f16t_f16t_wmma_tensor_op_f32_sm70.cu change the missing one Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>	2022-02-16 09:53:21 -05:00
Jongsoo Park	1db6971a8d	Remove unused gemm_k_iterations in GemmKernel::Params (#406 ) Otherwise we get gemm_k_iterations is uninitialized warnings.	2022-02-16 09:52:45 -05:00
Haicheng Wu	b954127297	Update PUBLICATIONS.md @jackkosaian	2022-02-14 16:54:32 -05:00
Bing Xu	d0d941efc7	[hardswish] correct implmentation (#403 ) * [hardswish] correct implmentation * seems working * hardswish fp32/fp16x2 optimization * [relu] half2 support * add relu0; add multiply_add_relu0; * cleanup Co-authored-by: Bing Xu <bingxu@fb.com> Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2022-02-09 14:28:53 -05:00
Andrew Kerr	8a951b2940	Enable convolution with fused epilogue for Volta Tensor Cores (#402 ) * Enabled convolution with epilogue fusion for Volta Tensor Cores. * Compilation fixes * Disabled testing Volta on Ampere architectures.	2022-01-30 23:24:50 -05:00
Fujun Han	1e4703cbab	Support parallel split K mode for porfiling (#277 ) * Support parallel split K mode for porfiling Signed-off-by: Peter Han <fujun.han@iluvatar.ai> * Parallel Split K support 1. find gemm kernel by preference key 2. switch m n for redution kernel Signed-off-by: Peter Han <fujun.han@iluvatar.ai> * parallel splitk for fp16 gemm * add one missing file Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2022-01-27 10:37:37 -05:00
Dustyn Blasig	c3353add63	Merge pull request #388 from depaulmillz/fix/headersonly Fix utils include not being installed in header only	2022-01-26 14:22:51 -06:00
dePaul Miller	ac8825b941	Minor fix to change from LIBRARY_INIT to LIBRARY	2022-01-26 15:17:46 -05:00
Haicheng Wu	8fd94806e5	Update PUBLICATIONS.md add mlsys 2022 paper.	2022-01-17 00:08:18 -05:00
Masahiro Masuda	d7c9cbf0b9	Fix typo in scripts/library.py (wrong data size for u8) (#393 )	2022-01-07 13:29:56 -05:00
masahi	c2ee13a0fe	Add epilogue functor for residual block fusion (#391 ) * Add epilogue functor for residual block fusion * Do not run split-k tests when ActivationOp is not Identity * explain TestSplitK param * return early	2021-12-29 22:53:40 -05:00
Haicheng Wu	f78994bb40	add the missing pieces (#392 ) Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2021-12-25 04:29:54 -08:00
masahi	dceabd4c5a	Support half precision sigmoid activation (#378 ) * Support half precision sigmoid activation * introduce a vectorized variant using fast_tanh * move the math to fast_math.h * fixed compile * .raw() -> .to_half() Co-authored-by: Haicheng Wu <haichengw@nvidia.com>	2021-12-22 14:45:06 -05:00
dePaul Miller	86fa1dc30b	Fix utils include not being installed in header only	2021-12-21 12:10:26 -05:00
Andrew Kerr	288af365db	Added missing synchronization to avoid WAR hazards between tiles. (#386 )	2021-12-20 08:34:08 -08:00
masahi	0dc3ba60b3	Refactor GELU and Sigmoid epilogue to use a common template (and add SiLu, Hardswish epilogue) (#379 ) * Support half precision sigmoid activation * introduce a vectorized variant using fast_tanh * refactored sigmoid using the new interface * refactored gelu * add silu activation * add hardswish * remove sigmoid for now * add description to silu and hardswish, and other doc update * Do not ignore Round * use constant N * Set isHeavy = true in sigmoid and silu epilogue	2021-12-18 14:58:15 -05:00
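
Note on d7b499deff (#446): the commit message describes wrapping the macro's `if` in a `do while` so that a caller's `else` cannot bind to it. The following is a minimal host-only sketch of that idiom, not the actual CUTLASS macro. `Status`, `report`, `CHECK_UNSAFE`, `CHECK_SAFE`, `device_free`, `release_unsafe`, and `release_safe` are all hypothetical names introduced for illustration, and the sketch records the failure instead of exiting so the behavior can be observed.

```cpp
#include <cstdio>
#include <string>

// Hypothetical stand-in for cudaError_t; 0 means success.
using Status = int;

// Text of the most recent failing expression (illustration only).
static std::string last_failure;

// Records the failure instead of exiting, so the sketch stays observable.
inline void report(const char* expr, Status s) {
  last_failure = expr;
  std::fprintf(stderr, "%s failed with status %d\n", expr, s);
}

// Unsafe form: a bare `if`. A caller's `else` can bind to this hidden `if`.
#define CHECK_UNSAFE(expr) \
  if (Status s_ = (expr)) { report(#expr, s_); }

// Safe form: `do { ... } while (0)` is a single statement that needs a semicolon.
#define CHECK_SAFE(expr) \
  do { Status s_ = (expr); if (s_) { report(#expr, s_); } } while (0)

// Hypothetical device call that always succeeds.
inline Status device_free() { return 0; }

// Returns true if the host fallback branch ran.
inline bool release_unsafe(bool on_device) {
  bool host_freed = false;
  if (on_device)
    CHECK_UNSAFE(device_free())  // no semicolon: `else` pairs with the macro's `if`
  else
    host_freed = true;
  return host_freed;
}

// Same caller code using the safe macro; the semicolon is now mandatory.
inline bool release_safe(bool on_device) {
  bool host_freed = false;
  if (on_device)
    CHECK_SAFE(device_free());
  else
    host_freed = true;
  return host_freed;
}
```

With the unsafe macro, the host fallback runs when `on_device` is true and the call succeeds, and is skipped when `on_device` is false, which is the opposite of what the caller wrote. With the safe macro, the `else` pairs with the caller's own `if`. Compilers may flag the unsafe caller with a dangling-else warning.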

230 Commits