mengchi.hmc 
							
						 
					 
					
						
						
						
						
							
						
						
							7ec3a87f22 
							
						 
					 
					
						
						
							
							Support unaligned input for conv2d fprop stage=2. Fix for issue #242
						
						
						
					 
					
						2021-04-21 14:40:05 +08:00 
						 
				 
			
				
					
						
							
							
								KeDengMS 
							
						 
					 
					
						
						
						
						
							
						
						
							0b74c8f473 
							
						 
					 
					
						
						
							
							Address CR  
						
						
						
					 
					
						2021-04-19 23:36:06 +00:00 
						 
				 
			
				
					
						
							
							
								KeDengMS 
							
						 
					 
					
						
						
						
						
							
						
						
							83036ed646 
							
						 
					 
					
						
						
							
							More clean up  
						
						
						
					 
					
						2021-04-18 04:29:20 +00:00 
						 
				 
			
				
					
						
							
							
								KeDengMS 
							
						 
					 
					
						
						
						
						
							
						
						
							b7e43f5eb9 
							
						 
					 
					
						
						
							
							Clean up  
						
						
						
					 
					
						2021-04-18 04:24:25 +00:00 
						 
				 
			
				
					
						
							
							
								KeDengMS 
							
						 
					 
					
						
						
						
						
							
						
						
							5c62d892fa 
							
						 
					 
					
						
						
							
							Add test  
						
						
						
					 
					
						2021-04-18 04:09:34 +00:00 
						 
				 
			
				
					
						
							
							
								KeDengMS 
							
						 
					 
					
						
						
						
						
							
						
						
							41a31b404b 
							
						 
					 
					
						
						
							
							Fixes to Gelu for half and fusion  
						
						
						
					 
					
						2021-04-17 22:10:19 +00:00 
						 
				 
			
				
					
						
							
							
								Peter Han 
							
						 
					 
					
						
						
						
						
							
						
						
							7320aee17d 
							
						 
					 
					
						
						
							
							Typo fix issue#236  
						
						
						
						
						Signed-off-by: Peter Han <fujun.han@iluvatar.ai> 
						
					 
					
						2021-04-15 15:08:35 +08:00 
						 
				 
			
				
					
						
							
							
								Peter Han 
							
						 
					 
					
						
						
						
						
							
						
						
							2142a05d9d 
							
						 
					 
					
						
						
							
							transpose.h update based on issue #233
						
						1. Add '#pragma once' preprocessor directive
						2. Replace prmt PTX with __byte_perm intrinsic
						Signed-off-by: Peter Han <fujun.han@iluvatar.ai>
						
					 
					
						2021-04-14 19:58:00 +08:00 
						 
				 
			
				
					
						
							
							
								Haicheng Wu 
							
						 
					 
					
						
						
							
							
						
						
						
							
						
						
							c77a524459 
							
						 
					 
					
						
						
							
							Merge pull request #230 from mani-ananth/master
						
						Fix for issue #221
						
					 
					
						2021-04-09 14:45:55 -04:00 
						 
				 
			
				
					
						
							
							
								Manikandan Ananth 
							
						 
					 
					
						
						
						
						
							
						
						
							fac6680f31 
							
						 
					 
					
						
						
							
							Merge branch 'master' of github.com:NVIDIA/cutlass  
						
						
						
					 
					
						2021-04-09 11:36:31 -07:00 
						 
				 
			
				
					
						
							
							
								Manikandan Ananth 
							
						 
					 
					
						
						
						
						
							
						
						
							08993707da 
							
						 
					 
					
						
						
							
							fixing functional bug in fused epilogue  
						
						
						
					 
					
						2021-04-09 11:36:03 -07:00 
						 
				 
			
				
					
						
							
							
								Haicheng Wu 
							
						 
					 
					
						
						
							
							
						
						
						
							
						
						
							c805593ebe 
							
						 
					 
					
						
						
							
							Merge pull request #228 from mani-ananth/master
						
						Fix for issue #224 and issue #225
						
					 
					
						2021-04-08 10:08:13 -04:00 
						 
				 
			
				
					
						
							
							
								Manikandan Ananth 
							
						 
					 
					
						
						
						
						
							
						
						
							26556d7206 
							
						 
					 
					
						
						
							
							Fix a broken sparse GEMM example found by the community.
						
						
						
					 
					
						2021-04-07 13:32:55 -07:00 
						 
				 
			
				
					
						
							
							
								Manikandan Ananth 
							
						 
					 
					
						
						
						
						
							
						
						
							4839b6cb61 
							
						 
					 
					
						
						
							
							add 2stage fprop 3d into default file  
						
						
						
					 
					
						2021-04-07 13:29:32 -07:00 
						 
				 
			
				
					
						
							
							
								Haicheng Wu 
							
						 
					 
					
						
						
							
							
						
						
						
							
						
						
							d97214987a 
							
						 
					 
					
						
						
							
							Merge pull request #220 from Peter9606/wrong-stride-array-definition
						
						Bugfix: typo, make reduction device cases passed
						
					 
					
						2021-04-02 08:43:52 -04:00 
						 
				 
			
				
					
						
							
							
								Haicheng Wu 
							
						 
					 
					
						
						
							
							
						
						
						
							
						
						
							b0bbc6d548 
							
						 
					 
					
						
						
							
							Merge pull request #219 from mani-ananth/master
						
						Fix for issue #211
						
					 
					
						2021-04-02 08:42:09 -04:00 
						 
				 
			
				
					
						
							
							
								Peter Han 
							
						 
					 
					
						
						
						
						
							
						
						
							7074047a54 
							
						 
					 
					
						
						
							
							Bugfix: typo, make reduction device cases passed  
						
						
						
						
						Signed-off-by: Peter Han <fujun.han@iluvatar.ai> 
						
					 
					
						2021-04-02 09:35:23 +08:00 
						 
				 
			
				
					
						
							
							
								Manikandan Ananth 
							
						 
					 
					
						
						
						
						
							
						
						
							75a4737cfe 
							
						 
					 
					
						
						
							
							Fix for public issue #211
						
						- Add a slice-K tile size to the profiler
						- Fix num warps calculations in implicit gemm header
						
					 
					
						2021-04-01 14:42:00 -07:00 
						 
				 
			
				
					
						
							
							
								Haicheng Wu 
							
						 
					 
					
						
						
							
							
						
						
						
							
						
						
							8a3e4b8d02 
							
						 
					 
					
						
						
							
							Merge pull request #214 from Peter9606/separate-stream-error
						
						Bugfix: memsetAsync uses wrong default stream
						
					 
					
						2021-03-24 12:09:01 -04:00 
						 
				 
			
				
					
						
							
							
								Peter Han 
							
						 
					 
					
						
						
						
						
							
						
						
							6a6b4028bd 
							
						 
					 
					
						
						
							
							Revert wrong fix of params.update in GemmUniversalBase  
						
						
						
						
						Signed-off-by: Peter Han <fujun.han@iluvatar.ai> 
						
					 
					
						2021-03-23 23:20:40 +08:00 
						 
				 
			
				
					
						
							
							
								Peter Han 
							
						 
					 
					
						
						
						
						
							
						
						
							92393b2676 
							
						 
					 
					
						
						
							
							Bugfix: memsetAsync uses wrong default stream  
						
						
						
						
						Signed-off-by: Peter Han <fujun.han@iluvatar.ai> 
						
					 
					
						2021-03-23 21:11:42 +08:00 
						 
				 
			
				
					
						
							
							
								Haicheng Wu 
							
						 
					 
					
						
						
							
							
						
						
						
							
						
						
							50bf00e5f2 
							
						 
					 
					
						
						
							
							Merge pull request #193 from Peter9606/public_shape_type_from_Mma_HFMA2
						
						HFMA2 Convolutions for SM60 onwards
						
					 
					
						2021-03-05 21:38:59 -05:00 
						 
				 
			
				
					
						
							
							
								Manish Gupta 
							
						 
					 
					
						
						
						
						
							
						
						
							4cd004ead1 
							
						 
					 
					
						
						
							
							fix test name to optimized and instance large tile sizes to speed unit tests  
						
						
						
					 
					
						2021-03-05 13:32:36 -08:00 
						 
				 
			
				
					
						
							
							
								Peter Han 
							
						 
					 
					
						
						
						
						
							
						
						
							6c4539e372 
							
						 
					 
					
						
						
							
							Make arch tag of test cases more precisely to SM60  
						
						
						
						
						Signed-off-by: Peter Han <fujun.han@iluvatar.ai> 
						
					 
					
						2021-03-05 10:53:26 +08:00 
						 
				 
			
				
					
						
							
							
								Peter Han 
							
						 
					 
					
						
						
						
						
							
						
						
							a3639ab1a0 
							
						 
					 
					
						
						
							
							Append fp16 test case to verify Mma_HFMA2  
						
						
						
						
						Signed-off-by: Peter Han <fujun.han@iluvatar.ai> 
						
					 
					
						2021-03-04 18:17:57 +08:00 
						 
				 
			
				
					
						
							
							
								Peter Han 
							
						 
					 
					
						
						
						
						
							
						
						
							169181f30f 
							
						 
					 
					
						
						
							
							Make Shape public from Mma_HFMA2.  
						
						
						
						
						Signed-off-by: Peter Han <fujun.han@iluvatar.ai> 
						
					 
					
						2021-03-04 11:05:16 +08:00 
						 
				 
			
				
					
						
							
							
								Haicheng Wu 
							
						 
					 
					
						
						
							
							
						
						
						
							
						
						
							0f1056390d 
							
						 
					 
					
						
						
							
							Create PUBLICATIONS.md (#189)
						
						
						
					 
					
						2021-03-03 11:17:40 -08:00 
						 
				 
			
				
					
						
							
							
								Haicheng Wu 
							
						 
					 
					
						
						
							
							
						
						
						
							
						
						
							34a42e5620 
							
						 
					 
					
						
						
							
							Update generator.py (#192)
						
						
						
					 
					
						2021-03-02 12:21:48 -08:00 
						 
				 
			
				
					
						
							
							
								Dustyn Blasig 
							
						 
					 
					
						
						
							
							
						
						
						
							
						
						
							8f09b82b12 
							
						 
					 
					
						
						
							
							Merge pull request #187 from NVIDIA/cutlass_2.5
						
						CUTLASS 2.5.0
						
					 
					
						2021-02-26 23:56:04 -06:00 
						 
				 
			
				
					
						
							
							
								Andrew Kerr 
							
						 
					 
					
						
						
						
						
							
						
						
							200a5a5146 
							
						 
					 
					
						
						
							
							Enabled reduction unit tests.  
						
						
						
					 
					
						2021-02-26 15:46:57 -05:00 
						 
				 
			
				
					
						
							
							
								Andrew Kerr 
							
						 
					 
					
						
						
						
						
							
						
						
							746b7b3247 
							
						 
					 
					
						
						
							
							Enabled tensor reduction kernels.  
						
						
						
					 
					
						2021-02-26 15:32:19 -05:00 
						 
				 
			
				
					
						
							
							
								Andrew Kerr 
							
						 
					 
					
						
						
						
						
							
						
						
							abdf16a4d9 
							
						 
					 
					
						
						
							
							Updated release notes.  
						
						
						
					 
					
						2021-02-26 13:55:04 -05:00 
						 
				 
			
				
					
						
							
							
								Andrew Kerr 
							
						 
					 
					
						
						
						
						
							
						
						
							0e13748649 
							
						 
					 
					
						
						
							
							CUTLASS 2.5  
						
						
						
					 
					
						2021-02-26 09:58:26 -05:00 
						 
				 
			
				
					
						
							
							
								Manish Gupta 
							
						 
					 
					
						
						
						
						
							
						
						
							ccb697bac7 
							
						 
					 
					
						
						
							
							cutlass 2.4 documentation only update  
						
						
						
					 
					
						2020-11-23 06:59:45 -06:00 
						 
				 
			
				
					
						
							
							
								Yang Wang 
							
						 
					 
					
						
						
							
							
						
						
						
							
						
						
							e6bcdc60cf 
							
						 
					 
					
						
						
							
							fix broken links (#148)
						
						
						
					 
					
						2020-11-19 21:46:54 -08:00 
						 
				 
			
				
					
						
							
							
								Manish Gupta 
							
						 
					 
					
						
						
							
							
						
						
						
							
						
						
							6615010cd0 
							
						 
					 
					
						
						
							
							CUTLASS 2.4 (Implicit GEMM convolution) (#147)
						
						CUTLASS 2.4 (Implicit GEMM Convolution)
						Co-authored-by: Manish Gupta <manigupta@nvidia.com>, Haicheng Wu <haichengw@nvidia.com>, Dustyn Blasig <dblasig@nvidia.com>, Andrew Kerr <akerr@nvidia.com>
						
					 
					
						2020-11-19 21:25:25 -08:00 
						 
				 
			
				
					
						
							
							
								Dustyn Blasig 
							
						 
					 
					
						
						
							
							
						
						
						
							
						
						
							c2b80ad4e4 
							
						 
					 
					
						
						
							
							Merge pull request #135 from NVIDIA/cutlass_2.3_final
						
						CUTLASS 2.3.0
						
					 
					
						2020-09-25 13:25:26 -05:00 
						 
				 
			
				
					
						
							
							
								akerr 
							
						 
					 
					
						
						
						
						
							
						
						
							37a8f9e598 
							
						 
					 
					
						
						
							
							CUTLASS 2.3.0 final.  
						
						
						
					 
					
						2020-09-25 10:34:46 -07:00 
						 
				 
			
				
					
						
							
							
								Andrew Kerr 
							
						 
					 
					
						
						
							
							
						
						
						
							
						
						
							c53f3339bb 
							
						 
					 
					
						
						
							
							CUTLASS 2.3 initial commit (#134)
						
						CUTLASS 2.3 adds GEMMs targeting Sparse Tensor Cores on the NVIDIA Ampere Architecture, fast SGEMM, small matrix classes, bug fixes, and performance enhancements.
						
					 
					
						2020-09-23 14:00:58 -07:00 
						 
				 
			
				
					
						
							
							
								hwu36 
							
						 
					 
					
						
						
							
							
						
						
						
							
						
						
							4dac7490e6 
							
						 
					 
					
						
						
							
							Typos (#107)
						
						* Update splitk_gemm.cu
						* Update gemm_bias_relu.cu
						* Update mma_sm75.h
						
					 
					
						2020-07-13 14:25:52 -07:00 
						 
				 
			
				
					
						
							
							
								Andrew Kerr 
							
						 
					 
					
						
						
							
							
						
						
						
							
						
						
							fd7e058d0c 
							
						 
					 
					
						
						
							
							Added examples to enable the unity build (#102)
						
						* Updated documentation of fused GEMM example and removed UNITY BUILD batch size. The default batch size when unity build is enabled tends to be favorable.
						
					 
					
						2020-06-17 07:09:18 -07:00 
						 
				 
			
				
					
						
							
							
								Andrew Kerr 
							
						 
					 
					
						
						
							
							
						
						
						
							
						
						
							1ab1027954 
							
						 
					 
					
						
						
							
							Updated mma_sm80.h to avoid perf penalty due to reinterpret_cast<>. (#100)
						
						- Updated mma_sm80.h to avoid perf penalty due to reinterpret_cast<>.
						- Enhancement to CUTLASS Utility Library's HostTensorPlanarComplex template to support copy-in and copy-out
						- Added test_examples target to build and test all CUTLASS examples
						- Minor edits to documentation to point to GTC 2020 webinar
						
					 
					
						2020-06-15 10:47:01 -07:00 
						 
				 
			
				
					
						
							
							
								Andrew Kerr 
							
						 
					 
					
						
						
							
							
						
						
						
							
						
						
							86931fef85 
							
						 
					 
					
						
						
							
							CUTLASS 2.2 (#96)
						
						Adds support for NVIDIA Ampere Architecture features. CUDA 11 Toolkit recommended.
						
					 
					
						2020-06-08 16:17:35 -07:00 
						 
				 
			
				
					
						
							
							
								Vijay Thakkar 
							
						 
					 
					
						
						
							
							
						
						
						
							
						
						
							e33d90b361 
							
						 
					 
					
						
						
							
							Update tools/library/CMakeLists to require Python 3.6 according to #70 (#82)
						
						#70 only updates the documentation. This commit applies the same Python version bump to the CMake configuration as well.
					
						2020-04-08 10:54:36 -07:00 
						 
				 
			
				
					
						
							
							
								Andrew Kerr 
							
						 
					 
					
						
						
							
							
						
						
						
							
						
						
							96dab34ad9 
							
						 
					 
					
						
						
							
							CUTLASS 2.1 (#83)
						
						CUTLASS 2.1 contributes:
						- BLAS-style host-side API added to CUTLASS Library
						- Planar Complex GEMM kernels targeting Volta and Turing Tensor Cores
						- Minor enhancements and bug fixes
						
					 
					
						2020-04-07 13:51:25 -07:00 
						 
				 
			
				
					
						
							
							
								Andrew Kerr 
							
						 
					 
					
						
						
							
							
						
						
						
							
						
						
							7c0cd26d13 
							
						 
					 
					
						
						
							
							Need Python 3.6 to use enum.auto() (#70)
						
						
						
					 
					
						2019-11-22 09:39:12 -08:00 
						 
				 
			
				
					
						
							
							
								Andrew Kerr 
							
						 
					 
					
						
						
							
							
						
						
						
							
						
						
							45ecbc885b 
							
						 
					 
					
						
						
							
							Removed redundant conjugation operations from matrix_traits. (#65)
						
						
						
					 
					
						2019-11-20 11:27:13 -08:00 
						 
				 
			
				
					
						
							
							
								Andrew Kerr 
							
						 
					 
					
						
						
							
							
						
						
						
							
						
						
							8aca98f9a7 
							
						 
					 
					
						
						
							
							Improved formatting, clarity, and content of several documents. (#64)
						
						* Improved formatting, clarity, and content of several documents.
						
					 
					
						2019-11-20 10:42:15 -08:00 
						 
				 
			
				
					
						
							
							
								Dustyn Blasig 
							
						 
					 
					
						
						
						
						
							
						
						
							f4d9c8f755 
							
						 
					 
					
						
						
							
							Clang GPU compilation requires explicit CUDACC version flags (#63)
						
						
						
					 
					
						2019-11-20 09:52:11 -08:00 
						 
				 
			
				
					
						
							
							
								Andrew Kerr 
							
						 
					 
					
						
						
							
							
						
						
						
							
						
						
							fb335f6a5f 
							
						 
					 
					
						
						
							
							CUTLASS 2.0 (#62)
						
						CUTLASS 2.0
						Substantially refactored for:
						- Better performance, particularly for native Turing Tensor Cores
						- Robust and durable templates spanning the design space
						- Encapsulated functionality embodying modern C++11 programming techniques
						- Optimized containers and data types for efficient, generic, portable device code
						Updates to:
						- Quick start guide
						- Documentation
						- Utilities
						- CUTLASS Profiler
						Native Turing Tensor Cores:
						- Efficient GEMM kernels targeting Turing Tensor Cores
						- Mixed-precision floating point, 8-bit integer, 4-bit integer, and binarized operands
						Coverage of existing CUTLASS functionality:
						- GEMM kernels targeting CUDA and Tensor Cores in NVIDIA GPUs
						- Volta Tensor Cores through native mma.sync and through WMMA API
						- Optimizations such as parallel reductions, threadblock rasterization, and intra-threadblock reductions
						- Batched GEMM operations
						- Complex-valued GEMMs
						Note: this commit and all that follow require a host compiler supporting C++11 or greater.
						
					 
					
						2019-11-19 16:55:34 -08:00