Update CHANGELOG.md

parent cc2ea4c3fc
commit 96dad61a75

CHANGELOG.md (19 changed lines)
@@ -8,7 +8,7 @@
   * [Unit tests](/test/unit/conv/device/conv2d_fprop_few_channels_f16nhwc_f16nhwc_f16nhwc_tensor_op_f32_sm80.cu)
 * [Python-based instance emitter](/tools/library/scripts/generator.py) in the CUTLASS Library and support in the Profiler
 * [BLAS3](https://docs.nvidia.com/cuda/cublas/index.html#cublas-level-3-function-reference) operators accelerated by Tensor Cores
-  * Supported types: f32, cf32, f64, cf64
+  * Supported types: f32, cf32, f64, cf64, tf32x3, complex tf32x3
   * [HERK](/test/unit/gemm/device/her2k_cf32h_cf32n_tensor_op_fast_f32_sm80.cu) with [emitter](/tools/library/scripts/rank_k_operation.py)
   * [SYRK](/test/unit/gemm/device/syrk_f32n_f32t_tensor_op_fast_f32_sm80.cu) with [emitter](/tools/library/scripts/rank_k_operation.py)
   * [SYMM](/test/unit/gemm/device/symm_f32n_f32n_tensor_op_fast_f32_ls_sm80.cu) with [emitter](/tools/library/scripts/symm_operation.py)
@@ -17,10 +17,25 @@
 * [CUTLASS Python](/example/40_cutlass_py) demonstrating JIT compilation of CUTLASS kernels and a Python-based runtime using [CUDA Python](https://developer.nvidia.com/cuda-python)
   * [Python-based runtime](/tools/library/scripts/rt.py) interoperable with existing emitters
 * [GEMM + Softmax example](/examples/35_gemm_softmax)
+* [Gather and Scatter Fusion with GEMM](/examples/36_gather_scatter_fusion) can gather inputs and scatter outputs based on index vectors in the same GEMM kernel.
+  * It can select random rows in a row-major matrix.
+  * It can select random columns in a column-major matrix.
+* [Back-to-back GEMM/CONV](examples/13_two_tensor_op_fusion) fully supports buffering the results of the previous GEMM/CONV in shared memory for the subsequent one to use, which can eliminate register spills when the tile size is large.
+  * Supported kernels: GEMM and CONV.
+  * Supported types: fp16 and int8.
+  * Supported architectures: Turing and Ampere.
+* [Transposed Convolution](/examples/34_transposed_conv2d) (a.k.a. deconvolution) support, reusing the Dgrad implementation.
+* [Utility functions](/tools/util/include/cutlass/util) that can pad NHWC tensors and convert between NCHW and NHWC.
+* [Small-alignment implicit GEMM](https://github.com/NVIDIA/cutlass/issues/242) support for Fprop/Dgrad/Wgrad, so that padding is no longer required to use Tensor Cores in these kernels.
+* Epilogue enhancements:
+  * Eliminated bank conflicts in int8 Tensor Core kernels.
+  * half2 is used when the epilogue compute type is fp16.
+  * More activation functions: SiLU, Hardswish.
+  * New elementwise fusion pattern for [residual blocks](/include/cutlass/epilogue/thread/linear_combination_residual_block.h).
+* [Parallel GEMM split-K](https://github.com/NVIDIA/cutlass/pull/277) support in the CUTLASS profiler.
 * Optimal performance using [**CUDA 11.6u2**](https://developer.nvidia.com/cuda-downloads)
 * Updates and bugfixes from the community (thanks!)
 
 ## [2.8.0](https://github.com/NVIDIA/cutlass/releases/tag/v2.8.0) (2021-11-19)
 
 * **TF32x3:** emulated single-precision using Tensor Cores
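The `tf32x3` types above refer to emulating fp32-level accuracy with three TF32 Tensor Core passes: round each operand to TF32, compute a residual, and accumulate three products in fp32. A rough numpy illustration of the idea (not the CUTLASS implementation; it uses simple mantissa truncation where the real kernels may round differently):

```python
import numpy as np

def to_tf32(x):
    # Simulate TF32 by zeroing the low 13 mantissa bits of fp32
    # (TF32 keeps 10 explicit mantissa bits). Illustration only.
    bits = x.astype(np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFFE000)).view(np.float32)

rng = np.random.default_rng(0)
A = rng.standard_normal((64, 64)).astype(np.float32)
B = rng.standard_normal((64, 64)).astype(np.float32)

# One-pass TF32 GEMM: each operand rounded once.
one_pass = to_tf32(A) @ to_tf32(B)

# Three-pass (tf32x3): split each operand into a TF32 "big" part plus a
# TF32 residual, then accumulate three products in fp32.
A_big, B_big = to_tf32(A), to_tf32(B)
A_small, B_small = to_tf32(A - A_big), to_tf32(B - B_big)
three_pass = A_big @ B_big + A_big @ B_small + A_small @ B_big

exact = A.astype(np.float64) @ B.astype(np.float64)
err1 = np.abs(one_pass - exact).max()
err3 = np.abs(three_pass - exact).max()
```

The third cross term `A_small @ B_small` is dropped because it is below fp32 round-off for these magnitudes, which is why three passes (not four) suffice.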
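The gather/scatter fusion in example 36 computes the GEMM only on selected rows and writes results to selected output rows. A minimal numpy sketch of the semantics (index names here are made up for illustration; CUTLASS performs the gather, GEMM, and scatter inside one kernel):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((8, 4))
B = rng.standard_normal((4, 3))

gather_idx = np.array([6, 2, 5])    # which rows of A to read
scatter_idx = np.array([0, 4, 7])   # which rows of D to write

# Fused semantics: gather rows of A, multiply by B, scatter into D.
D = np.zeros((8, 3))
D[scatter_idx] = A[gather_idx] @ B

# Unfused reference, one row at a time.
ref = np.zeros((8, 3))
for out_r, in_r in zip(scatter_idx, gather_idx):
    ref[out_r] = A[in_r] @ B
```

Selecting random columns of a column-major matrix is the symmetric case: the contiguous dimension is gathered instead of the strided one.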
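The NHWC utilities can be pictured with plain numpy. The helper names and the alignment multiple of 8 below are illustrative assumptions, not the actual CUTLASS helpers:

```python
import numpy as np

def nchw_to_nhwc(x):
    # Move channels to the innermost (fastest-varying) dimension.
    return np.ascontiguousarray(x.transpose(0, 2, 3, 1))

def nhwc_to_nchw(x):
    return np.ascontiguousarray(x.transpose(0, 3, 1, 2))

def pad_channels(nhwc, multiple=8):
    # Zero-pad C up to a multiple, as alignment-friendly layouts need.
    c = nhwc.shape[-1]
    pad = (-c) % multiple
    return np.pad(nhwc, [(0, 0)] * 3 + [(0, pad)])

x = np.arange(2 * 3 * 4 * 5, dtype=np.float32).reshape(2, 3, 4, 5)  # NCHW
y = nchw_to_nhwc(x)  # shape (2, 4, 5, 3)
```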
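A linear-combination epilogue typically computes `D = act(alpha * accum + beta * C)` elementwise. The newly added activations can be sketched as plain reference formulas (not the CUTLASS device code):

```python
import numpy as np

def silu(x):
    # SiLU (a.k.a. swish): x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def hardswish(x):
    # Hardswish: x * relu6(x + 3) / 6
    return x * np.clip(x + 3.0, 0.0, 6.0) / 6.0

xs = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
print(silu(xs))
print(hardswish(xs))
```

Hardswish is a piecewise-polynomial approximation of SiLU: identical to 0 below -3, to x above +3, and cheap to evaluate in a fused epilogue.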