From 5ae8133cfad6fc4e0c550b2ccc54bfce116fc70a Mon Sep 17 00:00:00 2001
From: Manish Gupta
Date: Mon, 13 Nov 2023 10:29:22 -0800
Subject: [PATCH] Doc only change changelog 3.3 (#1180)

---
 CHANGELOG.md | 10 +++++-----
 README.md    |  8 ++++----
 2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 97af0acc..c0a7ad8e 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,16 +1,16 @@
 # NVIDIA CUTLASS Changelog

 ## [3.3](https://github.com/NVIDIA/cutlass/releases/tag/v3.3) (2023-10-31)
-* [Mixed Precision Hopper GEMMs](/examples/55_hopper_mixed_dtype_gemm) support covering 16-bit x 8-bit input operand types.
-* [Mixed Precision Ampere GEMMs](https://github.com/NVIDIA/cutlass/commit/7d8317a63e0a978a8dbb3c1fb7af4dbe4f286616) with support for canonical layouts (TN) and {fp16, bf16} x {s8/u8}.
+* [Mixed-input Hopper GEMMs](/examples/55_hopper_mixed_dtype_gemm) supporting 16-bit x 8-bit input operand types.
+* [Mixed-input Ampere GEMMs](https://github.com/NVIDIA/cutlass/pull/1084) with support for canonical layouts (TN). The implementation supports upcast on operand B {fp16, bf16} x {s8, u8} and upcast on operand A {s8, u8} x {fp16, bf16}.
 * [Copy Async based Hopper GEMMs](/test/unit/gemm/device/sm90_gemm_bf16_bf16_bf16_alignx_tensor_op_f32_warpspecialized_cooperative.cu) - which support lower than 16B aligned input tensors.
 * Kernel schedules and Builder support for mixed precision and Copy Async GEMMs with < 16B aligned input tensors.
 * Profiler support for lower-aligned Hopper GEMMs.
-* Performance Improvements to [Scatter-Gather Hopper Example](/examples/52_hopper_gather_scatter_fusion)
-* Sub-Byte type fixes and improvements
+* Performance improvements to the [Scatter-Gather Hopper example](/examples/52_hopper_gather_scatter_fusion).
+* Sub-byte type fixes and improvements.
 * EVT Support for RELU with Aux bitmap tensor store (used in dRELU). See [SM90 EVT fusions](/include/cutlass/epilogue/fusion/sm90_visitor_compute_tma_warpspecialized.hpp) for details.
 * Fusion support for backprop fusions including drelu, dgelu, and dbias.
-* Support for void-C kernels and SM80 mixed-precision GEMMs in the CUTLASS Python interface
+* Support for void-C kernels and SM80 mixed-input GEMMs in the CUTLASS Python interface.

 ## [3.2.2](https://github.com/NVIDIA/cutlass/releases/tag/v3.2.1) (2023-10-25)
 * Minor patch for issue/1138
diff --git a/README.md b/README.md
index f1ea3b43..36b0d151 100644
--- a/README.md
+++ b/README.md
@@ -45,13 +45,13 @@ In addition to GEMMs, CUTLASS implements high-performance convolution via the im

 CUTLASS 3.3.0 is an update to CUTLASS adding:

-- New [Mixed Precision Hopper GEMMs](/examples/55_hopper_mixed_dtype_gemm) support covering 16-bit x 8-bit input types with optimal performance.
-- New [Mixed Precision Ampere GEMMs](https://github.com/NVIDIA/cutlass/commit/7d8317a63e0a978a8dbb3c1fb7af4dbe4f286616) with support for canonical layouts (TN) and {fp16, bf16} x {s8/u8}. They also include fast numeric conversion recipes and warp level shuffles to achieve optimal performance.
+- New [Mixed-input Hopper GEMMs](/examples/55_hopper_mixed_dtype_gemm) supporting 16-bit x 8-bit input types with optimal performance.
+- New [Mixed-input Ampere GEMMs](https://github.com/NVIDIA/cutlass/pull/1084) with support for canonical layouts (TN). The implementation supports upcast on operand B {fp16, bf16} x {s8, u8} and upcast on operand A {s8, u8} x {fp16, bf16}. They also include fast numeric conversion recipes and warp-level shuffles to achieve optimal performance.
 - New [Copy Async based Hopper GEMMs](/test/unit/gemm/device/sm90_gemm_bf16_bf16_bf16_alignx_tensor_op_f32_warpspecialized_cooperative.cu) - which support lower than 16B aligned input tensors (across s8/fp8/fp16/bf16/tf32 types) with optimal performance.
 As a part of this, new kernel schedules, and Copy Ops [SM80\_CP\_ASYNC\_CACHE\_\*](/include/cute/arch/copy_sm80.hpp) were also added.
 - EVT Support for RELU with Aux bitmap tensor store (used in dRELU). See [SM90 EVT fusions](/include/cutlass/epilogue/fusion/sm90_visitor_compute_tma_warpspecialized.hpp) for details.
 - Various subbyte enhancements like tagged device ptrs, support for vectorized copy, various operators to treat subbyte iterators as pointers, and full-fledged CuTe Tensor support.
-- Support for Clang as a host compiler
-- Support for void-C kernels and SM80 mixed-precision GEMMs in the CUTLASS Python interface
+- Support for Clang as a host compiler.
+- Support for void-C kernels and SM80 mixed-input GEMMs in the CUTLASS Python interface.

 Minimum requirements: