From 1d7f2a207ec215e037099f4ba5632ccfa0249673 Mon Sep 17 00:00:00 2001 From: wang-y-z <57429717+wang-y-z@users.noreply.github.com> Date: Fri, 3 Nov 2023 12:01:25 +0800 Subject: [PATCH] Fix several broken links (#1168) Co-authored-by: isaacw --- CHANGELOG.md | 14 +++++++------- .../building_in_windows_with_visual_studio.md | 2 +- .../build/building_with_clang_as_host_compiler.md | 2 +- media/docs/efficient_gemm.md | 2 +- media/docs/pipeline.md | 2 +- 5 files changed, 11 insertions(+), 11 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index 80f71239..97af0acc 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -104,7 +104,7 @@ * [Grouped convolution targeting implicit GEMM](test/unit/conv/device/group_conv2d_fprop_implicit_gemm_f16nhwc_f16nhwc_f16nhwc_tensor_op_f32_sm80.cu) introduces the first group convolution implementation to CUTLASS. It is an Analytical implementation, not an Optimized. The restrictions are: 1) input and output channel number should be multiple of group number. 2) split-K is not supported. The implementation has 2 modes: * kSingleGroup: output channel per group is multiple of Threadblock tile N. * kMultipleGroup: Threadblock tile N is multiple of output channel per group. -* [Depthwise separable convolution](test/unit/conv/device/depthwise_fprop_implicit_gemm_f16nhwc_f16nhwc_f16nhwc_simt_f16_sm60.cu) introduces the first depthwise convolution which is also Analytical for now. The restrictions are: 1) SIMT only 2) No split-K 3) input channel equals to output channel equals to group number. +* [Depthwise separable convolution](test/unit/conv/device/depthwise_conv2d_fprop_implicit_gemm_f16nhwc_f16nhwc_f16nhwc_simt_f16_sm60.cu) introduces the first depthwise convolution which is also Analytical for now. The restrictions are: 1) SIMT only 2) No split-K 3) input channel equals to output channel equals to group number. * Standalone [Layernorm](/tools/util/include/cutlass/util/device_layernorm.h) and [Pooling](/tools/util/include/cutlass/util/device_nhwc_pooling.h) kernels. * [Back-to-back GEMM/CONV](examples/13_two_tensor_op_fusion) relaxes the requirement that the first GEMM K dimension needs to be the multiple of Threadblock Tile K dimension. * Optimal performance using [**CUDA 11.6u2**](https://developer.nvidia.com/cuda-downloads) @@ -119,10 +119,10 @@ * [Python-based instance emitter](/python/cutlass_library/generator.py) in the CUTLASS Library and support in the Profiler * [BLAS3](https://docs.nvidia.com/cuda/cublas/index.html#cublas-level-3-function-reference) operators accelerated by Tensor Cores * Supported types: f32, cf32, f64, cf64, tf32x3, complex tf32x3 - * [HERK](/test/unit/gemm/device/her2k_cf32h_cf32n_tensor_op_fast_f32_sm80.cu) with [emitter](/tools/library/scripts/rank_k_operation.py) - * [SYRK](/test/unit/gemm/device/syrk_f32n_f32t_tensor_op_fast_f32_sm80.cu) with [emitter](/tools/library/scripts/rank_k_operation.py) - * [SYMM](/test/unit/gemm/device/symm_f32n_f32n_tensor_op_fast_f32_ls_sm80.cu) with [emitter](/tools/library/scripts/symm_operation.py) - * [TRMM](/test/unit/gemm/device/trmm_f32n_f32t_f32t_tensor_op_fast_f32_ls_sm80.cu) with [emitter](/tools/library/scripts/trmm_operation.py) + * [HERK](/test/unit/gemm/device/her2k_cf32h_cf32n_tensor_op_fast_f32_sm80.cu) with [emitter](/python/cutlass_library/rank_k_operation.py) + * [SYRK](/test/unit/gemm/device/syrk_f32n_f32t_tensor_op_fast_f32_sm80.cu) with [emitter](/python/cutlass_library/rank_k_operation.py) + * [SYMM](/test/unit/gemm/device/symm_f32n_f32n_tensor_op_fast_f32_ls_sm80.cu) with [emitter](/python/cutlass_library/symm_operation.py) + * [TRMM](/test/unit/gemm/device/trmm_f32n_f32t_f32t_tensor_op_fast_f32_ls_sm80.cu) with [emitter](/python/cutlass_library/trmm_operation.py) * [Unit tests](/test/unit/gemm/device/testbed_rank_k_universal.h) * [CUTLASS Python](/examples/40_cutlass_py) demonstrating JIT compilation of CUTLASS kernels and a Python-based runtime using [CUDA Python](https://developer.nvidia.com/cuda-python) * [Python-based runtime](/tools/library/scripts/rt.py) interoperable with existing emitters @@ -153,7 +153,7 @@ * **TF32x3:** emulated single-precision using Tensor Cores * 45+ TFLOPs on NVIDIA A100 * [GEMM SDK example](/examples/27_ampere_3xtf32_fast_accurate_tensorop_gemm/27_ampere_3xtf32_fast_accurate_tensorop_gemm.cu) (real) - * [COMPLEX GEMM SDK example](/examples/29_ampere_3xtf32_fast_accurate_tensorop_complex_gemm/29_ampere_3xtf32_fast_accurate_tensorop_complex_gemm.cu) (complex) + * [COMPLEX GEMM SDK example](/examples/29_ampere_3xtf32_fast_accurate_tensorop_complex_gemm/29_3xtf32_complex_gemm.cu) (complex) * [Implicit GEMM Convolution SDK example](/examples/28_ampere_3xtf32_fast_accurate_tensorop_fprop/ampere_3xtf32_fast_accurate_tensorop_fprop.cu) * **Mainloop fusion for Convolution:** convolution with fused per-channel scale-bias-relu * [Conv Fprop SDK example](/examples/25_ampere_fprop_mainloop_fusion/ampere_fprop_mainloop_fusion.cu) @@ -205,7 +205,7 @@ * Support using new `Dy` and `w` analytic iterators and existing `cutlass::conv::device::ImplicitGemmConvolution` interface * Quaternion-valued GEMM and Convolution in single- and double-precision (targeting CUDA Cores) * Updates to [quaternion.h](/include/cutlass/quaternion.h) and [functional.h](/include/cutlass/functional.h) - * SDK Example for [GEMM](/examples/21_quaternion_gemm/quaternion_gemm.cu) and [Convolution](/examples/22_quaternion_gemm/quaternion_conv.cu) + * SDK Example for [GEMM](/examples/21_quaternion_gemm/quaternion_gemm.cu) and [Convolution](/examples/22_quaternion_conv/quaternion_conv.cu) * [Unit tests for GEMM](/test/unit/gemm/device/simt_qgemm_nn_sm50.cu) and [Convolution](/test/unit/conv/device/conv2d_fprop_implicit_gemm_qf32nhwc_qf32nhwc_qf32nhwc_simt_f32_sm50.cu) * Many improvements to the epilogue. * Provide an [option](/include/cutlass/epilogue/threadblock/epilogue.h) to not fully unroll the epilogue to reduce the code size and improve the performance when using complicated elementwise operations diff --git a/media/docs/build/building_in_windows_with_visual_studio.md b/media/docs/build/building_in_windows_with_visual_studio.md index 51bdf7e5..4915cc31 100644 --- a/media/docs/build/building_in_windows_with_visual_studio.md +++ b/media/docs/build/building_in_windows_with_visual_studio.md @@ -1,4 +1,4 @@ -[README](../README.md#documentation) > **CUTLASS 3.0: Building on Windows with Visual Studio** +[README](/README.md#documentation) > **CUTLASS 3.0: Building on Windows with Visual Studio** # Building on Windows with Visual Studio diff --git a/media/docs/build/building_with_clang_as_host_compiler.md b/media/docs/build/building_with_clang_as_host_compiler.md index cde92206..6a46e82f 100644 --- a/media/docs/build/building_with_clang_as_host_compiler.md +++ b/media/docs/build/building_with_clang_as_host_compiler.md @@ -1,4 +1,4 @@ -[README](../README.md#documentation) > **CUTLASS 3: Building with Clang as host compiler** +[README](/README.md#documentation) > **CUTLASS 3: Building with Clang as host compiler** # Building with Clang as host compiler diff --git a/media/docs/efficient_gemm.md b/media/docs/efficient_gemm.md index ddb9043c..7501a7c5 100644 --- a/media/docs/efficient_gemm.md +++ b/media/docs/efficient_gemm.md @@ -241,7 +241,7 @@ The third kernel design is the [*Warp-Specialized Persistent Ping-Pong*](/includ Like the Warp-Specialized Persistent Cooperative, kernel the concepts of warp groups, barrier synchronization between warp groups, and the shape of the grid launch remain the same in the persistent ping-pong design. The distinctive feature of the Warp-Specialized Persistent Ping-Pong kernel is the following : * The two *consumer* warp groups are assigned a different output tile using the Tile Scheduler. This allows for *epilogue* of one *consumer* warp group to be overlapped with the math operations of the other *consumer* warp group - thus maximizing tensor core utilization. -* The *producer* warp group synchronizes using the [Ordered Sequence Barrier](/include/cutlass/pipeline.hpp) to fill buffers of the two *consumer* warp groups one after the other in order. +* The *producer* warp group synchronizes using the [Ordered Sequence Barrier](/include/cutlass/pipeline/pipeline.hpp) to fill buffers of the two *consumer* warp groups one after the other in order. # Resources diff --git a/media/docs/pipeline.md b/media/docs/pipeline.md index 1e15b4e4..cd051bcc 100644 --- a/media/docs/pipeline.md +++ b/media/docs/pipeline.md @@ -174,7 +174,7 @@ Please note that this is a basic example. There are different versions possible, depending on what the producer and consumer threads are doing. Please refer to our [unit tests](/test/unit/pipeline) -and the other [pipeline classes](/include/cutlass/pipeline.hpp) +and the other [pipeline classes](/include/cutlass/pipeline/pipeline.hpp) for more details. # Copyright