diff --git a/README.md b/README.md
index dc7cde32..79c11b3e 100644
--- a/README.md
+++ b/README.md
@@ -328,7 +328,7 @@ or a subset of kernels for NVIDIA Ampere and Turing architecture:
### Building a subset Tensor Core GEMM kernels
-To compile a subset of Tensor Core GEMM kernels with FP32 accumulation and FP16 input targetting NVIDIA Ampere and Turing architecture,
+To compile a subset of Tensor Core GEMM kernels with FP32 accumulation and FP16 input targeting NVIDIA Ampere and Turing architecture,
use the below cmake command line:
```bash
$ cmake .. -DCUTLASS_NVCC_ARCHS='75;80' -DCUTLASS_LIBRARY_KERNELS=cutlass_tensorop_s*gemm_f16_*_nt_align8
@@ -376,7 +376,7 @@ reference_device: Passed
### Building one CUDA Core GEMM kernel
-To compile one SGEMM kernel targetting NVIDIA Ampere and Turing architecture, use the below cmake command line:
+To compile one SGEMM kernel targeting NVIDIA Ampere and Turing architecture, use the below cmake command line:
```bash
$ cmake .. -DCUTLASS_NVCC_ARCHS='75;80' -DCUTLASS_LIBRARY_KERNELS=cutlass_simt_sgemm_128x128_8x2_nn_align1
...
@@ -418,7 +418,7 @@ $ ./tools/profiler/cutlass_profiler --kernels=sgemm --m=3456 --n=4096 --k=4096
### Building a subset of Tensor Core Convolution kernels
To compile a subset of Tensor core convolution kernels implementing forward propagation (fprop) with FP32 accumulation
-and FP16 input targetting NVIDIA Ampere and Turing architecture, use the below cmake command line:
+and FP16 input targeting NVIDIA Ampere and Turing architecture, use the below cmake command line:
```bash
$ cmake .. -DCUTLASS_NVCC_ARCHS='75;80' -DCUTLASS_LIBRARY_KERNELS=cutlass_tensorop_s*fprop_optimized_f16
...
@@ -466,7 +466,7 @@ reference_device: Passed
### Building one Convolution CUDA kernel
To compile and run one CUDA Core convolution kernel implementing forward propagation (fprop) with FP32 accumulation
-and FP32 input targetting NVIDIA Ampere and Turing architecture, use the below cmake command line:
+and FP32 input targeting NVIDIA Ampere and Turing architecture, use the below cmake command line:
```bash
$ cmake .. -DCUTLASS_NVCC_ARCHS='75;80' -DCUTLASS_LIBRARY_KERNELS=cutlass_simt_sfprop_optimized_128x128_8x2_nhwc
...
diff --git a/docs/annotated.html b/docs/annotated.html
index 43923cc2..233691c2 100644
--- a/docs/annotated.html
+++ b/docs/annotated.html
@@ -280,15 +280,15 @@ var searchBox = new SearchBox("searchBox", "search",false,'Search');
CDefaultGemmConfiguration< arch::OpClassWmmaTensorOp, ArchTag, ElementA, ElementB, ElementC, ElementAccumulator > | |
►CGemm | |
CArguments | Argument structure |
- ►CGemm< ElementA_, LayoutA_, ElementB_, LayoutB_, ElementC_, layout::ColumnMajor, ElementAccumulator_, OperatorClass_, ArchTag_, ThreadblockShape_, WarpShape_, InstructionShape_, EpilogueOutputOp_, ThreadblockSwizzle_, Stages, AlignmentA, AlignmentB, SplitKSerial, Operator_, IsBetaZero > | Parital specialization for column-major output exchanges problem size and operand |
+ ►CGemm< ElementA_, LayoutA_, ElementB_, LayoutB_, ElementC_, layout::ColumnMajor, ElementAccumulator_, OperatorClass_, ArchTag_, ThreadblockShape_, WarpShape_, InstructionShape_, EpilogueOutputOp_, ThreadblockSwizzle_, Stages, AlignmentA, AlignmentB, SplitKSerial, Operator_, IsBetaZero > | Partial specialization for column-major output exchanges problem size and operand |
CArguments | Argument structure |
►CGemmBatched | |
CArguments | Argument structure |
- ►CGemmBatched< ElementA_, LayoutA_, ElementB_, LayoutB_, ElementC_, layout::ColumnMajor, ElementAccumulator_, OperatorClass_, ArchTag_, ThreadblockShape_, WarpShape_, InstructionShape_, EpilogueOutputOp_, ThreadblockSwizzle_, Stages, AlignmentA, AlignmentB, Operator_ > | Parital specialization for column-major output exchanges problem size and operand |
+ ►CGemmBatched< ElementA_, LayoutA_, ElementB_, LayoutB_, ElementC_, layout::ColumnMajor, ElementAccumulator_, OperatorClass_, ArchTag_, ThreadblockShape_, WarpShape_, InstructionShape_, EpilogueOutputOp_, ThreadblockSwizzle_, Stages, AlignmentA, AlignmentB, Operator_ > | Partial specialization for column-major output exchanges problem size and operand |
CArguments | Argument structure |
►CGemmComplex | |
CArguments | Argument structure |
- ►CGemmComplex< ElementA_, LayoutA_, ElementB_, LayoutB_, ElementC_, layout::ColumnMajor, ElementAccumulator_, OperatorClass_, ArchTag_, ThreadblockShape_, WarpShape_, InstructionShape_, EpilogueOutputOp_, ThreadblockSwizzle_, Stages, TransformA, TransformB, SplitKSerial > | Parital specialization for column-major output exchanges problem size and operand |
+ ►CGemmComplex< ElementA_, LayoutA_, ElementB_, LayoutB_, ElementC_, layout::ColumnMajor, ElementAccumulator_, OperatorClass_, ArchTag_, ThreadblockShape_, WarpShape_, InstructionShape_, EpilogueOutputOp_, ThreadblockSwizzle_, Stages, TransformA, TransformB, SplitKSerial > | Partial specialization for column-major output exchanges problem size and operand |
CArguments | Argument structure |
►CGemmSplitKParallel | |
CArguments | Argument structure |
@@ -594,7 +594,7 @@ var searchBox = new SearchBox("searchBox", "search",false,'Search');
CGemm | |
CGemm< ElementA, LayoutA, ElementB, LayoutB, ElementC, LayoutC, ScalarType, AccumulatorType, arch::OpMultiplyAdd > | Partial specialization for multiply-add |
CGemm< ElementA, LayoutA, ElementB, LayoutB, ElementC, LayoutC, ScalarType, AccumulatorType, arch::OpMultiplyAddSaturate > | Partial specialization for multiply-add-saturate |
- CGemm< ElementA, LayoutA, ElementB, LayoutB, ElementC, LayoutC, ScalarType, AccumulatorType, arch::OpXorPopc > | Parital specialization for XOR-popc |
+ CGemm< ElementA, LayoutA, ElementB, LayoutB, ElementC, LayoutC, ScalarType, AccumulatorType, arch::OpXorPopc > | Partial specialization for XOR-popc |
CTensorDiagonalForEach | Launches a kernel calling a functor for each element along a tensor's diagonal |
CTensorForEach | Launches a kernel calling a functor for each element in a tensor's index space |
►Nhost | |
@@ -620,7 +620,7 @@ var searchBox = new SearchBox("searchBox", "search",false,'Search');
CGemm | |
CGemm< ElementA, LayoutA, ElementB, LayoutB, ElementC, LayoutC, ScalarType, ComputeType, arch::OpMultiplyAdd > | Partial specialization for multiply-add |
CGemm< ElementA, LayoutA, ElementB, LayoutB, ElementC, LayoutC, ScalarType, ComputeType, arch::OpMultiplyAddSaturate > | Partial specialization for multiply-add-saturate |
- CGemm< ElementA, LayoutA, ElementB, LayoutB, ElementC, LayoutC, ScalarType, ComputeType, arch::OpXorPopc > | Parital specialization for XOR-popc |
+ CGemm< ElementA, LayoutA, ElementB, LayoutB, ElementC, LayoutC, ScalarType, ComputeType, arch::OpXorPopc > | Partial specialization for XOR-popc |
►Nthread | |
CMatrix | Per-thread matrix object storing a packed matrix |
►Ntransform | |
diff --git a/docs/classcutlass_1_1gemm_1_1device_1_1GemmBatched_3_01ElementA___00_01LayoutA___00_01ElementB___00_0c9bb6f4463ab6085e6008b5d5ad6abfd.html b/docs/classcutlass_1_1gemm_1_1device_1_1GemmBatched_3_01ElementA___00_01LayoutA___00_01ElementB___00_0c9bb6f4463ab6085e6008b5d5ad6abfd.html
index d53d8d67..6800f4fe 100644
--- a/docs/classcutlass_1_1gemm_1_1device_1_1GemmBatched_3_01ElementA___00_01LayoutA___00_01ElementB___00_0c9bb6f4463ab6085e6008b5d5ad6abfd.html
+++ b/docs/classcutlass_1_1gemm_1_1device_1_1GemmBatched_3_01ElementA___00_01LayoutA___00_01ElementB___00_0c9bb6f4463ab6085e6008b5d5ad6abfd.html
@@ -108,7 +108,7 @@ var searchBox = new SearchBox("searchBox", "search",false,'Search');
-Parital specialization for column-major output exchanges problem size and operand.
+Partial specialization for column-major output exchanges problem size and operand.
#include <gemm_batched.h>
diff --git a/docs/classcutlass_1_1gemm_1_1device_1_1GemmComplex_3_01ElementA___00_01LayoutA___00_01ElementB___00_07c56401b4df75709ae636675d9980a9a.html b/docs/classcutlass_1_1gemm_1_1device_1_1GemmComplex_3_01ElementA___00_01LayoutA___00_01ElementB___00_07c56401b4df75709ae636675d9980a9a.html
index 84685856..d0978343 100644
--- a/docs/classcutlass_1_1gemm_1_1device_1_1GemmComplex_3_01ElementA___00_01LayoutA___00_01ElementB___00_07c56401b4df75709ae636675d9980a9a.html
+++ b/docs/classcutlass_1_1gemm_1_1device_1_1GemmComplex_3_01ElementA___00_01LayoutA___00_01ElementB___00_07c56401b4df75709ae636675d9980a9a.html
@@ -108,7 +108,7 @@ var searchBox = new SearchBox("searchBox", "search",false,'Search');
-Parital specialization for column-major output exchanges problem size and operand.
+Partial specialization for column-major output exchanges problem size and operand.
#include <gemm_complex.h>
diff --git a/docs/classcutlass_1_1gemm_1_1device_1_1Gemm_3_01ElementA___00_01LayoutA___00_01ElementB___00_01Layout4d0960ae6b1d1bf19e6239dbd002249c.html b/docs/classcutlass_1_1gemm_1_1device_1_1Gemm_3_01ElementA___00_01LayoutA___00_01ElementB___00_01Layout4d0960ae6b1d1bf19e6239dbd002249c.html
index f34be6e5..323ecfc2 100644
--- a/docs/classcutlass_1_1gemm_1_1device_1_1Gemm_3_01ElementA___00_01LayoutA___00_01ElementB___00_01Layout4d0960ae6b1d1bf19e6239dbd002249c.html
+++ b/docs/classcutlass_1_1gemm_1_1device_1_1Gemm_3_01ElementA___00_01LayoutA___00_01ElementB___00_01Layout4d0960ae6b1d1bf19e6239dbd002249c.html
@@ -108,7 +108,7 @@ var searchBox = new SearchBox("searchBox", "search",false,'Search');
-Parital specialization for column-major output exchanges problem size and operand.
+Partial specialization for column-major output exchanges problem size and operand.
#include <gemm.h>
diff --git a/docs/command__line_8h_source.html b/docs/command__line_8h_source.html
index f98c9f1e..325a3034 100644
--- a/docs/command__line_8h_source.html
+++ b/docs/command__line_8h_source.html
@@ -98,7 +98,7 @@ var searchBox = new SearchBox("searchBox", "search",false,'Search');
command_line.h
-Go to the documentation of this file. 33 #include <cuda_runtime.h> 45 std::vector<std::string>
keys;
47 std::vector<std::string>
args;
55 for (
int i = 1; i < argc; i++) {
58 if ((arg[0] !=
'-') || (arg[1] !=
'-')) {
63 string::size_type pos;
65 if ((pos = arg.find(
'=')) == string::npos) {
66 key = string(arg, 2, arg.length() - 2);
69 key = string(arg, 2, pos - 2);
70 val = string(arg, pos + 1, arg.length() - 1);
74 values.push_back(val);
84 for (
int i = 0; i < int(keys.size()); ++i) {
85 if (keys[i] ==
string(arg_name))
return true;
93 template <
typename value_t>
101 template <
typename value_t>
104 if (index < args.size()) {
105 istringstream str_stream(args[index]);
119 val = !(value ==
"0" || value ==
"false");
126 template <
typename value_t>
129 value_t
const& _default = value_t())
const {
134 for (
int i = 0; i < int(keys.size()); ++i) {
135 if (keys[i] ==
string(arg_name)) {
136 istringstream str_stream(values[i]);
145 template <
typename value_t>
147 std::vector<value_t>& vals,
148 char sep =
',')
const {
156 for (
int i = 0; i < keys.size(); ++i) {
157 if (keys[i] ==
string(arg_name)) {
158 string val_string(values[i]);
170 std::vector<std::pair<std::string, std::string> >& tokens,
172 char sep =
':')
const {
177 tokenize(tokens, value, delim, sep);
186 std::vector<std::vector<std::string> >& vals,
188 char sep =
':')
const {
189 std::vector<std::string> ranges;
192 for (std::vector<std::string>::const_iterator range = ranges.begin();
193 range != ranges.end(); ++range) {
195 std::vector<std::string> range_vals;
197 vals.push_back(range_vals);
211 static void tokenize(std::vector<std::pair<std::string, std::string> >& tokens,
212 std::string
const& str,
217 size_t d_idx = std::string::npos;
218 while (s_idx < str.size()) {
219 d_idx = str.find_first_of(delim, s_idx);
221 size_t end_idx = (d_idx != std::string::npos ? d_idx : str.size());
222 size_t sep_idx = str.find_first_of(sep, s_idx);
224 if (sep_idx == std::string::npos || sep_idx >= end_idx) {
229 std::pair<std::string, std::string> item(
230 str.substr(s_idx, sep_idx - s_idx),
231 str.substr(sep_idx + offset, end_idx - sep_idx - offset));
233 tokens.push_back(item);
239 static void tokenize(std::vector<std::string>& tokens,
240 std::string
const& str,
243 typedef std::vector<std::pair<std::string, std::string> > TokenVector;
244 typedef TokenVector::const_iterator token_iterator;
246 std::vector<std::pair<std::string, std::string> > token_pairs;
247 tokenize(token_pairs, str, delim, sep);
248 for (token_iterator tok = token_pairs.begin(); tok != token_pairs.end(); ++tok) {
249 tokens.push_back(tok->first);
253 template <
typename value_t>
255 std::vector<value_t>& vals,
257 std::istringstream str_stream(str);
258 std::string::size_type old_pos = 0;
259 std::string::size_type new_pos = 0;
263 while ((new_pos = str.find(sep, old_pos)) != std::string::npos) {
264 if (new_pos != old_pos) {
265 str_stream.width(new_pos - old_pos);
271 str_stream.ignore(1);
272 old_pos = new_pos + 1;
Definition: aligned_buffer.h:35
+Go to the documentation of this file. 33 #include <cuda_runtime.h> 45 std::vector<std::string>
keys;
47 std::vector<std::string>
args;
55 for (
int i = 1; i < argc; i++) {
58 if ((arg[0] !=
'-') || (arg[1] !=
'-')) {
63 string::size_type pos;
65 if ((pos = arg.find(
'=')) == string::npos) {
66 key = string(arg, 2, arg.length() - 2);
69 key = string(arg, 2, pos - 2);
70 val = string(arg, pos + 1, arg.length() - 1);
74 values.push_back(val);
84 for (
int i = 0; i < int(keys.size()); ++i) {
85 if (keys[i] ==
string(arg_name))
return true;
93 template <
typename value_t>
101 template <
typename value_t>
104 if (index < args.size()) {
105 istringstream str_stream(args[index]);
119 val = !(value ==
"0" || value ==
"false");
126 template <
typename value_t>
129 value_t
const& _default = value_t())
const {
134 for (
int i = 0; i < int(keys.size()); ++i) {
135 if (keys[i] ==
string(arg_name)) {
136 istringstream str_stream(values[i]);
145 template <
typename value_t>
147 std::vector<value_t>& vals,
148 char sep =
',')
const {
156 for (
int i = 0; i < keys.size(); ++i) {
157 if (keys[i] ==
string(arg_name)) {
158 string val_string(values[i]);
170 std::vector<std::pair<std::string, std::string> >& tokens,
172 char sep =
':')
const {
177 tokenize(tokens, value, delim, sep);
186 std::vector<std::vector<std::string> >& vals,
188 char sep =
':')
const {
189 std::vector<std::string> ranges;
192 for (std::vector<std::string>::const_iterator range = ranges.begin();
193 range != ranges.end(); ++range) {
195 std::vector<std::string> range_vals;
197 vals.push_back(range_vals);
211 static void tokenize(std::vector<std::pair<std::string, std::string> >& tokens,
212 std::string
const& str,
217 size_t d_idx = std::string::npos;
218 while (s_idx < str.size()) {
219 d_idx = str.find_first_of(delim, s_idx);
221 size_t end_idx = (d_idx != std::string::npos ? d_idx : str.size());
222 size_t sep_idx = str.find_first_of(sep, s_idx);
224 if (sep_idx == std::string::npos || sep_idx >= end_idx) {
229 std::pair<std::string, std::string> item(
230 str.substr(s_idx, sep_idx - s_idx),
231 str.substr(sep_idx + offset, end_idx - sep_idx - offset));
233 tokens.push_back(item);
239 static void tokenize(std::vector<std::string>& tokens,
240 std::string
const& str,
243 typedef std::vector<std::pair<std::string, std::string> > TokenVector;
244 typedef TokenVector::const_iterator token_iterator;
246 std::vector<std::pair<std::string, std::string> > token_pairs;
247 tokenize(token_pairs, str, delim, sep);
248 for (token_iterator tok = token_pairs.begin(); tok != token_pairs.end(); ++tok) {
249 tokens.push_back(tok->first);
253 template <
typename value_t>
255 std::vector<value_t>& vals,
257 std::istringstream str_stream(str);
258 std::string::size_type old_pos = 0;
259 std::string::size_type new_pos = 0;
263 while ((new_pos = str.find(sep, old_pos)) != std::string::npos) {
264 if (new_pos != old_pos) {
265 str_stream.width(new_pos - old_pos);
271 str_stream.ignore(1);
272 old_pos = new_pos + 1;
Definition: aligned_buffer.h:35
void get_cmd_line_argument(const char *arg_name, value_t &val, value_t const &_default=value_t()) const
Definition: command_line.h:127
void get_cmd_line_argument_pairs(const char *arg_name, std::vector< std::pair< std::string, std::string > > &tokens, char delim= ',', char sep= ':') const
Definition: command_line.h:169
@@ -116,7 +116,7 @@ var searchBox = new SearchBox("searchBox", "search",false,'Search');
CommandLine(int argc, const char **argv)
Definition: command_line.h:52
std::vector< std::string > args
Definition: command_line.h:47
Definition: command_line.h:44
-static void seperate_string(std::string const &str, std::vector< value_t > &vals, char sep= ',')
Definition: command_line.h:254
+static void separate_string(std::string const &str, std::vector< value_t > &vals, char sep= ',')
Definition: command_line.h:254
int parsed_argc() const
Definition: command_line.h:204
diff --git a/docs/device_2gemm__batched_8h.html b/docs/device_2gemm__batched_8h.html
index e648c67e..3ef58bf1 100644
--- a/docs/device_2gemm__batched_8h.html
+++ b/docs/device_2gemm__batched_8h.html
@@ -130,7 +130,7 @@ Classes
| Argument structure. More...
|
|
class | cutlass::gemm::device::GemmBatched< ElementA_, LayoutA_, ElementB_, LayoutB_, ElementC_, layout::ColumnMajor, ElementAccumulator_, OperatorClass_, ArchTag_, ThreadblockShape_, WarpShape_, InstructionShape_, EpilogueOutputOp_, ThreadblockSwizzle_, Stages, AlignmentA, AlignmentB, Operator_ > |
-| Parital specialization for column-major output exchanges problem size and operand. More...
|
+| Partial specialization for column-major output exchanges problem size and operand. More...
|
|
struct | cutlass::gemm::device::GemmBatched< ElementA_, LayoutA_, ElementB_, LayoutB_, ElementC_, layout::ColumnMajor, ElementAccumulator_, OperatorClass_, ArchTag_, ThreadblockShape_, WarpShape_, InstructionShape_, EpilogueOutputOp_, ThreadblockSwizzle_, Stages, AlignmentA, AlignmentB, Operator_ >::Arguments |
| Argument structure. More...
|
diff --git a/docs/device_2kernel_2tensor__foreach_8h_source.html b/docs/device_2kernel_2tensor__foreach_8h_source.html
index a4839c25..21c8fb59 100644
--- a/docs/device_2kernel_2tensor__foreach_8h_source.html
+++ b/docs/device_2kernel_2tensor__foreach_8h_source.html
@@ -100,7 +100,7 @@ var searchBox = new SearchBox("searchBox", "search",false,'Search');
Go to the documentation of this file. 42 template <
typename Func,
int Rank,
int RankRemaining>
52 for (
int i = Rank - RankRemaining; i < Rank; ++i) {
56 coord[Rank - 1 - RankRemaining] = index / product;
57 int64_t remaining = index % product;
64 template <
typename Func,
int Rank>
71 coord[Rank - 1] = index;
84 template <
typename Func,
int Rank,
typename Params>
89 int64_t index = threadIdx.x + blockIdx.x * blockDim.x;
90 int64_t max_index = 1;
93 for (
int i = 0; i < Rank; ++i) {
98 while (index < max_index) {
102 index += blockDim.x * gridDim.x;
109 template <
typename Func,
int Rank,
typename Params>
114 int64_t index = threadIdx.x + blockIdx.x * blockDim.x + start;
120 for (
int i = 0; i < Rank; ++i) {
130 template <
typename Element,
typename Func>
134 typename Func::Params params) {
138 size_t index = threadIdx.x + blockIdx.x * blockDim.x;
140 for (; index < capacity; index += blockDim.x * gridDim.x) {
Definition: aligned_buffer.h:35
A Coord is a coordinate of arbitrary rank into a tensor or matrix.
-__inline__ __device__ TensorForEachHelper(Func &func, Coord< Rank > const &size, Coord< Rank > &coord, int64_t index)
Constructor for fastest chaning rank.
Definition: device/kernel/tensor_foreach.h:69
+__inline__ __device__ TensorForEachHelper(Func &func, Coord< Rank > const &size, Coord< Rank > &coord, int64_t index)
Constructor for fastest changing rank.
Definition: device/kernel/tensor_foreach.h:69
#define CUTLASS_PRAGMA_UNROLL
Definition: cutlass.h:110
__global__ void BlockForEach(Element *ptr, size_t capacity, typename Func::Params params)
Definition: device/kernel/tensor_foreach.h:131
#define CUTLASS_PRAGMA_NO_UNROLL
Definition: cutlass.h:111
diff --git a/docs/device_2tensor__fill_8h.html b/docs/device_2tensor__fill_8h.html
index c2af8c44..5a99459c 100644
--- a/docs/device_2tensor__fill_8h.html
+++ b/docs/device_2tensor__fill_8h.html
@@ -237,7 +237,7 @@ Functions
|
template<typename Element , typename Layout > |
void | cutlass::reference::device::TensorFillIdentity (TensorView< Element, Layout > view) |
-| Fills a tensor's digonal with 1 and 0 everywhere else. More...
|
+| Fills a tensor's diagonal with 1 and 0 everywhere else. More...
|
|
template<typename Element , typename Layout > |
void | cutlass::reference::device::TensorUpdateDiagonal (TensorView< Element, Layout > view, Element diag=Element(1)) |
diff --git a/docs/device_2tensor__fill_8h_source.html b/docs/device_2tensor__fill_8h_source.html
index dd5debda..908ccba8 100644
--- a/docs/device_2tensor__fill_8h_source.html
+++ b/docs/device_2tensor__fill_8h_source.html
@@ -125,7 +125,7 @@ var searchBox = new SearchBox("searchBox", "search",false,'Search');
Parameters structure.
Definition: device/tensor_fill.h:99
Kind kind
Active variant kind.
Definition: distribution.h:64
-void TensorFillIdentity(TensorView< Element, Layout > view)
Fills a tensor's digonal with 1 and 0 everywhere else.
Definition: device/tensor_fill.h:630
+void TensorFillIdentity(TensorView< Element, Layout > view)
Fills a tensor's diagonal with 1 and 0 everywhere else.
Definition: device/tensor_fill.h:630
CUTLASS_HOST_DEVICE TensorCoord const & extent() const
Returns the extent of the view (the size along each logical dimension).
Definition: tensor_view.h:167
Computes a random Gaussian distribution.
Definition: device/tensor_fill.h:645
diff --git a/docs/device_2tensor__foreach_8h_source.html b/docs/device_2tensor__foreach_8h_source.html
index 90c5402e..0380fa93 100644
--- a/docs/device_2tensor__foreach_8h_source.html
+++ b/docs/device_2tensor__foreach_8h_source.html
@@ -98,7 +98,7 @@ var searchBox = new SearchBox("searchBox", "search",false,'Search');
device/tensor_foreach.h
-Go to the documentation of this file. 38 template <
typename Func,
int Rank,
typename Params>
44 if (!grid_size || !block_size) {
47 cudaError_t result = cudaOccupancyMaxPotentialBlockSize(
50 reinterpret_cast<void const *>(kernel::TensorForEach<Func, Rank, Params>));
52 if (result != cudaSuccess) {
53 throw std::runtime_error(
"Failed to query occupancy.");
58 block_size = (block_size < 128 ? block_size : 128);
61 dim3 grid(grid_size, 1, 1);
62 dim3 block(block_size, 1, 1);
64 kernel::TensorForEach<Func, Rank, Params><<< grid, block >>>(size, params);
71 template <
typename Func,
int Rank,
typename Params>
81 dim3 block(block_size, 1, 1);
82 dim3 grid((end - start + block_size - 1) / block_size, 1, 1);
84 kernel::TensorDiagonalForEach<Func, Rank, Params><<< grid, block >>>(size, params, start, end);
91 template <
typename Element,
typename Func>
98 typename Func::Params params =
typename Func::Params(),
100 int block_size = 0) {
102 if (!grid_size || !block_size) {
105 cudaError_t result = cudaOccupancyMaxPotentialBlockSize(
108 reinterpret_cast<void const *>(kernel::BlockForEach<Element, Func>));
110 if (result != cudaSuccess) {
111 throw std::runtime_error(
"Failed to query occupancy.");
116 block_size = (block_size < 128 ? block_size : 128);
119 dim3 grid(grid_size, 1, 1);
120 dim3 block(block_size, 1, 1);
122 kernel::BlockForEach<Element, Func><<< grid, block >>>(ptr, capacity, params);
Definition: aligned_buffer.h:35
+Go to the documentation of this file. 38 template <
typename Func,
int Rank,
typename Params>
44 if (!grid_size || !block_size) {
47 cudaError_t result = cudaOccupancyMaxPotentialBlockSize(
50 reinterpret_cast<void const *>(kernel::TensorForEach<Func, Rank, Params>));
52 if (result != cudaSuccess) {
53 throw std::runtime_error(
"Failed to query occupancy.");
58 block_size = (block_size < 128 ? block_size : 128);
61 dim3 grid(grid_size, 1, 1);
62 dim3 block(block_size, 1, 1);
64 kernel::TensorForEach<Func, Rank, Params><<< grid, block >>>(size, params);
71 template <
typename Func,
int Rank,
typename Params>
81 dim3 block(block_size, 1, 1);
82 dim3 grid((end - start + block_size - 1) / block_size, 1, 1);
84 kernel::TensorDiagonalForEach<Func, Rank, Params><<< grid, block >>>(size, params, start, end);
91 template <
typename Element,
typename Func>
98 typename Func::Params params =
typename Func::Params(),
100 int block_size = 0) {
102 if (!grid_size || !block_size) {
105 cudaError_t result = cudaOccupancyMaxPotentialBlockSize(
108 reinterpret_cast<void const *>(kernel::BlockForEach<Element, Func>));
110 if (result != cudaSuccess) {
111 throw std::runtime_error(
"Failed to query occupancy.");
116 block_size = (block_size < 128 ? block_size : 128);
119 dim3 grid(grid_size, 1, 1);
120 dim3 block(block_size, 1, 1);
122 kernel::BlockForEach<Element, Func><<< grid, block >>>(ptr, capacity, params);
Definition: aligned_buffer.h:35
TensorDiagonalForEach(Coord< Rank > size, Params params=Params(), int start=0, int end=-1, int block_size=128)
Constructor performs the operation.
Definition: device/tensor_foreach.h:75
TensorForEach(Coord< Rank > size, Params params=Params(), int grid_size=0, int block_size=0)
Constructor performs the operation.
Definition: device/tensor_foreach.h:42
Launches a kernel calling a functor for each element along a tensor's diagonal.
Definition: device/tensor_foreach.h:72
diff --git a/docs/functions_func_s.html b/docs/functions_func_s.html
index 78581310..548b6d97 100644
--- a/docs/functions_func_s.html
+++ b/docs/functions_func_s.html
@@ -141,7 +141,7 @@ var searchBox = new SearchBox("searchBox", "search",false,'Search');
Semaphore()
: cutlass::Semaphore
-seperate_string()
+separate_string()
: cutlass::CommandLine
set()
diff --git a/docs/functions_s.html b/docs/functions_s.html
index 72c08b05..4b1ec8e0 100644
--- a/docs/functions_s.html
+++ b/docs/functions_s.html
@@ -172,7 +172,7 @@ var searchBox = new SearchBox("searchBox", "search",false,'Search');
Semaphore()
: cutlass::Semaphore
-seperate_string()
+separate_string()
: cutlass::CommandLine
sequential
diff --git a/docs/hierarchy.html b/docs/hierarchy.html
index 89ef0802..c9a5c204 100644
--- a/docs/hierarchy.html
+++ b/docs/hierarchy.html
@@ -312,23 +312,23 @@ This inheritance list is sorted roughly, but not completely, alphabetically: Ccutlass::reference::host::Gemm< ElementA, LayoutA, ElementB, LayoutB, ElementC, LayoutC, ScalarType, ComputeType, InnerProductOp > | |
Ccutlass::reference::device::Gemm< ElementA, LayoutA, ElementB, LayoutB, ElementC, LayoutC, ScalarType, AccumulatorType, arch::OpMultiplyAdd > | Partial specialization for multiply-add |
Ccutlass::reference::device::Gemm< ElementA, LayoutA, ElementB, LayoutB, ElementC, LayoutC, ScalarType, AccumulatorType, arch::OpMultiplyAddSaturate > | Partial specialization for multiply-add-saturate |
- Ccutlass::reference::device::Gemm< ElementA, LayoutA, ElementB, LayoutB, ElementC, LayoutC, ScalarType, AccumulatorType, arch::OpXorPopc > | Parital specialization for XOR-popc |
+ Ccutlass::reference::device::Gemm< ElementA, LayoutA, ElementB, LayoutB, ElementC, LayoutC, ScalarType, AccumulatorType, arch::OpXorPopc > | Partial specialization for XOR-popc |
Ccutlass::reference::host::Gemm< ElementA, LayoutA, ElementB, LayoutB, ElementC, LayoutC, ScalarType, ComputeType, arch::OpMultiplyAdd > | Partial specialization for multiply-add |
Ccutlass::reference::host::Gemm< ElementA, LayoutA, ElementB, LayoutB, ElementC, LayoutC, ScalarType, ComputeType, arch::OpMultiplyAddSaturate > | Partial specialization for multiply-add-saturate |
- Ccutlass::reference::host::Gemm< ElementA, LayoutA, ElementB, LayoutB, ElementC, LayoutC, ScalarType, ComputeType, arch::OpXorPopc > | Parital specialization for XOR-popc |
- Ccutlass::gemm::device::Gemm< ElementA_, LayoutA_, ElementB_, LayoutB_, ElementC_, layout::ColumnMajor, ElementAccumulator_, OperatorClass_, ArchTag_, ThreadblockShape_, WarpShape_, InstructionShape_, EpilogueOutputOp_, ThreadblockSwizzle_, Stages, AlignmentA, AlignmentB, SplitKSerial, Operator_, IsBetaZero > | Parital specialization for column-major output exchanges problem size and operand |
+ Ccutlass::reference::host::Gemm< ElementA, LayoutA, ElementB, LayoutB, ElementC, LayoutC, ScalarType, ComputeType, arch::OpXorPopc > | Partial specialization for XOR-popc |
+ Ccutlass::gemm::device::Gemm< ElementA_, LayoutA_, ElementB_, LayoutB_, ElementC_, layout::ColumnMajor, ElementAccumulator_, OperatorClass_, ArchTag_, ThreadblockShape_, WarpShape_, InstructionShape_, EpilogueOutputOp_, ThreadblockSwizzle_, Stages, AlignmentA, AlignmentB, SplitKSerial, Operator_, IsBetaZero > | Partial specialization for column-major output exchanges problem size and operand |
Ccutlass::gemm::device::Gemm< ElementB, typename layout::LayoutTranspose< LayoutB >::type, ElementA, typename layout::LayoutTranspose< LayoutA >::type, ElementC, layout::RowMajor, ElementAccumulator, OperatorClass, ArchTag, ThreadblockShape, WarpShape, InstructionShape, EpilogueOutputOp, ThreadblockSwizzle, Stages, kAlignmentB, kAlignmentA, SplitKSerial, Operator, kIsBetaZero > | |
Ccutlass::library::GemmArguments | Arguments for GEMM |
Ccutlass::library::GemmArrayArguments | Arguments for GEMM - used by all the GEMM operations |
Ccutlass::library::GemmArrayConfiguration | Configuration for batched GEMM in which multiple matrix products are computed |
Ccutlass::gemm::device::GemmBatched< ElementA_, LayoutA_, ElementB_, LayoutB_, ElementC_, LayoutC_, ElementAccumulator_, OperatorClass_, ArchTag_, ThreadblockShape_, WarpShape_, InstructionShape_, EpilogueOutputOp_, ThreadblockSwizzle_, Stages, AlignmentA, AlignmentB, Operator_ > | |
Ccutlass::gemm::kernel::GemmBatched< Mma_, Epilogue_, ThreadblockSwizzle_ > | |
- Ccutlass::gemm::device::GemmBatched< ElementA_, LayoutA_, ElementB_, LayoutB_, ElementC_, layout::ColumnMajor, ElementAccumulator_, OperatorClass_, ArchTag_, ThreadblockShape_, WarpShape_, InstructionShape_, EpilogueOutputOp_, ThreadblockSwizzle_, Stages, AlignmentA, AlignmentB, Operator_ > | Parital specialization for column-major output exchanges problem size and operand |
+ Ccutlass::gemm::device::GemmBatched< ElementA_, LayoutA_, ElementB_, LayoutB_, ElementC_, layout::ColumnMajor, ElementAccumulator_, OperatorClass_, ArchTag_, ThreadblockShape_, WarpShape_, InstructionShape_, EpilogueOutputOp_, ThreadblockSwizzle_, Stages, AlignmentA, AlignmentB, Operator_ > | Partial specialization for column-major output exchanges problem size and operand |
Ccutlass::gemm::device::GemmBatched< ElementB, typename layout::LayoutTranspose< LayoutB >::type, ElementA, typename layout::LayoutTranspose< LayoutA >::type, ElementC, layout::RowMajor, ElementAccumulator, OperatorClass, ArchTag, ThreadblockShape, WarpShape, InstructionShape, EpilogueOutputOp, ThreadblockSwizzle, Stages, kAlignmentB, kAlignmentA > | |
Ccutlass::library::GemmBatchedConfiguration | Configuration for batched GEMM in which multiple matrix products are computed |
Ccutlass::gemm::threadblock::GemmBatchedIdentityThreadblockSwizzle | Threadblock swizzling function for batched GEMMs |
Ccutlass::gemm::device::GemmComplex< ElementA_, LayoutA_, ElementB_, LayoutB_, ElementC_, LayoutC_, ElementAccumulator_, OperatorClass_, ArchTag_, ThreadblockShape_, WarpShape_, InstructionShape_, EpilogueOutputOp_, ThreadblockSwizzle_, Stages, TransformA, TransformB, SplitKSerial > | |
- Ccutlass::gemm::device::GemmComplex< ElementA_, LayoutA_, ElementB_, LayoutB_, ElementC_, layout::ColumnMajor, ElementAccumulator_, OperatorClass_, ArchTag_, ThreadblockShape_, WarpShape_, InstructionShape_, EpilogueOutputOp_, ThreadblockSwizzle_, Stages, TransformA, TransformB, SplitKSerial > | Parital specialization for column-major output exchanges problem size and operand |
+ Ccutlass::gemm::device::GemmComplex< ElementA_, LayoutA_, ElementB_, LayoutB_, ElementC_, layout::ColumnMajor, ElementAccumulator_, OperatorClass_, ArchTag_, ThreadblockShape_, WarpShape_, InstructionShape_, EpilogueOutputOp_, ThreadblockSwizzle_, Stages, TransformA, TransformB, SplitKSerial > | Partial specialization for column-major output exchanges problem size and operand |
Ccutlass::gemm::device::GemmComplex< ElementB, typename layout::LayoutTranspose< LayoutB >::type, ElementA, typename layout::LayoutTranspose< LayoutA >::type, ElementC, layout::RowMajor, ElementAccumulator, OperatorClass, ArchTag, ThreadblockShape, WarpShape, InstructionShape, EpilogueOutputOp, ThreadblockSwizzle, Stages, TransformA, TransformB, SplitKSerial > | |
Ccutlass::library::GemmConfiguration | Configuration for basic GEMM operations |
Ccutlass::gemm::threadblock::GemmHorizontalThreadblockSwizzle | Threadblock swizzling function for GEMMs |
diff --git a/docs/host_2tensor__fill_8h.html b/docs/host_2tensor__fill_8h.html
index cc013443..0bc5fab0 100644
--- a/docs/host_2tensor__fill_8h.html
+++ b/docs/host_2tensor__fill_8h.html
@@ -192,7 +192,7 @@ Functions
|
template<typename Element , typename Layout > |
void | cutlass::reference::host::TensorFillIdentity (TensorView< Element, Layout > dst) |
- | Helper to fill a tensor's digonal with 1 and 0 everywhere else. More...
|
+ | Helper to fill a tensor's diagonal with 1 and 0 everywhere else. More...
|
|
template<typename Element , typename Layout > |
void | cutlass::reference::host::TensorUpdateDiagonal (TensorView< Element, Layout > dst, Element val=Element(1)) |
diff --git a/docs/host_2tensor__fill_8h_source.html b/docs/host_2tensor__fill_8h_source.html
index d90cc56a..12919ee0 100644
--- a/docs/host_2tensor__fill_8h_source.html
+++ b/docs/host_2tensor__fill_8h_source.html
@@ -132,7 +132,7 @@ var searchBox = new SearchBox("searchBox", "search",false,'Search');
int int_scale
Definition: host/tensor_fill.h:115
< Layout function
Definition: host/tensor_fill.h:597
-void TensorFillIdentity(TensorView< Element, Layout > dst)
Helper to fill a tensor's digonal with 1 and 0 everywhere else.
Definition: host/tensor_fill.h:564
+void TensorFillIdentity(TensorView< Element, Layout > dst)
Helper to fill a tensor's diagonal with 1 and 0 everywhere else.
Definition: host/tensor_fill.h:564
CUTLASS_HOST_DEVICE complex< T > log(complex< T > const &z)
Computes the complex logarithm of z.
Definition: complex.h:381
void operator()(Coord< Layout::kRank > const &coord) const
Compute random value and update RNG state.
Definition: host/tensor_fill.h:236
diff --git a/docs/host_2tensor__foreach_8h_source.html b/docs/host_2tensor__foreach_8h_source.html
index 34c54721..4dee3539 100644
--- a/docs/host_2tensor__foreach_8h_source.html
+++ b/docs/host_2tensor__foreach_8h_source.html
@@ -104,7 +104,7 @@ var searchBox = new SearchBox("searchBox", "search",false,'Search');
static int const kActiveRank
Index of the active rank.
Definition: host/tensor_foreach.h:44
TensorForEachHelper(Func &func, Coord< Rank > const &extent, Coord< Rank > &coord)
Constructor for general rank.
Definition: host/tensor_foreach.h:47
Helper to perform for-each operation.
Definition: host/tensor_foreach.h:41
-TensorForEachHelper(Func &func, Coord< Rank > const &extent, Coord< Rank > &coord)
Constructor for fastest chaning rank.
Definition: host/tensor_foreach.h:67
+TensorForEachHelper(Func &func, Coord< Rank > const &extent, Coord< Rank > &coord)
Constructor for fastest changing rank.
Definition: host/tensor_foreach.h:67
Statically-sized array specifying Coords within a tensor.
Definition: coord.h:43
void TensorForEachLambda(Coord< Rank > extent, Func func)
Iterates over the index space of a tensor and calls a C++ lambda.
Definition: host/tensor_foreach.h:98
void TensorForEach(Coord< Rank > extent, Func &func)
Iterates over the index space of a tensor.
Definition: host/tensor_foreach.h:87
diff --git a/docs/include_2cutlass_2gemm_2device_2gemm_8h.html b/docs/include_2cutlass_2gemm_2device_2gemm_8h.html
index f1e5b09a..c3d18abc 100644
--- a/docs/include_2cutlass_2gemm_2device_2gemm_8h.html
+++ b/docs/include_2cutlass_2gemm_2device_2gemm_8h.html
@@ -130,7 +130,7 @@ Classes
| Argument structure. More...
|
|
class | cutlass::gemm::device::Gemm< ElementA_, LayoutA_, ElementB_, LayoutB_, ElementC_, layout::ColumnMajor, ElementAccumulator_, OperatorClass_, ArchTag_, ThreadblockShape_, WarpShape_, InstructionShape_, EpilogueOutputOp_, ThreadblockSwizzle_, Stages, AlignmentA, AlignmentB, SplitKSerial, Operator_, IsBetaZero > |
- | Parital specialization for column-major output exchanges problem size and operand. More...
|
+ | Partial specialization for column-major output exchanges problem size and operand. More...
|
|
struct | cutlass::gemm::device::Gemm< ElementA_, LayoutA_, ElementB_, LayoutB_, ElementC_, layout::ColumnMajor, ElementAccumulator_, OperatorClass_, ArchTag_, ThreadblockShape_, WarpShape_, InstructionShape_, EpilogueOutputOp_, ThreadblockSwizzle_, Stages, AlignmentA, AlignmentB, SplitKSerial, Operator_, IsBetaZero >::Arguments |
| Argument structure. More...
|
diff --git a/docs/include_2cutlass_2gemm_2device_2gemm__complex_8h.html b/docs/include_2cutlass_2gemm_2device_2gemm__complex_8h.html
index 7f21249e..fffdd34a 100644
--- a/docs/include_2cutlass_2gemm_2device_2gemm__complex_8h.html
+++ b/docs/include_2cutlass_2gemm_2device_2gemm__complex_8h.html
@@ -130,7 +130,7 @@ Classes
| Argument structure. More...
|
|
class | cutlass::gemm::device::GemmComplex< ElementA_, LayoutA_, ElementB_, LayoutB_, ElementC_, layout::ColumnMajor, ElementAccumulator_, OperatorClass_, ArchTag_, ThreadblockShape_, WarpShape_, InstructionShape_, EpilogueOutputOp_, ThreadblockSwizzle_, Stages, TransformA, TransformB, SplitKSerial > |
- | Parital specialization for column-major output exchanges problem size and operand. More...
|
+ | Partial specialization for column-major output exchanges problem size and operand. More...
|
|
struct | cutlass::gemm::device::GemmComplex< ElementA_, LayoutA_, ElementB_, LayoutB_, ElementC_, layout::ColumnMajor, ElementAccumulator_, OperatorClass_, ArchTag_, ThreadblockShape_, WarpShape_, InstructionShape_, EpilogueOutputOp_, ThreadblockSwizzle_, Stages, TransformA, TransformB, SplitKSerial >::Arguments |
| Argument structure. More...
|
diff --git a/docs/mma__pipelined_8h_source.html b/docs/mma__pipelined_8h_source.html
index d4952ffb..08ea72e9 100644
--- a/docs/mma__pipelined_8h_source.html
+++ b/docs/mma__pipelined_8h_source.html
@@ -98,7 +98,7 @@ var searchBox = new SearchBox("searchBox", "search",false,'Search');
mma_pipelined.h
-
Go to the documentation of this file. 46 namespace threadblock {
59 typename SmemIteratorA_,
65 typename SmemIteratorB_,
73 typename TransformA_ = NumericArrayConverter<
74 typename SmemIteratorA_::Element,
75 typename IteratorA_::Element,
76 IteratorA_::Fragment::kElements>,
79 typename TransformB_ = NumericArrayConverter<
80 typename SmemIteratorB_::Element,
81 typename IteratorB_::Element,
82 IteratorB_::Fragment::kElements>,
84 typename Enable =
bool 126 using WarpFragmentA =
typename Operator::FragmentA;
127 using WarpFragmentB =
typename Operator::FragmentB;
142 typename Base::SharedStorage &shared_storage,
147 Base(shared_storage, thread_idx, warp_idx, lane_idx),
148 smem_iterator_A_(shared_storage.operand_A_ref(), thread_idx),
149 smem_iterator_B_(shared_storage.operand_B_ref(), thread_idx) {
171 int gemm_k_iterations,
193 iterator_A.load(tb_frag_A);
194 iterator_B.load(tb_frag_B);
199 this->smem_iterator_A_.store(transform_A(tb_frag_A));
200 this->smem_iterator_B_.store(transform_B(tb_frag_B));
208 WarpFragmentA warp_frag_A[2];
209 WarpFragmentB warp_frag_B[2];
222 int smem_write_stage_idx = 1;
225 if (gemm_k_iterations <= 1) {
226 iterator_A.clear_mask();
227 iterator_B.clear_mask();
239 for (; gemm_k_iterations > 0; --gemm_k_iterations) {
250 if (warp_mma_k == Base::kWarpGemmIterations - 1) {
253 this->smem_iterator_A_.store(transform_A(tb_frag_A));
255 this->smem_iterator_B_.store(transform_B(tb_frag_B));
263 if (smem_write_stage_idx == 1) {
269 {0, -
Base::kStages * Policy::kPartitionsK * Base::kWarpGemmIterations});
275 smem_write_stage_idx ^= 1;
287 if (warp_mma_k == 0) {
289 iterator_A.load(tb_frag_A);
290 iterator_B.load(tb_frag_B);
296 if (gemm_k_iterations <= 2) {
297 iterator_A.clear_mask();
298 iterator_B.clear_mask();
302 warp_mma(accum, warp_frag_A[warp_mma_k % 2], warp_frag_B[warp_mma_k % 2], accum);
static int const kM
Definition: include/cutlass/gemm/gemm.h:58
+
Go to the documentation of this file. 46 namespace threadblock {
59 typename SmemIteratorA_,
65 typename SmemIteratorB_,
73 typename TransformA_ = NumericArrayConverter<
74 typename SmemIteratorA_::Element,
75 typename IteratorA_::Element,
76 IteratorA_::Fragment::kElements>,
79 typename TransformB_ = NumericArrayConverter<
80 typename SmemIteratorB_::Element,
81 typename IteratorB_::Element,
82 IteratorB_::Fragment::kElements>,
84 typename Enable =
bool 126 using WarpFragmentA =
typename Operator::FragmentA;
127 using WarpFragmentB =
typename Operator::FragmentB;
142 typename Base::SharedStorage &shared_storage,
147 Base(shared_storage, thread_idx, warp_idx, lane_idx),
148 smem_iterator_A_(shared_storage.operand_A_ref(), thread_idx),
149 smem_iterator_B_(shared_storage.operand_B_ref(), thread_idx) {
171 int gemm_k_iterations,
193 iterator_A.load(tb_frag_A);
194 iterator_B.load(tb_frag_B);
199 this->smem_iterator_A_.store(transform_A(tb_frag_A));
200 this->smem_iterator_B_.store(transform_B(tb_frag_B));
208 WarpFragmentA warp_frag_A[2];
209 WarpFragmentB warp_frag_B[2];
222 int smem_write_stage_idx = 1;
225 if (gemm_k_iterations <= 1) {
226 iterator_A.clear_mask();
227 iterator_B.clear_mask();
239 for (; gemm_k_iterations > 0; --gemm_k_iterations) {
250 if (warp_mma_k == Base::kWarpGemmIterations - 1) {
253 this->smem_iterator_A_.store(transform_A(tb_frag_A));
255 this->smem_iterator_B_.store(transform_B(tb_frag_B));
263 if (smem_write_stage_idx == 1) {
269 {0, -
Base::kStages * Policy::kPartitionsK * Base::kWarpGemmIterations});
275 smem_write_stage_idx ^= 1;
287 if (warp_mma_k == 0) {
289 iterator_A.load(tb_frag_A);
290 iterator_B.load(tb_frag_B);
296 if (gemm_k_iterations <= 2) {
297 iterator_A.clear_mask();
298 iterator_B.clear_mask();
302 warp_mma(accum, warp_frag_A[warp_mma_k % 2], warp_frag_B[warp_mma_k % 2], accum);
static int const kM
Definition: include/cutlass/gemm/gemm.h:58
LayoutC_ LayoutC
Layout of accumulator matrix.
Definition: mma_pipelined.h:96
TransformB_ TransformB
Definition: mma_pipelined.h:103
Definition: aligned_buffer.h:35
diff --git a/docs/namespacecutlass_1_1gemm_1_1device.html b/docs/namespacecutlass_1_1gemm_1_1device.html
index 7023f4f8..c0b27cbe 100644
--- a/docs/namespacecutlass_1_1gemm_1_1device.html
+++ b/docs/namespacecutlass_1_1gemm_1_1device.html
@@ -134,17 +134,17 @@ Classes
class | Gemm |
|
class | Gemm< ElementA_, LayoutA_, ElementB_, LayoutB_, ElementC_, layout::ColumnMajor, ElementAccumulator_, OperatorClass_, ArchTag_, ThreadblockShape_, WarpShape_, InstructionShape_, EpilogueOutputOp_, ThreadblockSwizzle_, Stages, AlignmentA, AlignmentB, SplitKSerial, Operator_, IsBetaZero > |
-
| Parital specialization for column-major output exchanges problem size and operand. More...
|
+
| Partial specialization for column-major output exchanges problem size and operand. More...
|
|
class | GemmBatched |
|
class | GemmBatched< ElementA_, LayoutA_, ElementB_, LayoutB_, ElementC_, layout::ColumnMajor, ElementAccumulator_, OperatorClass_, ArchTag_, ThreadblockShape_, WarpShape_, InstructionShape_, EpilogueOutputOp_, ThreadblockSwizzle_, Stages, AlignmentA, AlignmentB, Operator_ > |
-
| Parital specialization for column-major output exchanges problem size and operand. More...
|
+
| Partial specialization for column-major output exchanges problem size and operand. More...
|
|
class | GemmComplex |
|
class | GemmComplex< ElementA_, LayoutA_, ElementB_, LayoutB_, ElementC_, layout::ColumnMajor, ElementAccumulator_, OperatorClass_, ArchTag_, ThreadblockShape_, WarpShape_, InstructionShape_, EpilogueOutputOp_, ThreadblockSwizzle_, Stages, TransformA, TransformB, SplitKSerial > |
-
| Parital specialization for column-major output exchanges problem size and operand. More...
|
+
| Partial specialization for column-major output exchanges problem size and operand. More...
|
|
class | GemmSplitKParallel |
|
diff --git a/docs/namespacecutlass_1_1reference_1_1device.html b/docs/namespacecutlass_1_1reference_1_1device.html
index 86f21a00..54f5009f 100644
--- a/docs/namespacecutlass_1_1reference_1_1device.html
+++ b/docs/namespacecutlass_1_1reference_1_1device.html
@@ -125,7 +125,7 @@ Classes
| Partial specialization for multiply-add-saturate. More...
|
|
struct | Gemm< ElementA, LayoutA, ElementB, LayoutB, ElementC, LayoutC, ScalarType, AccumulatorType, arch::OpXorPopc > |
-
| Parital specialization for XOR-popc. More...
|
+
| Partial specialization for XOR-popc. More...
|
|
struct | TensorDiagonalForEach |
| Launches a kernel calling a functor for each element along a tensor's diagonal. More...
|
@@ -183,7 +183,7 @@ Functions
|
template<typename Element , typename Layout > |
void | TensorFillIdentity (TensorView< Element, Layout > view) |
-
| Fills a tensor's digonal with 1 and 0 everywhere else. More...
|
+
| Fills a tensor's diagonal with 1 and 0 everywhere else. More...
|
|
template<typename Element , typename Layout > |
void | TensorUpdateDiagonal (TensorView< Element, Layout > view, Element diag=Element(1)) |
diff --git a/docs/namespacecutlass_1_1reference_1_1host.html b/docs/namespacecutlass_1_1reference_1_1host.html
index 90f9a014..6d07d1f2 100644
--- a/docs/namespacecutlass_1_1reference_1_1host.html
+++ b/docs/namespacecutlass_1_1reference_1_1host.html
@@ -122,7 +122,7 @@ Classes
| Partial specialization for multiply-add-saturate. More...
|
|
struct | Gemm< ElementA, LayoutA, ElementB, LayoutB, ElementC, LayoutC, ScalarType, ComputeType, arch::OpXorPopc > |
-
| Parital specialization for XOR-popc. More...
|
+
| Partial specialization for XOR-popc. More...
|
|
|
|
template<typename Element , typename Layout > |
void | TensorFillIdentity (TensorView< Element, Layout > dst) |
- | Helper to fill a tensor's digonal with 1 and 0 everywhere else. More...
|
+ | Helper to fill a tensor's diagonal with 1 and 0 everywhere else. More...
|
|
template<typename Element , typename Layout > |
void | TensorUpdateDiagonal (TensorView< Element, Layout > dst, Element val=Element(1)) |
diff --git a/docs/search/all_12.js b/docs/search/all_12.js
index 0e091040..c9f8a45c 100644
--- a/docs/search/all_12.js
+++ b/docs/search/all_12.js
@@ -14,7 +14,7 @@ var searchData=
['semaphore',['Semaphore',['../classcutlass_1_1Semaphore.html',1,'cutlass']]],
['semaphore',['Semaphore',['../classcutlass_1_1Semaphore.html#a2ce4cd07fe773efa429f726cfbd98070',1,'cutlass::Semaphore::Semaphore()'],['../structcutlass_1_1gemm_1_1kernel_1_1Gemm_1_1Params.html#adec6d0c6d74e7f456196f453e302fbbb',1,'cutlass::gemm::kernel::Gemm::Params::semaphore()']]],
['semaphore_2eh',['semaphore.h',['../semaphore_8h.html',1,'']]],
- ['seperate_5fstring',['seperate_string',['../structcutlass_1_1CommandLine.html#a5f86e4b2bd8c44b739c83530d77c5590',1,'cutlass::CommandLine']]],
+ ['separate_5fstring',['separate_string',['../structcutlass_1_1CommandLine.html#a5f86e4b2bd8c44b739c83530d77c5590',1,'cutlass::CommandLine']]],
['sequential',['sequential',['../structcutlass_1_1Distribution.html#ab86d975567ef141ff82067b1f41cd3ee',1,'cutlass::Distribution::sequential()'],['../structcutlass_1_1Distribution.html#a499f4023e0d42356ce71d38cc32bf92aa39d3cf55e90573c8d1dfb483cfb410dc',1,'cutlass::Distribution::Sequential()']]],
['set',['set',['../classcutlass_1_1PredicateVector_1_1Iterator.html#aadfd039b5622098c9e46706a27122575',1,'cutlass::PredicateVector::Iterator::set()'],['../structcutlass_1_1PredicateVector.html#a062fa8a8df725ef08ced2ffcca8336af',1,'cutlass::PredicateVector::set()'],['../classcutlass_1_1SubbyteReference.html#a6473e57520d8ee7afbd95c1e1641e05a',1,'cutlass::SubbyteReference::set()']]],
['set_5fgaussian',['set_gaussian',['../structcutlass_1_1Distribution.html#ad594b5ec1d577e8ef03d4d808a8220b1',1,'cutlass::Distribution']]],
diff --git a/docs/search/functions_12.js b/docs/search/functions_12.js
index f2b3bff9..6648b431 100644
--- a/docs/search/functions_12.js
+++ b/docs/search/functions_12.js
@@ -3,7 +3,7 @@ var searchData=
['scalar_5fop',['scalar_op',['../structcutlass_1_1minimum_3_01Array_3_01T_00_01N_01_4_01_4.html#a4b42227184cb7c796460062c46a84b57',1,'cutlass::minimum< Array< T, N > >']]],
['scalario',['ScalarIO',['../structcutlass_1_1ScalarIO.html#ad4166575521254088bf6c6300c351714',1,'cutlass::ScalarIO::ScalarIO()'],['../structcutlass_1_1ScalarIO.html#a5227e1e9ed24326ad4f8dc94d186186f',1,'cutlass::ScalarIO::ScalarIO(T value)']]],
['semaphore',['Semaphore',['../classcutlass_1_1Semaphore.html#a2ce4cd07fe773efa429f726cfbd98070',1,'cutlass::Semaphore']]],
- ['seperate_5fstring',['seperate_string',['../structcutlass_1_1CommandLine.html#a5f86e4b2bd8c44b739c83530d77c5590',1,'cutlass::CommandLine']]],
+ ['separate_5fstring',['separate_string',['../structcutlass_1_1CommandLine.html#a5f86e4b2bd8c44b739c83530d77c5590',1,'cutlass::CommandLine']]],
['set',['set',['../classcutlass_1_1PredicateVector_1_1Iterator.html#aadfd039b5622098c9e46706a27122575',1,'cutlass::PredicateVector::Iterator::set()'],['../structcutlass_1_1PredicateVector.html#a062fa8a8df725ef08ced2ffcca8336af',1,'cutlass::PredicateVector::set()'],['../classcutlass_1_1SubbyteReference.html#a6473e57520d8ee7afbd95c1e1641e05a',1,'cutlass::SubbyteReference::set()']]],
['set_5fgaussian',['set_gaussian',['../structcutlass_1_1Distribution.html#ad594b5ec1d577e8ef03d4d808a8220b1',1,'cutlass::Distribution']]],
['set_5fidentity',['set_identity',['../structcutlass_1_1Distribution.html#aad2cf02af3d520544d89843cc4295858',1,'cutlass::Distribution']]],
diff --git a/docs/structcutlass_1_1CommandLine-members.html b/docs/structcutlass_1_1CommandLine-members.html
index 77668951..6a17b2f0 100644
--- a/docs/structcutlass_1_1CommandLine-members.html
+++ b/docs/structcutlass_1_1CommandLine-members.html
@@ -115,7 +115,7 @@ var searchBox = new SearchBox("searchBox", "search",false,'Search');
keys | cutlass::CommandLine | |
num_naked_args() const | cutlass::CommandLine | inline |
parsed_argc() const | cutlass::CommandLine | inline |
- seperate_string(std::string const &str, std::vector< value_t > &vals, char sep= ',') | cutlass::CommandLine | inlinestatic |
+ separate_string(std::string const &str, std::vector< value_t > &vals, char sep= ',') | cutlass::CommandLine | inlinestatic |
tokenize(std::vector< std::pair< std::string, std::string > > &tokens, std::string const &str, char delim= ',', char sep= ':') | cutlass::CommandLine | inlinestatic |
tokenize(std::vector< std::string > &tokens, std::string const &str, char delim= ',', char sep= ':') | cutlass::CommandLine | inlinestatic |
values | cutlass::CommandLine | |
diff --git a/docs/structcutlass_1_1CommandLine.html b/docs/structcutlass_1_1CommandLine.html
index 0bde0ec4..01cd35a2 100644
--- a/docs/structcutlass_1_1CommandLine.html
+++ b/docs/structcutlass_1_1CommandLine.html
@@ -151,7 +151,7 @@ Static Public Member Functions
| Tokenizes a comma-delimited list of string pairs delimited by ':'. More...
|
|
template<typename value_t > |
-static void | seperate_string (std::string const &str, std::vector< value_t > &vals, char sep= ',') |
+static void | separate_string (std::string const &str, std::vector< value_t > &vals, char sep= ',') |
|