Commit Graph

330 Commits

Author SHA1 Message Date
Haicheng Wu
8f5c242426
Update dual_gemm_common.h
fix the copyright of a new file.
2023-02-13 15:35:33 -05:00
Adnan Akhundov
3c995c7606
Extend DualGemm: support batched mode + decouple B0/B1 layouts (#790)
* Fix MHA kernel

Summary:

ATT

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:

* Extend DualGemm to support batched mode (#5)

Following the GemmUniversalMode::kBatched implementation, batched mode is added to the DualGemm (under examples/45_dual_gemm). DualGemmMode::kBatched and SplitKSerial are not compatible: Status::kErrorInvalidProblem is returned if both are set.

* Decouple LayoutB0 and LayoutB1 in DualGemm

The DualGemm template assumed the same layout, LayoutB, for both right operand matrices B0 and B1. This is problematic if the layout of the two matrices is different. In particular, this may be the case when one of the matrices is row-major, while the other is a (column) vector that has to be broadcasted in column-major with zero stride (e.g., as {B1.device_data(), 0}) for the DualGemm implementation to be able to process B0 and B1 simultaneously.

In this commit, LayoutB0 and LayoutB1 are decoupled throughout the DualGemm code (device, kernel, and mma). Additionally, the batch strides of B0 and B1 are also decoupled to accommodate the column vector B1 case described above.

* Remove comment as no longer relevant

* Revert Fix MHA kernel

---------

Co-authored-by: mikeiovine <mikeiovine@fb.com>
2023-02-13 15:27:13 -05:00
Shuai Shao
ce8597dc14
Fix type bug in conv2d/gemm with broadcast (#796)
add ElementVector

---------

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-02-09 20:53:25 -05:00
dan_the_3rd
2e10404d26
xFormer updates to fMHA FW (#773)
* xFormer updates to fMHA FW

* Convert format to BMHK for '41_fused_multi_head_attention_fixed_seqlen'

* Add missing files

* Remove xFormers specific code

* Update fused_multihead_attention_fixed_seqlen.cu

* rebase and solve conflicts

* remove white space

---------

Co-authored-by: danthe3rd <danthe3rd>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-02-08 23:00:10 -05:00
Jack Kosaian
5ff5209ed5
Add acc2smem in epilogue/threadblock/epilogue.h (#806) 2023-02-06 22:04:16 -05:00
Jack Kosaian
5921043981
Re-enable all alignments for int accumulators (#807) 2023-02-06 22:01:15 -05:00
Mark Hoemmen
add4ba622f
Fix 8.4 + CUDA 11.4 build (#789)
Work around a likely GCC 8.x issue with fold expressions
and generic lambdas.

Only use the work-around when the host compiler is GCC 8.x.
This avoids any concerns about the work-around possibly
hindering inlining for a critical CuTe function (product).

Users can experiment with the work-around for other compilers
or compiler versions by defining the following macro.

CUTE_FOLD_GENERIC_LAMBDA_WORKAROUND

Fixes https://github.com/NVIDIA/cutlass/issues/788

Co-authored-by: Mark Hoemmen <mhoemmen@nvidia.com>
2023-01-27 09:18:59 -05:00
Vijay Thakkar
277bd6e537
CUTLASS 3.0.0 (#786)
* CUTLASS 3.0.0
2023-01-23 20:55:28 -05:00
ANIKET SHIVAM
66d9cddc83
New updates for 2.11 (#775)
* New updates.

* Minor profiler updates

Co-authored-by: Aniket Shivam <ashivam@nvidia.com>
2023-01-20 16:32:57 -05:00
psaab
d49bef88f9
Enable aarch64 support (#779) 2023-01-20 15:51:58 -05:00
Haicheng Wu
8b42e751c6
streamk paper link (#765)
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-01-10 22:10:43 -05:00
Muhammad Osama
eb7f99d3dd
@hwu36 Adding the individual arXiv link for Stream-K paper. (#764)
* Stream-K individual paper entry.

* arXiv links updated.
2023-01-10 20:39:06 -05:00
Haicheng Wu
764b840d6f
streamk example and performance tuning (#760)
* streamk example and performance tuning

* one missing file

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-01-10 16:10:02 -05:00
Ali Hassani
a1046d49c1
Adds missing semicolon (#759) 2023-01-09 21:50:46 -05:00
Haicheng Wu
1cd994b4cf
Update PUBLICATIONS.md
@neoblizz @dumerrill 

thesis covering streamk
2023-01-08 00:42:19 -05:00
Gregory Meyer (gregjm)
7bdba07310
Add definitions for tag structs. (#752)
This commit changes the declarations of MMA operator class (SIMT, Tensor Core, WMMA Tensor Core) and operator type (multiply-add and so on) to definitions. This is done so that these tag structs are no longer incomplete types, which allows the `typeid` operator to be used on these tag structs. This is necessary for these tag structs to be used as type parameters in [GoogleTest typed tests](https://google.github.io/googletest/advanced.html#typed-tests).
2023-01-06 09:46:52 -05:00
Gregory Meyer (gregjm)
c54ede3a9e
Add const overloads for iterator functions. (#753)
This commit adds `const`-correct overloads for `Array::{begin,end,rbegin,rend}`. These overloads are necessary for usage with [the GMock Container Matchers](http://google.github.io/googletest/reference/matchers.html#container-matchers), which cast the `Container` argument to a constant reference.
2023-01-06 09:46:34 -05:00
Haicheng Wu
ff6e733fe1
restore the old epilogue for everything except streamk (#749)
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2023-01-04 11:02:55 -05:00
Haicheng Wu
5989b7e1d7
Update PUBLICATIONS.md
Add coconet paper to the publication list.  @abhijangda
2023-01-04 09:18:38 -05:00
Haicheng Wu
1e64f153b3
improve streamk load balance (#743)
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2022-12-25 13:56:33 -05:00
Matthew Nicely
78b30d3191
Update README.md 2022-12-21 11:58:19 -05:00
Matthew Nicely
59de82688b
Update README.md 2022-12-21 11:57:55 -05:00
Gregory Meyer (gregjm)
b85865d1ad
Add missing #include directives (#741)
This commit adds two `#include` directives so that the definitions of `cutlass::gemm::warp::WarpSize` from "cutlass/gemm/warp/mma.h" and `cutlass::arch::OpClassSimt` from "cutlass/arch/mma.h" are visible to "cutlass/epilogue/threadblock/default_epilogue_simt.h". Without them, there are compiler errors when building the header standalone:

```
In file included from cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.cu:1:
./cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.h:351:32: error: no member named 'warp' in namespace 'cutlass::gemm'; did you mean simply 'warp'?
  static int const kWarpSize = cutlass::gemm::warp::WarpSize<arch::OpClassSimt>::value;
                               ^
./cutlass/include/cutlass/epilogue/warp/tile_iterator_simt.h:49:11: note: 'warp' declared here
namespace warp {
          ^
In file included from cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.cu:1:
./cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.h:351:53: error: no member named 'WarpSize' in namespace 'cutlass::epilogue::warp'
  static int const kWarpSize = cutlass::gemm::warp::WarpSize<arch::OpClassSimt>::value;
                                              ~~~~~~^
./cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.h:351:68: error: no member named 'OpClassSimt' in namespace 'cutlass::arch'
  static int const kWarpSize = cutlass::gemm::warp::WarpSize<arch::OpClassSimt>::value;
                                                             ~~~~~~^
./cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.h:351:82: error: no member named 'value' in the global namespace
  static int const kWarpSize = cutlass::gemm::warp::WarpSize<arch::OpClassSimt>::value;
                                                                               ~~^
./cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.h:367:5: error: use of class template 'OutputTileThreadMap' requires template arguments
    OutputTileThreadMap,
    ^
./cutlass/include/cutlass/epilogue/threadblock/output_tile_thread_map.h:134:8: note: template is declared here
struct OutputTileThreadMap : public OutputTileThreadMapHelpers<Iterations_, Delta_> {
       ^
In file included from cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.cu:1:
./cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.h:391:5: error: use of class template 'OutputTileThreadMap' requires template arguments
    OutputTileThreadMap,
    ^
./cutlass/include/cutlass/epilogue/threadblock/output_tile_thread_map.h:134:8: note: template is declared here
struct OutputTileThreadMap : public OutputTileThreadMapHelpers<Iterations_, Delta_> {
       ^
In file included from cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.cu:1:
./cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.h:405:5: error: unknown type name 'OutputTileIterator'; did you mean 'WarpTileIterator'?
    OutputTileIterator,
    ^
./cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.h:380:9: note: 'WarpTileIterator' declared here
  using WarpTileIterator = cutlass::epilogue::warp::TileIteratorSimtDirect2dConv<
        ^
./cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.h:408:5: error: use of class template 'SharedLoadIterator' requires template arguments
    SharedLoadIterator,
    ^
./cutlass/include/cutlass/epilogue/threadblock/shared_load_iterator.h:67:7: note: template is declared here
class SharedLoadIterator {
      ^
```
2022-12-21 11:40:20 -05:00
Haicheng Wu
3f2bb17722
minor chagnes (#730)
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2022-12-10 14:44:53 -05:00
ANIKET SHIVAM
38193d76e3
Updates for stream-k (#728)
Co-authored-by: Aniket Shivam <ashivam@nvidia.com>
2022-12-08 23:48:10 -05:00
Gregory Meyer (gregjm)
1d7772f218
Add missing #include directive (#727) 2022-12-08 18:58:31 -05:00
Jack Kosaian
df81d847d7
Make Python interface work for non-SM80 targets (#726)
* Make Python interface work for non-SM80 targets

* Remove line in README
2022-12-07 21:53:33 -05:00
Mike Iovine
d6117ca362
Relax stream K gemm alignment constraints (#717)
* Relax stream K gemm alignment constraints

The current alignment requirements are too strict. Make them identical
to the checks for the regular universal gemm.

* Revert "Relax stream K gemm alignment constraints"

This reverts commit 31e80a250e2b0ac4bda2e4b437b39dc5bcd5e845.

* Relax stream K gemm alignment constraints

The current alignment requirements are too strict. Make them identical
to the checks for the regular universal gemm.

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2022-12-07 11:17:49 -05:00
Ali Hassani
9c0518608e
Fix typos in conv problem sizes (#720)
* Fix typos in conv problem sizes

* Typos
2022-12-05 15:54:58 -05:00
Haicheng Wu
9f1f37aa21
misc (#719)
* misc

* minor

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2022-12-05 12:07:20 -05:00
Wenzhuo Liu
84213b0b8e
fix: make arch.h self contained (#714) 2022-12-01 19:25:48 -05:00
tpoisonooo
8567b87d65
Update quickstart.md (#704)
* Update quickstart.md

* Update doxygen_mainpage.md

* Update doxygen_mainpage.md

* Update terminology.md
2022-11-29 21:43:03 -05:00
Aditya Atluri
c975e2ccbb
releaase 2.11 (#703) 2022-11-19 09:02:15 -05:00
Wenzhuo Liu
3c90f6aea6
add #pragma once for header file in example 42 (#698) 2022-11-15 22:50:24 -05:00
seventh
06eb90cc0d
Fix identity sigmoid activation (#659)
* activation support Identity

* fix Sigmoid activation operator() with CUTLASS_HOST_DEVICE
2022-11-09 14:42:23 -05:00
seventh
168ea8b0e1
ensure singleton::get thread safe construct instance (#658)
* ensure singleton::get thread safe construct instance

* fix singleton return reference

Co-authored-by: xuweiqi <xuweiqi117@gmail.com>
2022-11-08 21:44:32 -05:00
Haicheng Wu
012c62c748
bug fixes and enharcement to gemm reductionK fusion (#682)
* add two missing files

* fix bunch of bugs of gemm-reducek fusion and add a device interface

* small changes

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2022-11-03 11:07:50 -04:00
FZC
cc85b64cf6
fix typo (#677) 2022-11-01 14:07:33 -04:00
dan_the_3rd
1b4e24470a
Example 43 - DualGemm (#670)
* Ex50 wip

* IS_PROFILING mode

* MultiStage2 - but is slower

* Add SwiGLU

* Support SplitKSerial reduction
Support not storing D0/D1
Cleanup code

* Option to disable bias

* Renumber example

* Fix build

* Remove references to pb_size_0 / pb_size_1

* Add support for bf16 inputs with float accum

* small changes

Co-authored-by: danthe3rd <danthe3rd>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2022-10-26 14:04:42 -04:00
Jack Kosaian
8c1bf9b784
Bump CUTLASS Python container version (#672)
* Update example 40 README

* Update CUTLASS Python README
2022-10-22 21:09:39 -04:00
Yuriy Chernyshov
7d0dd6706e
Remove excessive includes from examples/41_multi_head_attention (#669)
The rationale behind this change is explained in #563
2022-10-21 22:23:15 -04:00
hlu1
9b47403b2d
Add missing CUTLASS_HOST_DEVICE (#671) 2022-10-21 22:20:38 -04:00
dan_the_3rd
4db6a6140e
ex42: Fused MHA imported from xFormers (#662)
* ex42: Fused MHA imported from xFormers

* Remove std:: references

* Support K>128 in the example

* Support causal option

* Support different head size for V, and different seqlength for KV

* Update FLOPS counter

* Remove bit_cast

* fix build: Replace M_LOG2E

* Add doc

* Revert "Remove bit_cast"

This reverts commit 9662fa86bb7c57c1a015ac0bf52cb52940fbbf80.

* Explicit casts to int32_t for windows build

Co-authored-by: danthe3rd <danthe3rd>
2022-10-17 10:49:33 -04:00
Matthew Nicely
3bf95e90c2
Update labeler.yml 2022-10-13 08:03:28 -04:00
Matthew Nicely
75fed7493e
Update labeler.yml 2022-10-13 08:01:21 -04:00
Matthew Nicely
98b73fc95d
Update labeler.yml 2022-10-13 07:55:33 -04:00
Matthew Nicely
4990e3686d
Update labeler.yml 2022-10-13 07:52:38 -04:00
Matthew Nicely
4b7365388c
Update labeler.yml 2022-10-13 07:32:55 -04:00
Matthew Nicely
0d8405588d
Update labeler.yml 2022-10-12 15:32:38 -04:00
Alexander Freudenberg
cb539dab78
Correct typos in comments (#639)
* Correct typos in comments

Correct comments in code on type of generated distribution. Improve Gaussian RNG to take advantage of Box Muller method

* Inline Box Muller

Added inline function for the Box Muller algorithm and updated code comments to be more concise

* Update tensor_fill.h

* Update tensor_fill.h

* small changes to pass tests

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2022-09-30 22:51:30 -04:00