* Fix MHA kernel
* Extend DualGemm to support batched mode (#5)
Following the GemmUniversalMode::kBatched implementation, batched mode is added to the DualGemm (under examples/45_dual_gemm). DualGemmMode::kBatched and SplitKSerial are not compatible: Status::kErrorInvalidProblem is returned if both are set.
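A minimal sketch of the constraint described above, with hypothetical surrounding code (only DualGemmMode::kBatched, SplitKSerial, and Status::kErrorInvalidProblem come from the change itself):
```
// Hypothetical check: batched mode and serial split-K cannot be combined.
if (args.mode == DualGemmMode::kBatched && SplitKSerial) {
  return Status::kErrorInvalidProblem;   // reject the unsupported combination
}
```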
* Decouple LayoutB0 and LayoutB1 in DualGemm
The DualGemm template assumed a single layout, LayoutB, for both right-hand operand matrices B0 and B1, which is problematic when the two matrices have different layouts. In particular, one matrix may be row-major while the other is a (column) vector that has to be broadcast in column-major with zero stride (e.g., as {B1.device_data(), 0}) so that the DualGemm implementation can process B0 and B1 simultaneously.
In this commit, LayoutB0 and LayoutB1 are decoupled throughout the DualGemm code (device, kernel, and mma). Additionally, the batch strides of B0 and B1 are decoupled to accommodate the column-vector B1 case described above.
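As an illustration of the zero-stride broadcast, a hedged sketch (ElementB, ldb0, and the B0/B1 allocations are hypothetical; only {B1.device_data(), 0} comes from the change itself):
```
#include "cutlass/layout/matrix.h"
#include "cutlass/tensor_ref.h"

// B0: an ordinary row-major matrix with leading dimension ldb0.
cutlass::TensorRef<ElementB, cutlass::layout::RowMajor>
    ref_B0(B0.device_data(), cutlass::layout::RowMajor(ldb0));

// B1: a column vector broadcast across columns; a column-major layout with a
// zero stride makes every column alias the same data, i.e. {B1.device_data(), 0}.
cutlass::TensorRef<ElementB, cutlass::layout::ColumnMajor>
    ref_B1(B1.device_data(), cutlass::layout::ColumnMajor(0));
```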
* Remove comment as no longer relevant
* Revert "Fix MHA kernel"
---------
Co-authored-by: mikeiovine <mikeiovine@fb.com>
* xFormers updates to the fMHA forward pass
* Convert format to BMHK for '41_fused_multi_head_attention_fixed_seqlen'
* Add missing files
* Remove xFormers specific code
* Update fused_multihead_attention_fixed_seqlen.cu
* Rebase and resolve conflicts
* Remove whitespace
---------
Co-authored-by: danthe3rd <danthe3rd>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
Work around a likely GCC 8.x issue with fold expressions and generic lambdas.
The work-around is applied only when the host compiler is GCC 8.x, which avoids any concerns about it hindering inlining for a critical CuTe function (product).
Users can experiment with the work-around on other compilers or compiler versions by defining the following macro:
CUTE_FOLD_GENERIC_LAMBDA_WORKAROUND
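A hedged sketch of how a user might opt in on another compiler; the macro name comes from this change, while the include path and the idea of defining the macro before including CuTe headers are assumptions (passing -DCUTE_FOLD_GENERIC_LAMBDA_WORKAROUND on the compiler command line is the equivalent alternative):
```
// Assumption: defining the macro before any CuTe include opts this translation
// unit into the fold-expression/generic-lambda work-around.
#define CUTE_FOLD_GENERIC_LAMBDA_WORKAROUND
#include <cute/tensor.hpp>   // illustrative include path
```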
Fixes https://github.com/NVIDIA/cutlass/issues/788
Co-authored-by: Mark Hoemmen <mhoemmen@nvidia.com>
This commit changes the declarations of MMA operator class (SIMT, Tensor Core, WMMA Tensor Core) and operator type (multiply-add and so on) to definitions. This is done so that these tag structs are no longer incomplete types, which allows the `typeid` operator to be used on these tag structs. This is necessary for these tag structs to be used as type parameters in [GoogleTest typed tests](https://google.github.io/googletest/advanced.html#typed-tests).
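A minimal sketch of what the now-complete tag types enable, using a hypothetical GoogleTest fixture name (MmaOpClassTest); only the tag structs themselves come from CUTLASS:
```
#include <gtest/gtest.h>
#include "cutlass/arch/mma.h"

template <typename OpClass>
class MmaOpClassTest : public ::testing::Test {};   // hypothetical fixture

// Typed tests apply typeid to their type parameters, so the tags must be
// complete types rather than forward declarations.
using OpClasses = ::testing::Types<cutlass::arch::OpClassSimt, cutlass::arch::OpClassTensorOp>;
TYPED_TEST_SUITE(MmaOpClassTest, OpClasses);

TYPED_TEST(MmaOpClassTest, TagIsCompleteType) {
  EXPECT_GE(sizeof(TypeParam), 1u);   // sizeof likewise requires a complete type
}
```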
This commit adds two `#include` directives so that the definitions of `cutlass::gemm::warp::WarpSize` from "cutlass/gemm/warp/mma.h" and `cutlass::arch::OpClassSimt` from "cutlass/arch/mma.h" are visible to "cutlass/epilogue/threadblock/default_epilogue_simt.h". Without them, there are compiler errors when building the header standalone:
```
In file included from cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.cu:1:
./cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.h:351:32: error: no member named 'warp' in namespace 'cutlass::gemm'; did you mean simply 'warp'?
static int const kWarpSize = cutlass::gemm::warp::WarpSize<arch::OpClassSimt>::value;
^
./cutlass/include/cutlass/epilogue/warp/tile_iterator_simt.h:49:11: note: 'warp' declared here
namespace warp {
^
In file included from cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.cu:1:
./cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.h:351:53: error: no member named 'WarpSize' in namespace 'cutlass::epilogue::warp'
static int const kWarpSize = cutlass::gemm::warp::WarpSize<arch::OpClassSimt>::value;
~~~~~~^
./cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.h:351:68: error: no member named 'OpClassSimt' in namespace 'cutlass::arch'
static int const kWarpSize = cutlass::gemm::warp::WarpSize<arch::OpClassSimt>::value;
~~~~~~^
./cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.h:351:82: error: no member named 'value' in the global namespace
static int const kWarpSize = cutlass::gemm::warp::WarpSize<arch::OpClassSimt>::value;
~~^
./cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.h:367:5: error: use of class template 'OutputTileThreadMap' requires template arguments
OutputTileThreadMap,
^
./cutlass/include/cutlass/epilogue/threadblock/output_tile_thread_map.h:134:8: note: template is declared here
struct OutputTileThreadMap : public OutputTileThreadMapHelpers<Iterations_, Delta_> {
^
In file included from cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.cu:1:
./cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.h:391:5: error: use of class template 'OutputTileThreadMap' requires template arguments
OutputTileThreadMap,
^
./cutlass/include/cutlass/epilogue/threadblock/output_tile_thread_map.h:134:8: note: template is declared here
struct OutputTileThreadMap : public OutputTileThreadMapHelpers<Iterations_, Delta_> {
^
In file included from cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.cu:1:
./cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.h:405:5: error: unknown type name 'OutputTileIterator'; did you mean 'WarpTileIterator'?
OutputTileIterator,
^
./cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.h:380:9: note: 'WarpTileIterator' declared here
using WarpTileIterator = cutlass::epilogue::warp::TileIteratorSimtDirect2dConv<
^
./cutlass/include/cutlass/epilogue/threadblock/default_epilogue_simt.h:408:5: error: use of class template 'SharedLoadIterator' requires template arguments
SharedLoadIterator,
^
./cutlass/include/cutlass/epilogue/threadblock/shared_load_iterator.h:67:7: note: template is declared here
class SharedLoadIterator {
^
```
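For reference, the two directives are:
```
#include "cutlass/arch/mma.h"        // cutlass::arch::OpClassSimt
#include "cutlass/gemm/warp/mma.h"   // cutlass::gemm::warp::WarpSize
```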
* Relax stream K gemm alignment constraints
The current alignment requirements are too strict. Make them identical
to the checks for the regular universal gemm.
* Revert "Relax stream K gemm alignment constraints"
This reverts commit 31e80a250e2b0ac4bda2e4b437b39dc5bcd5e845.
* Relax stream K gemm alignment constraints
The current alignment requirements are too strict. Make them identical
to the checks for the regular universal gemm.
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
* Add two missing files
* Fix several bugs in the GEMM reduce-K fusion and add a device interface
* Small changes
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
* ex42: Fused MHA imported from xFormers
* Remove std:: references
* Support K>128 in the example
* Support causal option
* Support a different head size for V and a different sequence length for K/V
* Update FLOPS counter
* Remove bit_cast
* Fix build: replace M_LOG2E
* Add doc
* Revert "Remove bit_cast"
This reverts commit 9662fa86bb7c57c1a015ac0bf52cb52940fbbf80.
* Explicit casts to int32_t for the Windows build
Co-authored-by: danthe3rd <danthe3rd>