* Fix MHA kernel
* Extend DualGemm to support batched mode (#5)
Following the GemmUniversalMode::kBatched implementation, batched mode is added to the DualGemm (under examples/45_dual_gemm). DualGemmMode::kBatched and SplitKSerial are not compatible: Status::kErrorInvalidProblem is returned if both are set.
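For reference, a minimal standalone sketch of the compatibility check described above (the enum, function, and flag names here are illustrative, not the exact DualGemm source):

    // Illustrative-only model of the rule: batched mode and serial split-K
    // cannot be combined, so the argument check rejects that combination.
    #include <cassert>

    enum class DualGemmMode { kGemm, kBatched };
    enum class Status { kSuccess, kErrorInvalidProblem };

    // SplitKSerial is a compile-time template flag in the real example;
    // it is modeled as a runtime bool here purely for illustration.
    Status can_implement(DualGemmMode mode, bool split_k_serial) {
      if (mode == DualGemmMode::kBatched && split_k_serial) {
        return Status::kErrorInvalidProblem;
      }
      return Status::kSuccess;
    }

    int main() {
      assert(can_implement(DualGemmMode::kBatched, true) == Status::kErrorInvalidProblem);
      assert(can_implement(DualGemmMode::kBatched, false) == Status::kSuccess);
      assert(can_implement(DualGemmMode::kGemm, true) == Status::kSuccess);
      return 0;
    }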
* Decouple LayoutB0 and LayoutB1 in DualGemm
The DualGemm template assumed a single layout, LayoutB, for both right-hand operand matrices B0 and B1. This is problematic when the two matrices have different layouts. In particular, one of them may be row-major while the other is a column vector that has to be broadcast in column-major with zero stride (e.g., as {B1.device_data(), 0}) so that the DualGemm implementation can process B0 and B1 simultaneously.
In this commit, LayoutB0 and LayoutB1 are decoupled throughout the DualGemm code (device, kernel, and mma). Additionally, the batch strides of B0 and B1 are also decoupled to accommodate the column vector B1 case described above.
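As a rough illustration of the zero-stride broadcast described above (assumes the CUTLASS TensorRef and layout headers; the function name and element type are chosen for illustration only):

    // Sketch: B0 is a regular row-major matrix, while B1 is a column vector
    // broadcast as a column-major matrix with stride 0, so every logical
    // column aliases the same device buffer.
    #include "cutlass/tensor_ref.h"
    #include "cutlass/layout/matrix.h"

    void make_b_refs(float const *ptr_B0, int ldb0, float const *ptr_B1_vector) {
      // B0 keeps its genuine leading dimension in its own layout.
      cutlass::TensorRef<float const, cutlass::layout::RowMajor> ref_B0(ptr_B0, ldb0);

      // A stride of 0 is what {B1.device_data(), 0} expresses in the example:
      // the vector is reused for every column of the logical B1 matrix.
      cutlass::TensorRef<float const, cutlass::layout::ColumnMajor> ref_B1(ptr_B1_vector, 0);

      (void)ref_B0;
      (void)ref_B1;
    }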
* Remove comment as no longer relevant
* Revert "Fix MHA kernel"
---------
Co-authored-by: mikeiovine <mikeiovine@fb.com>
* xFormer updates to fMHA FW
* Convert format to BMHK for '41_fused_multi_head_attention_fixed_seqlen'
* Add missing files
* Remove xFormers specific code
* Update fused_multihead_attention_fixed_seqlen.cu
* rebase and solve conflicts
* remove white space
---------
Co-authored-by: danthe3rd <danthe3rd>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
* add two missing files
* fix a bunch of bugs in the GEMM + reduce-K fusion and add a device interface
* small changes
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
* ex42: Fused MHA imported from xFormers
* Remove std:: references
* Support K>128 in the example
* Support causal option
* Support a different head size for V and a different sequence length for K/V
* Update FLOPS counter
* Remove bit_cast
* fix build: Replace M_LOG2E
* Add doc
* Revert "Remove bit_cast"
This reverts commit 9662fa86bb7c57c1a015ac0bf52cb52940fbbf80.
* Explicit casts to int32_t for windows build
Co-authored-by: danthe3rd <danthe3rd>
* Remove redundant <fstream> includes
* Fix fstream in examples/
* Fix <fstream> in test/
* Use consistent order for <fstream> (always after <iostream>)
* Remove an unneeded include in a file where std::ofstream usage is commented out
Co-authored-by: Ivan Komarov <dfyz@yandex-team.ru>
* add split k wgrad example
* wgrad done
* begin transposed conv2d example
* update transposed conv2d example and add ref check
* update doc for conv2d transpose example
* add license
* add wgrad doc
* more clarification on GEMM output type
* typo fix
* clean up indent
* address comments
* rename example numbers to 34 and 35
* GEMM -> Implicit GEMM
* Revert "rename example numbers to 34 and 35"
This reverts commit 551a808c227216e9e38d4472ba8ff020557b8500.
* transposed_conv2d is 34
* add compiler and device version check to exit gracefully
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
When split-k is enabled, we should set alpha to 1 and beta to 0 for the
split-k gemm kernel.
The fix is from hwu36; I only fixed some minor typos on top of it.
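A minimal sketch of that rule (the struct and function names are hypothetical, not CUTLASS APIs): the partial-product GEMM runs with alpha = 1 and beta = 0, and the user's scaling is applied later when the split-K slices are reduced.

    // Hypothetical illustration: defer the user's alpha/beta to the reduction
    // step when serial split-K writes unscaled partial sums.
    struct EpilogueScaling {
      float alpha;
      float beta;
    };

    EpilogueScaling gemm_scaling_for(bool split_k_enabled, float user_alpha, float user_beta) {
      if (split_k_enabled) {
        // The split-K GEMM emits raw partial products; the reduction kernel
        // applies user_alpha and user_beta when combining the slices.
        return {1.0f, 0.0f};
      }
      return {user_alpha, user_beta};
    }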
* Removed trivial copy constructors on parameter classes to enable device-side launch of CUTLASS kernels
* Added SFINAE to the `TensorRef(NonConstTensorRef const&)` constructor to avoid making it a copy-constructor for device code
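An illustrative reduction of the SFINAE idea above (TensorRefLike is a hypothetical stand-in, not the real cutlass::TensorRef): constraining the converting constructor keeps it from acting as a copy constructor in the non-const case, so the implicitly generated trivial copy constructor survives and the parameter class stays trivially copyable, which device-side launch relies on.

    #include <type_traits>

    template <typename Element>
    struct TensorRefLike {
      Element *ptr = nullptr;

      TensorRefLike() = default;
      TensorRefLike(Element *p) : ptr(p) {}

      // Converting constructor from the non-const variant. The enable_if
      // removes it from overload resolution unless Element is const, so the
      // non-const instantiation keeps its implicit (trivial) copy constructor
      // instead of a user-provided one.
      template <typename E = Element,
                typename = typename std::enable_if<std::is_const<E>::value>::type>
      TensorRefLike(TensorRefLike<typename std::remove_const<E>::type> const &other)
          : ptr(other.ptr) {}
    };

    static_assert(std::is_trivially_copyable<TensorRefLike<float>>::value,
                  "the non-const variant remains trivially copyable");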
* std => platform
* fix affine2
* really fix affine2
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
* Fix the build of cutlass/gemm/device/gemm_array.h and add a demo for GemmArray
* Add a reference to GemmArray to the docs
Co-authored-by: Ivan Komarov <dfyz@yandex-team.ru>
CUTLASS 2.7
* Mainloop fusion for GEMM: summation over A or B
* Strided DGRAD (optimized iterators)
* Half-precision GELU_taylor activation functions
  * Use these when the accumulation and epilogue compute types are both cutlass::half_t
* Tuning and bug fixes to the fused GEMM + GEMM example
* Support for Convolutions with smaller than 128b alignment: see examples
* Caching of results to accelerate Convolution unit tests
  * Can be enabled or disabled by running cmake .. -DCUTLASS_TEST_ENABLE_CACHED_RESULTS=OFF
* Corrections and bug fixes reported by the CUTLASS community
  * Thank you for filing these issues!
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
Co-authored-by: Manish Gupta <manigupta@nvidia.com>
Co-authored-by: Dustyn Blasig <dblasig@nvidia.com>
Co-authored-by: Andrew Kerr <akerr@nvidia.com>