* xFormers updates to the fMHA forward pass
* Convert format to BMHK for '41_fused_multi_head_attention_fixed_seqlen' (see the layout sketch after this list)
* Add missing files
* Remove xFormers-specific code
* Update fused_multihead_attention_fixed_seqlen.cu
* Rebase and resolve conflicts
* Remove whitespace
---------
Co-authored-by: danthe3rd <danthe3rd>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
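For context, BMHK refers to the xFormers tensor layout [batch B, query length M, heads H, head dim K]. Below is a minimal, hypothetical sketch of indexing a contiguous BMHK tensor; the helper type is illustrative and not part of the example's code.

```cpp
// Hypothetical helper illustrating the BMHK layout: a contiguous tensor
// indexed as [batch B, seqlen M, heads H, head_dim K] with strides
// (M*H*K, H*K, K, 1).
#include <cstdint>

struct BMHKView {
  float const *data;
  int64_t B, M, H, K;

  // Linear offset of element (b, m, h, k).
  int64_t offset(int64_t b, int64_t m, int64_t h, int64_t k) const {
    return ((b * M + m) * H + h) * K + k;
  }

  float at(int64_t b, int64_t m, int64_t h, int64_t k) const {
    return data[offset(b, m, h, k)];
  }
};
```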
* Add two missing files
* Fix a bunch of bugs in the GEMM with K-reduction fusion and add a device interface
* Small changes
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
* ex42: Fused MHA imported from xFormers
* Remove std:: references
* Support K>128 in the example
* Support causal option (see the masking sketch after this list)
* Support a different head size for V and a different sequence length for K/V
* Update the FLOPs counter
* Remove bit_cast
* Fix build: replace M_LOG2E
* Add doc
* Revert "Remove bit_cast"
This reverts commit 9662fa86bb7c57c1a015ac0bf52cb52940fbbf80.
* Explicit casts to int32_t for the Windows build
Co-authored-by: danthe3rd <danthe3rd>
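For context on the causal option: causal attention masks out every key position that comes after the query position, so softmax assigns those keys zero weight. Here is a scalar sketch of the idea; it is illustrative only and not the kernel's fused implementation.

```cpp
// Illustrative causal mask over an [M x N] row-major matrix of Q*K^T logits:
// entries with key index n greater than query index m are driven to -inf so
// that softmax gives them zero probability.
#include <limits>

void apply_causal_mask(float *logits, int M, int N) {
  for (int m = 0; m < M; ++m) {
    for (int n = m + 1; n < N; ++n) {
      logits[m * N + n] = -std::numeric_limits<float>::infinity();
    }
  }
}
```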
* Remove redundant <fstream> includes
* Fix fstream in examples/
* Fix <fstream> in test/
* Use consistent order for <fstream> (always after <iostream>)
* Remove an unneeded include in a file where std::ofstream usage is commented out
Co-authored-by: Ivan Komarov <dfyz@yandex-team.ru>
* Add split-K wgrad example
* Wgrad done
* Begin transposed conv2d example
* Update transposed conv2d example and add a reference check
* Update doc for the conv2d transpose example
* Add license
* Add wgrad doc
* More clarification on the GEMM output type
* Fix typo
* Clean up indentation
* Address review comments
* rename example numbers to 34 and 35
* GEMM -> Implicit GEMM
* Revert "rename example numbers to 34 and 35"
This reverts commit 551a808c227216e9e38d4472ba8ff020557b8500.
* transposed_conv2d is now example 34
* Add compiler and device version checks to exit gracefully (see the sketch after this list)
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
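The graceful-exit guard typically follows the pattern sketched below; the minimum toolkit and compute-capability versions here are placeholders, not the example's actual requirements.

```cpp
// Hypothetical guard that exits gracefully on unsupported toolkits/devices.
#include <iostream>
#include <cuda_runtime.h>

bool example_supported() {
#if !defined(__CUDACC_VER_MAJOR__) || (__CUDACC_VER_MAJOR__ < 11)
  // Placeholder minimum: require nvcc from CUDA 11 or newer.
  std::cerr << "This example requires CUDA 11.0 or newer." << std::endl;
  return false;
#else
  cudaDeviceProp props;
  if (cudaGetDeviceProperties(&props, 0) != cudaSuccess) {
    std::cerr << "Failed to query device properties." << std::endl;
    return false;
  }
  // Placeholder minimum: require an SM80 (compute capability 8.0) GPU.
  if (props.major < 8) {
    std::cerr << "This example requires an SM80+ GPU." << std::endl;
    return false;
  }
  return true;
#endif
}
```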
When split-K is enabled, we should set alpha to 1 and beta to 0 for the
split-K GEMM kernel: each slice must emit its raw partial sum, and the
user's alpha and beta are applied exactly once, in the final reduction.
The fix is from hwu36; I only fixed some minor typos along with it.
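To make the alpha/beta reasoning concrete, here is an illustrative host-side sketch of the parallel split-K reduction; the names and shapes are hypothetical, not CUTLASS's reduction kernel.

```cpp
// Each split-K slice k writes a raw partial product Partial_k = A_k * B_k
// (computed with alpha=1, beta=0). The user's scaling is applied only here:
//   D = alpha * (sum_k Partial_k) + beta * C
void reduce_split_k(float const *partials,  // [slices][M*N] partial sums
                    int slices, int mn,
                    float alpha, float beta,
                    float const *C, float *D) {
  for (int i = 0; i < mn; ++i) {
    float acc = 0.f;
    for (int k = 0; k < slices; ++k) {
      acc += partials[k * mn + i];
    }
    D[i] = alpha * acc + beta * C[i];  // scaling happens once, in the reduction
  }
}
```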
* Removed trivial user-defined copy constructors on parameter classes so they stay trivially copyable, enabling device-side launch of CUTLASS kernels
* Added SFINAE to the `TensorRef(NonConstTensorRef const&)` constructor to avoid it being declared as a copy constructor in device code (see the sketch after this list)
* std => platform
* Fix affine2
* Really fix affine2
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
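A minimal sketch of the SFINAE pattern referenced above, written with std type traits for brevity (CUTLASS would use its platform equivalents); it is illustrative, not the library's exact code. When Element is already non-const, the converting constructor would otherwise be declared as a user-provided copy constructor and destroy trivial copyability.

```cpp
#include <type_traits>

template <typename Element>
struct Ref {
  using NonConstElement = typename std::remove_const<Element>::type;

  Element *ptr = nullptr;

  Ref() = default;

  // Converting constructor from the non-const variant. The enable_if turns
  // this into a template that vanishes when Element is already non-const;
  // otherwise Ref(Ref<NonConstElement> const&) would be a user-provided copy
  // constructor, and the type would no longer be trivially copyable --
  // which device-side kernel launch requires.
  template <typename E = Element,
            typename std::enable_if<
                !std::is_same<E, NonConstElement>::value, int>::type = 0>
  Ref(Ref<NonConstElement> const &other) : ptr(other.ptr) {}
};

static_assert(std::is_trivially_copyable<Ref<float>>::value,
              "Ref<float> keeps its implicit, trivial copy constructor");
```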
* Fix the build of cutlass/gemm/device/gemm_array.h and add a demo for GemmArray
* Add a reference to GemmArray to the docs
Co-authored-by: Ivan Komarov <dfyz@yandex-team.ru>
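For orientation, here is a rough sketch of driving a GemmArray-style batched GEMM with per-batch pointer arrays. The Arguments layout below is an assumption based on the batched pointer-array pattern; consult the new demo for the authoritative interface.

```cpp
// Hedged sketch: one problem size shared by all batches, with arrays of
// per-batch pointers for A, B, C, and D. Argument order is an assumption.
#include "cutlass/gemm/device/gemm_array.h"

using GemmArrayOp = cutlass::gemm::device::GemmArray<
    float, cutlass::layout::ColumnMajor,   // A
    float, cutlass::layout::ColumnMajor,   // B
    float, cutlass::layout::ColumnMajor>;  // C / D

cutlass::Status run_batched(cutlass::gemm::GemmCoord problem,
                            float const * const *ptrs_A, int lda,
                            float const * const *ptrs_B, int ldb,
                            float const * const *ptrs_C, int ldc,
                            float * const *ptrs_D, int ldd,
                            float alpha, float beta, int batch_count) {
  GemmArrayOp gemm_op;
  return gemm_op({problem,
                  ptrs_A, lda, ptrs_B, ldb, ptrs_C, ldc, ptrs_D, ldd,
                  {alpha, beta}, batch_count});
}
```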
CUTLASS 2.7
* Mainloop fusion for GEMM: summation over A or B
* Strided DGRAD (optimized iterators)
* Half-precision GELU_taylor activation functions
  * Use these when the accumulation and epilogue compute types are all cutlass::half_t (see the sketch after these notes)
* Tuning and bug fixes to the fused GEMM + GEMM example
* Support for convolutions with alignment smaller than 128 bits: see examples
* Caching of results to accelerate convolution unit tests
  * Can be enabled or disabled by running cmake .. -DCUTLASS_TEST_ENABLE_CACHED_RESULTS=OFF
* Corrections and bug fixes reported by the CUTLASS community
  * Thank you for filing these issues!
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
Co-authored-by: Manish Gupta <manigupta@nvidia.com>
Co-authored-by: Dustyn Blasig <dblasig@nvidia.com>
Co-authored-by: Andrew Kerr <akerr@nvidia.com>
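A small, hedged sketch of the half-precision GELU_taylor path noted above; the functor name matches the release note, but treat the snippet as illustrative rather than a tested configuration.

```cpp
// Illustrative use of the GELU_taylor functor with cutlass::half_t, i.e. the
// case where the accumulator and epilogue compute types are both half_t.
#include "cutlass/half.h"
#include "cutlass/epilogue/thread/activation.h"

int main() {
  cutlass::epilogue::thread::GELU_taylor<cutlass::half_t> gelu;
  cutlass::half_t y = gelu(cutlass::half_t(0.5f));  // elementwise activation
  return (float(y) > 0.0f) ? 0 : 1;
}
```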