Haicheng Wu
012c62c748
bug fixes and enhancement to gemm reductionK fusion ( #682 )
...
* add two missing files
* fix a bunch of bugs in gemm-reducek fusion and add a device interface
* small changes
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2022-11-03 11:07:50 -04:00
dan_the_3rd
1b4e24470a
Example 43 - DualGemm ( #670 )
...
* Ex50 wip
* IS_PROFILING mode
* MultiStage2 - but is slower
* Add SwiGLU
* Support SplitKSerial reduction
Support not storing D0/D1
Cleanup code
* Option to disable bias
* Renumber example
* Fix build
* Remove references to pb_size_0 / pb_size_1
* Add support for bf16 inputs with float accum
* small changes
Co-authored-by: danthe3rd <danthe3rd>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2022-10-26 14:04:42 -04:00
hlu1
9b47403b2d
Add missing CUTLASS_HOST_DEVICE ( #671 )
2022-10-21 22:20:38 -04:00
dan_the_3rd
4db6a6140e
ex42: Fused MHA imported from xFormers ( #662 )
...
* ex42: Fused MHA imported from xFormers
* Remove std:: references
* Support K>128 in the example
* Support causal option
* Support different head size for V, and different seqlength for KV
* Update FLOPS counter
* Remove bit_cast
* fix build: Replace M_LOG2E
* Add doc
* Revert "Remove bit_cast"
This reverts commit 9662fa86bb7c57c1a015ac0bf52cb52940fbbf80.
* Explicit casts to int32_t for windows build
Co-authored-by: danthe3rd <danthe3rd>
2022-10-17 10:49:33 -04:00
Ying Zhang
dadc881a96
Bug fix for gemm broadcast ( #650 )
...
* gemm_universal_with_broadcast, +2 sources.
* Revert "gemm_universal_with_broadcast, +2 sources."
This reverts commit fb063251f2144a091f12c9abfce7e1713f2d1c9e.
* gemm broadcast bug fix
2022-09-30 10:00:38 -04:00
Wenzhuo Liu
cd37e82492
change unused class member to local var ( #646 )
2022-09-28 23:52:35 -04:00
Wenzhuo Liu
7a458f00a6
fix(permute.h): incorrect comment in Tensor5DPermute20314 ( #637 )
...
* fix(permute.h): incorrect comment in `Tensor5DPermute20314`
* typo in usage in example 39
2022-09-22 09:21:13 -04:00
Tianqi Zhang (张天启)
9f2e3faa69
fix call of GELU_Taylor in LinearCombinationGeneric ( #634 )
2022-09-20 21:00:55 -04:00
Ying Zhang
a821280dc7
Gemm broadcast ( #632 )
...
* gemm_universal_with_broadcast, +2 sources.
* Revert "gemm_universal_with_broadcast, +2 sources."
This reverts commit fb063251f2144a091f12c9abfce7e1713f2d1c9e.
* gemm_universal_with_broadcast separated version.
* Update copyright banner.
* update banner
2022-09-20 10:37:12 -04:00
Andrew Kerr
fc9ebc645b
CUTLASS 2.10 bug fixes and minor updates. ( #626 )
2022-09-15 16:20:33 -04:00
alexfreudenberg
2cc2c7ba1f
Add set_k_partition function ( #624 )
...
A member function set_k_partition is required for the instantiation of cutlass::gemm::kernel::Gemm, even though SplitKSerial is false
2022-09-13 22:34:20 -04:00
ANIKET SHIVAM
e773429f7e
CUTLASS 2.10 updates ( #622 )
...
Co-authored-by: Aniket Shivam <ashivam@nvidia.com>
2022-09-12 21:26:30 -04:00
Jack Kosaian
f29d8f7ca9
Include vector in base_grouped.h ( #618 )
2022-09-06 13:21:23 -04:00
ANIKET SHIVAM
b72cbf957d
CUTLASS 2.10 ( #615 )
...
Co-authored-by: Aniket Shivam <ashivam@nvidia.com>
2022-09-03 18:48:46 -04:00
Cliff Burdick
ca23ff7924
Fixed typo in class name ( #608 )
2022-08-29 20:51:52 -04:00
Cliff Burdick
1c3d400b14
Added value_type trait to complex to make it an easier drop-in replacement for std::complex. ( #607 )
2022-08-28 01:12:40 -04:00
Cliff Burdick
abafbf2afd
Missing comma in trmm header ( #604 )
2022-08-25 16:07:33 -04:00
Haicheng Wu
497b499d9d
Add residual support for shmem staging iterator used in back-to-back GEMM fusion. This allows support of problem_size_0_n that is not multiple of 32. ( #590 )
...
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2022-08-15 11:19:24 -04:00
dan_the_3rd
25ebf15d02
Ensure all arch::Mma specializations have ElementC set ( #576 )
...
Co-authored-by: danthe3rd <danthe3rd@users.noreply.github.com>
2022-07-22 23:53:03 -04:00
Haicheng Wu
e7a61c761a
fix race condition when h < stride_h or w < stride_w ( #562 )
...
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2022-07-12 16:37:08 -04:00
seventh
fb379eaa5b
epilogue leaky relu support ScaleType ( #564 )
...
Co-authored-by: xuweiqi <xuweiqi117@gmail.com>
2022-07-11 17:30:55 -04:00
Bing Xu
1eb6355182
[activation] tanh ( #550 )
...
Co-authored-by: Bing Xu <bingxu@fb.com>
2022-07-02 08:00:45 -04:00
Yujia Zhai
04a9777b87
Softmax ( #546 )
...
* add test layernorm g-mem version
* Delete include/configure directory
* Delete examples/test_layernorm directory
* Update gemm_with_softmax.h
* Update gemm_softmax.cu
* Update linear_combination.h
* Update fast_math.h
* remove redundant vars
Co-authored-by: yujia.zhai <yujia.zhai@bytedance.com>
Co-authored-by: yuzhai <yuzhai@nvidia.com>
2022-07-02 01:19:18 -04:00
Haicheng Wu
e45e773436
Update linear_combination_generic.h ( #472 )
...
add `skip_elementwise_` to support serial splitk in `linear_combination_generic.h`
2022-06-28 07:29:38 -04:00
Haicheng Wu
9ab9110168
add leaky relu ( #542 )
...
Authored-by: Haicheng Wu <haichengw@nvidia.com>
2022-06-26 10:07:50 -04:00
Jack Kosaian
fa56763c25
Fix occupancy calculation for grouped GEMM ( #532 )
2022-06-18 19:53:59 -04:00
LiuWei
25e26a6e51
fix bugs in linear_combination_generic.h missing include cutlass/epilogue/thread/scale_type.h ( #531 )
2022-06-17 23:35:14 -04:00
Pei Sun
dceefe4f64
Increment stride correctly in warp iterator. ( #516 )
...
Co-authored-by: peisun1115 <peis@google.com>
2022-06-06 12:33:36 -04:00
Pei Sun
c3881d097e
Fix a comment about LDSM layout. ( #514 )
...
Co-authored-by: peisun1115 <peis@google.com>
2022-06-04 23:04:00 -04:00
Pei Sun
a29dfb1c63
Fix a bug to increment stride tile correctly ( #503 )
...
* Fix a bug to increment stride tile correctly
* Update regular_tile_access_iterator_tensor_op.h
Co-authored-by: peisun1115 <peis@google.com>
Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
2022-06-03 22:54:52 -04:00
Mike Iovine
c4cf0dad82
Fix init-self compiler warnings ( #493 )
...
Fix a few warnings caused by trying to initialize a class member
with itself. These warnings can turn into errors if you compile
with `-Winit-self`.
2022-05-11 00:35:28 -04:00
TonyZhao
ddd8f9cf41
update float < int32_t * 4 ( #488 )
...
Co-authored-by: 赵俊涛 <zhaojuntao@zhaojuntaos-MacBook-Pro.local>
2022-05-04 13:36:05 -04:00
Haicheng Wu
ec2b4fd85d
b2b bias vector support ( #482 )
...
* b2b bias vector support
* add files
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2022-04-30 04:16:15 -07:00
Stepan Tezyunichev
86ce09aed1
2.9 fixes for nvrtc ( #480 )
...
* Use platform::is_same instead of std::is_same
* Don't hide cuComplex include from nvrtc
* Typo fixed
* Remove comment rename
2022-04-29 09:06:52 -04:00
Janusz Lisiecki
8c339ac039
Fix compilation in clang ( #478 )
...
- adds missing commas
- adjusts misaligned usage of CUTLASS_DEVICE between
template declaration and specializations
Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
2022-04-28 14:22:06 -04:00
Haicheng Wu
e49f690fd7
Update linear_combination_generic.h
2022-04-28 14:04:53 -04:00
Stepan Tezyunichev
71def2f084
Use platform:: instead of std::abs and std::conditional ( #452 )
...
* Fixed template struct/class mismatch
* Use platform implementation instead of std::abs and std::conditional during nvrtc compilation
* Revert absolute_value() usage
2022-04-25 14:40:22 -04:00
Fujun Han
dd77fadc70
Remove redundant offset def and init in shared_load_iterator.h ( #456 )
...
Signed-off-by: Fujun Han <fujun.han@iluvatar.ai>
2022-04-24 16:31:00 -04:00
Stepan Tezyunichev
be4578d517
Fixed template struct/class mismatch ( #453 )
2022-04-24 16:30:21 -04:00
Andrew Kerr
12f4108ac2
CUTLASS 2.9 ( #468 )
2022-04-23 15:02:38 -04:00
Feng Shijie
dd571f0edb
[style] fix code indentation ( #449 )
...
* [docs] fix typo in media/docs/layout.md
* [docs] fix comment error
* fix typo in include/cutlass/arch/simd_61.h
* fix stride comment errors in TensorLayout
* fix indentation
2022-04-03 21:13:17 -04:00
Haojin Yang
bc45e2c023
fixed datatype error of numeric_limit for uint1b_t ( #419 )
...
Co-authored-by: Haojin Yang <haojin.yang@.hpi.uni-potsdam.de>
2022-03-22 12:30:30 -04:00
Janusz Lisiecki
8f1fe7a132
Fix separate compilation -dc ( #433 )
...
* Fix separate compilation `-dc`
- when cutlass is included in multiple compilation units
compiled with `-dc`, the OOB_NAN_F16x8 device constant is
instantiated multiple times, causing a
Multiple definition of '_ZN7cutlass4arch13OOB_NAN_F16x8E' error.
This PR makes this variable a local constant, as it is not
modified at runtime
Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
* Fix
Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
* Test GH
Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
* Revert test GH
Signed-off-by: Janusz Lisiecki <jlisiecki@nvidia.com>
2022-03-22 12:21:18 -04:00
Feng Shijie
cd39c75e25
Fix typo in docs, code comments ( #429 )
...
* [docs] fix typo in media/docs/layout.md
* [docs] fix comment error
* fix typo in include/cutlass/arch/simd_61.h
* fix stride comment errors in TensorLayout
2022-03-15 21:54:36 -04:00
HouQiming
96a11a1ef3
Removed trivial copy constructors on parameter classes to enable devi… ( #366 )
...
* Removed trivial copy constructors on parameter classes to enable device-side launch of CUTLASS kernels
* Added SFINAE to the `TensorRef(NonConstTensorRef const&)` constructor to avoid making it a copy-constructor for device code
* std => platform
* fix affine2
* really fix affine2
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2022-02-28 21:34:02 -05:00
Ivan Komarov
e96f00586c
Make cutlass::gemm::device::GemmArray usable ( #295 )
...
* Fix the build of cutlass/gemm/device/gemm_array.h and add a demo for GemmArray
* Add a reference to GemmArray to the docs
Co-authored-by: Ivan Komarov <dfyz@yandex-team.ru>
2022-02-17 20:01:05 -05:00
Jongsoo Park
1db6971a8d
Remove unused gemm_k_iterations in GemmKernel::Params ( #406 )
...
Otherwise we get gemm_k_iterations is uninitialized warnings.
2022-02-16 09:52:45 -05:00
Bing Xu
d0d941efc7
[hardswish] correct implementation ( #403 )
...
* [hardswish] correct implementation
* seems working
* hardswish fp32/fp16x2 optimization
* [relu] half2 support
* add relu0; add multiply_add_relu0;
* cleanup
Co-authored-by: Bing Xu <bingxu@fb.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2022-02-09 14:28:53 -05:00
Andrew Kerr
8a951b2940
Enable convolution with fused epilogue for Volta Tensor Cores ( #402 )
...
* Enabled convolution with epilogue fusion for Volta Tensor Cores.
* Compilation fixes
* Disabled testing Volta on Ampere architectures.
2022-01-30 23:24:50 -05:00
masahi
c2ee13a0fe
Add epilogue functor for residual block fusion ( #391 )
...
* Add epilogue functor for residual block fusion
* Do not run split-k tests when ActivationOp is not Identity
* explain TestSplitK param
* return early
2021-12-29 22:53:40 -05:00