571 Commits

Author SHA1 Message Date
dan_the_3rd
b0e09d7cd3
Fix cutlass python library with cuda 12.6.2.post1 (#1942)
* Fix `cutlass` python library with cuda `12.6.2.post1`

Previously we had this error:
```
  File "/storage/home/cutlass/python/cutlass/backend/operation.py", line 39, in <listcomp>
    _version_splits = [int(x) for x in __version__.split("rc")[0].split(".")]
                       ^^^^^^
ValueError: invalid literal for int() with base 10: 'post1'
```

* Update sm90_utils.py

* Update generator.py

* Update python/cutlass_library/generator.py

Co-authored-by: Jack Kosaian <jackkosaian@gmail.com>

* Update python/cutlass_library/sm90_utils.py

Co-authored-by: Jack Kosaian <jackkosaian@gmail.com>

---------

Co-authored-by: Jack Kosaian <jackkosaian@gmail.com>
2024-11-18 09:06:32 -05:00
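The `ValueError` in #1942 comes from calling `int()` on every dot-separated piece of the version string. `"12.6.2.post1".split("rc")[0]` leaves the string unchanged, so `.split(".")` yields `["12", "6", "2", "post1"]`, and `int("post1")` fails. Below is a minimal sketch of a more tolerant parse. It keeps only the leading run of numeric components, so suffixes such as `rc1` or `.post1` are ignored. The helper name `parse_release_tuple` is hypothetical. This is not the code merged in #1942.

```python
import re

# Hypothetical helper (not part of cutlass): extract the leading numeric
# release segment, e.g. "12.6.2" from "12.6.2.post1" or "12.4" from "12.4rc1".
_RELEASE_RE = re.compile(r"\d+(?:\.\d+)*")


def parse_release_tuple(version: str) -> tuple:
    """Return the numeric release components of a version string as ints."""
    match = _RELEASE_RE.match(version)
    if match is None:
        raise ValueError(f"no numeric release segment in {version!r}")
    # Only digits and dots were matched, so every piece converts cleanly.
    return tuple(int(part) for part in match.group(0).split("."))


if __name__ == "__main__":
    for v in ("12.6.2.post1", "12.4rc1", "12.6.2"):
        print(v, "->", parse_release_tuple(v))
```

Because the result is a tuple, it compares naturally with minimum-version checks such as `parse_release_tuple(v) >= (12, 4)`. If the third-party `packaging` library is available, `packaging.version.Version(v).release` returns the same tuple and follows PEP 440 rules.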
Lain
8aa95dbb88
Fix the racing condition of mixed-input gemm when writing the registers (#1931)
* move two warpgroup_wait

* merge main

---------

Co-authored-by: Siyuan Fu <siyuanf@nvidia.com>
2024-11-08 13:15:54 -05:00
LiYu Lu
d656afbd2a
fix undefined in device code error (#1880) 2024-11-06 14:56:54 -05:00
LiuQiang
32e3c38aef
remove restriction of stride == kernel in nhwc_pooling (#1896) 2024-11-06 14:54:53 -05:00
Wenlei Bao
9004ed2d1b
Update publications (#1912) 2024-11-06 14:54:15 -05:00
chenwei
19f51596e8
feat: support kFactor 8 used in mma tensor op tile iterator (#1512) 2024-10-29 11:56:59 -04:00
azhurkevich
e8a8b69365
Refactor some GroupedGEMM logic (#1899) 2024-10-25 20:14:01 -04:00
LiYu Lu
08a49953a0
Add a print for the uint{x}b_t type. (#1871) 2024-10-24 14:39:22 -04:00
Caleb_Du
a424ca6cf9
fix wrong A/BLayout in MMA_Traits for binary mma and append other MMA_Traits support (#1856)
* fix wrong A/BLayout in MMA_Traits<SM80_16x8x256_S32U1U1S32_TN_XORPOPC> and append support for m8n8k128, m16n8k128 mma.and.popc in MMA_Traits instantiation

* add "print" template for subbyte_reference<T>
2024-10-24 14:38:35 -04:00
Lain
be692b48b0
remove redundant hardcoded packing configs in mixed dtype gemm (#1894)
Co-authored-by: Siyuan Fu <siyuanf@nvidia.com>
2024-10-23 14:24:09 -04:00
侯奇
12626bcfe4
Update gemm_f16n_f16t_f32t_tensor_op_f32_sm80.cu with include "cutlass/gemm/device/gemm_universal.h" (#1569)
fix compile with `cmake .. -DCUTLASS_ENABLE_TESTS=ON -DCUTLASS_TEST_LEVEL=2`
2024-10-23 12:56:36 -04:00
MaxAkaAltmer
f02913c34e
Include of regular_tile_iterator.h fixed for NVRTC (#1765)
* Include of regular_tile_iterator.h fixed for NVRTC

* More include fixed for NVRTC
2024-10-23 12:55:59 -04:00
103yiran
03e3bffaec
Adjusting code indentation (#1639) 2024-10-23 12:55:02 -04:00
Lei Mao
e5f3caf145
Fix README (#1658)
* Fix README

* Improve README

---------

Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
2024-10-23 12:52:43 -04:00
Bogumil Sapinski Mobica
83ae20c740
added mapping for bf16 to torch::kBFloat16 (#1843)
Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
2024-10-23 12:48:31 -04:00
Xinyu Yang
b0c09ed077
fix by adding public (#1753) 2024-10-23 12:45:58 -04:00
sijialou
ea69cc2849
fix typo (#1853) 2024-10-23 12:45:28 -04:00
Xinyu Yang
f3a3bfcbf2
add maximum support (#1833) 2024-10-23 12:44:56 -04:00
Sergey Klevtsov
d65266a868
Add all supported GMMA shapes (#1890) 2024-10-22 18:13:36 -04:00
Tri Dao
5b50a8faaf
Add GMMA shape m64n40k16 (#1864) 2024-10-21 20:41:47 -04:00
Sergey Klevtsov
08101d9d0c
Improve sm90 mixed dtype kernel (#1883) 2024-10-17 20:06:38 -04:00
Haicheng Wu
755194a7bd add is_last_tile 2024-10-17 12:11:02 -07:00
Saagar Jha
53668799b2
Handle MNK Sm90{Row, Col}Reduction problem shapes (#1803) 2024-10-14 19:46:20 -04:00
Yujia Zhai
cc3c29a81a
CUTLASS 3.6.0 (#1850)
* v3.6

* update changelog

* update readme

* fix typo

* fixing typos

* hopper gemm with weight prefetch

---------

Co-authored-by: yuzhai <yuzhai@nvidia.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2024-10-09 15:33:27 -04:00
Feng Shijie
0837a2a00a
Fix typo in comment (#1787) 2024-10-07 12:39:59 -04:00
Alexander Zinoviev
477a677317
Fix typos in test/unit/conv/cache_testbed_output.h (#1652)
Co-authored-by: Alexander Zinoviev <azinoviev@tesla.com>
2024-10-07 12:39:11 -04:00
Wilber
b27c49e84a
Fix cute doc (#1529) 2024-10-07 12:38:32 -04:00
Junkai-Wu
e2b0789927
Add some can implement rules of hopper convolution. (#1835) 2024-09-25 11:28:10 -04:00
Wenlei Bao
44dae8b90e
Adjust profiler space for SM89 (#1553) 2024-09-19 11:40:30 -04:00
reed
2991ce18d3
Add print_svg for mma (#1733)
* add print_svg for mma

* correct the code indentation
2024-09-18 10:37:24 -04:00
Chenggang Zhao
1ebda1ccef
Fix MMA promotion interval assertions (#1641) 2024-09-16 12:38:42 -04:00
reed
9f68995de5
add publication: ‘EVT: Accelerating Deep Learning Training with Epilogue Visitor Tree’ (#1526)
Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
2024-09-16 11:55:09 -04:00
John Shumway
3a8c01a18b
Prefix a member template name with the template keyword. (#1796)
Fixes LLVM build error.
2024-09-11 13:33:56 -04:00
Junkai-Wu
dbdae514e0
Support for TMA Epilogue for Group Gemm and add pingpong ptr array & Group Gemm (#1795) 2024-09-11 00:07:31 -04:00
Sean Xiaowen Zhang
21d0534167
fix assertion (#1790) 2024-09-09 14:05:27 -04:00
Tri Dao
323c8170bf
Support ComputeFn where output type differs from input type (#1771)
This is useful for, e.g., a function that takes two float inputs and turns them into a complex number.
2024-09-05 23:25:03 -04:00
Gabriel Wu
82f5075946
set_slice3x3 -> set_slice_3x3 (#1784) 2024-09-05 23:24:10 -04:00
Saagar Jha
06e337758d
Remove extraneous comma in declaration (#1776) 2024-09-05 17:14:15 -04:00
JiayuSun
7369adcaca
Add Sm90LinCombPerColBias (#1774)
Co-authored-by: Jiayu Sun <jiayus@s4124-0071.nvidia.com>
2024-09-04 15:11:24 -04:00
Alchan Kim
6c3044136b
Update barrier.h (#1782) 2024-09-04 14:52:11 -04:00
Aleksandar Samardžić
e1976daacc
Add support for mixed 4-bit/8-bit data types GEMM (#1413)
* Add support for mixed 4-bit/8-bit data types GEMM

* fix ( and )

---------

Co-authored-by: Aleksandar Samardžić <asamardzic@matf.bg.ac.rs>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
2024-08-29 23:11:06 -04:00
Shreya Gaur
f7b19de32c
minor fix for a double quote in CMakeLists.txt (#1727) 2024-08-19 22:21:42 -04:00
shunfan-shao
4dbf5dbed2
Use CUDA runtime API to retrieve function pointer to driver API (#1700)
* Query pfn to driver api

* use default for older toolkits

---------

Co-authored-by: shunfans <shunfans@nvidia.com>
2024-08-19 13:26:09 -04:00
Dustyn Blasig
f93a69134e
Merge pull request #1714 from NVIDIA/u128_div
fix uint128
2024-08-16 07:14:59 -05:00
Aleksandar Samardžić
3f084f7f3c
Add couple configs into generator.py for mixed input MM (#1350)
* Add couple configs into generator.py for mixed input MM

* change one unit test name; reenable 128x32 in the profiler

* Added U8/BF16 tests.

---------

Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
2024-08-16 00:59:29 -04:00
Haicheng Wu
b0296bf682 fix uint128 2024-08-15 21:06:01 -07:00
Dustyn Blasig
865be73a97
Merge pull request #1713 from NVIDIA/351_sparse_update
update 3.5.1 readme/changelog
2024-08-15 11:44:49 -05:00
Haicheng Wu
8d8cfdf375 update 3.5.1 readme/changelog 2024-08-14 21:12:44 -07:00
eqy
fb170439e8
Update half.h (#1709) 2024-08-14 14:59:59 -04:00
dePaul Miller
4e5a8f6853
3.5.1 plots and updated readme (#1708)
Co-authored-by: dePaul Miller <23461061+depaulmillz@users.noreply.github.com>
2024-08-12 18:55:55 -04:00