Tri Dao
|
27f8f890df
|
[FusedDense] Allocate lt_workspace on input device
|
2023-05-30 14:17:26 -07:00 |
|
Tri Dao
|
dec4f2e910
|
[FusedDense] Set workspace size to 32M for Hopper and 4M for others
|
2023-04-06 23:40:15 -07:00 |
|
Tri Dao
|
dc08ea1c33
|
Support H100 for other CUDA extensions
|
2023-03-15 16:59:27 -07:00 |
|
Tri Dao
|
88173a1aaf
|
[FusedDense] Support relu, rename FusedDenseGeluDense -> FusedMLP
|
2023-01-17 18:12:27 -08:00 |
|
Tri Dao
|
226a1b721d
|
Implement TensorParallel for FusedDense and FusedDenseGeluDense
|
2022-12-24 11:48:56 -08:00 |
|
Tri Dao
|
e68ebbe89a
|
Simplify FusedDense
|
2022-12-22 21:25:31 -08:00 |
|
Tri Dao
|
43ab0b5205
|
Mention that some CUDA extensions have only been tested on A100s
|
2022-11-15 07:10:25 -08:00 |
|
Tri Dao
|
2e33fc8e36
|
Add GPT and ViT models
|
2022-11-13 22:30:23 -08:00 |
|
Tri Dao
|
fa6d1ce44f
|
Add fused_dense and dropout_add_layernorm CUDA extensions
|
2022-11-13 21:59:20 -08:00 |
|