| Author | Commit | Message | Date |
|--------|--------|---------|------|
| mawong-amd | b6d103542c | [Kernel] Layernorm performance optimization (#3662) | 2024-03-30 14:26:38 -07:00 |
| Roger Wang | 45b6ef6513 | feat(benchmarks): Add Prefix Caching Benchmark to Serving Benchmark (#3277) | 2024-03-27 13:39:26 -07:00 |
| SangBin Cho | 01bfb22b41 | [CI] Try introducing isort. (#3495) | 2024-03-25 07:59:47 -07:00 |
| Woosuk Kwon | 925f3332ca | [Core] Refactor Attention Take 2 (#3462) | 2024-03-25 04:39:33 +00:00 |
| youkaichao | 8b268a46a7 | [CI] typo fix: is_hip --> is_hip() (#3595) | 2024-03-24 16:03:06 -07:00 |
| Nick Hill | 41deac4a3d | [BugFix] 1D query fix for MoE models (#3597) | 2024-03-24 16:00:16 -07:00 |
| Antoni Baum | 426ec4ec67 | [1/n] Triton sampling kernel (#3186) (Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>) | 2024-03-20 14:45:08 -07:00 |
| simon-mo | ad50bf4b25 | fix lint | 2024-03-15 22:23:38 -07:00 |
| Tao He | 3123f15138 | Fixes the incorrect argument in the prefix-prefill test cases (#3246) | 2024-03-15 20:58:10 -07:00 |
| Terry | 7e9bd08f60 | Add batched RoPE kernel (#3095) | 2024-03-13 13:45:26 -07:00 |
| Woosuk Kwon | 602358f8a8 | Add kernel for GeGLU with approximate GELU (#3337) | 2024-03-12 22:06:17 -07:00 |
| Zhuohan Li | 2f8844ba08 | Re-enable the 80 char line width limit (#3305) | 2024-03-10 19:49:14 -07:00 |
| Woosuk Kwon | 2daf23ab0c | Separate attention backends (#3005) | 2024-03-07 01:45:50 -08:00 |
| Tao He | 71bcaf99e2 | Enable GQA support in the prefix prefill kernels (#3007) (Signed-off-by: Tao He <sighingnow@gmail.com>) | 2024-02-27 01:14:31 -08:00 |
| Woosuk Kwon | fd5dcc5c81 | Optimize GeGLU layer in Gemma (#2975) | 2024-02-21 20:17:52 -08:00 |
| Lily Liu | fe6d09ae61 | [Minor] More fix of test_cache.py CI test failure (#2750) | 2024-02-06 11:38:38 -08:00 |
| Woosuk Kwon | f0d4e14557 | Add fused top-K softmax kernel for MoE (#2769) | 2024-02-05 17:38:02 -08:00 |
| Hongxia Yang | 56f738ae9b | [ROCm] Fix some kernels failed unit tests (#2498) | 2024-02-05 14:25:36 -08:00 |
| Kunshang Ji | 96b6f475dd | Remove hardcoded device="cuda" to support more devices (#2503) (Co-authored-by: Jiang Li <jiang1.li@intel.com>, Kunshang Ji <kunshang.ji@intel.com>) | 2024-02-01 15:46:39 -08:00 |
| Philipp Moritz | d0d93b92b1 | Add unit test for Mixtral MoE layer (#2677) | 2024-01-31 14:34:17 -08:00 |
| Philipp Moritz | 89efcf1ce5 | [Minor] Fix test_cache.py CI test failure (#2684) | 2024-01-31 10:12:11 -08:00 |
| Vladimir | 4f65af0e25 | Add swap_blocks unit tests (#2616) | 2024-01-30 09:30:50 -08:00 |
| wangding zeng | 5d60def02c | DeepseekMoE support with Fused MoE kernel (#2453) (Co-authored-by: roy <jasonailu87@gmail.com>) | 2024-01-29 21:19:48 -08:00 |
| zhaoyang-star | 9090bf02e7 | Support FP8-E5M2 KV Cache (#2279) (Co-authored-by: zhaoyang <zhao.yang16@zte.com.cn>, Zhuohan Li <zhuohan123@gmail.com>) | 2024-01-28 16:43:54 -08:00 |
| Jason Zhu | 7a0b011dd5 | Add a 1-line docstring to explain why calling context_attention_fwd twice in test_prefix_prefill.py (#2553) | 2024-01-22 14:47:25 -08:00 |
| shiyi.c_98 | d10f8e1d43 | [Experimental] Prefix Caching Support (#1669) (Co-authored-by: DouHappy <2278958187@qq.com>, Zhuohan Li <zhuohan123@gmail.com>) | 2024-01-17 16:32:10 -08:00 |
| Simon Mo | 6e01e8c1c8 | [CI] Add Buildkite (#2355) | 2024-01-14 12:37:58 -08:00 |
| Woosuk Kwon | 941767127c | Revert the changes in test_cache (#2335) | 2024-01-03 17:32:05 -08:00 |
| Zhuohan Li | fd4ea8ef5c | Use NCCL instead of ray for control-plane communication to remove serialization overhead (#2221) | 2024-01-03 11:30:22 -08:00 |
| Jee Li | 77af974b40 | [FIX] Support non-zero CUDA devices in custom kernels (#1959) | 2024-01-02 19:09:59 -08:00 |
| wbn | dacaf5a400 | Replace head_mapping params with num_kv_heads to attention kernel. (#1997) (Co-authored-by: wangguoya <wangguoya@baidu.com>, Yang Zhao <zhaoyangstar@foxmail.com>) | 2023-12-10 10:12:53 -08:00 |
| Woosuk Kwon | 9b294976a2 | Add PyTorch-native implementation of custom layers (#1898) | 2023-12-02 21:18:40 -08:00 |
| Yanming W | e0c6f556e8 | [Build] Avoid building too many extensions (#1624) | 2023-11-23 16:31:19 -08:00 |
| Simon Mo | 5ffc0d13a2 | Migrate linter from pylint to ruff (#1665) | 2023-11-20 11:58:01 -08:00 |
| Woosuk Kwon | 0ce8647dc5 | Fix integer overflows in attention & cache ops (#1514) | 2023-10-31 15:19:30 -07:00 |
| Woosuk Kwon | 928de46888 | Implement PagedAttention V2 (#1348) | 2023-10-16 00:59:57 -07:00 |
| Zhuohan Li | ba0bfd40e2 | TP/quantization/weight loading refactor part 1 - Simplify parallel linear logic (#1181) | 2023-10-02 15:36:09 -07:00 |
| Woosuk Kwon | 6f88f762bf | Fix OOM in attention kernel test (#1223) | 2023-09-28 14:33:24 -07:00 |
| Antoni Baum | cf5cb1e33e | Allocate more shared memory to attention kernel (#1154) | 2023-09-26 22:27:13 -07:00 |
| Woosuk Kwon | e67b4f2c2a | Use FP32 in RoPE initialization (#1004) (Co-authored-by: One <imone@tuta.io>) | 2023-09-11 00:26:35 -07:00 |
| Zhuohan Li | db09d4ad83 | [FIX] Fix Alibi implementation in PagedAttention kernel (#945) (also fixes test_attention; Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>, Oliver-ss <yuansongwx@outlook.com>) | 2023-09-07 15:53:14 -07:00 |
| Woosuk Kwon | 320a622ec4 | [BugFix] Implement RoPE for GPT-J (#941) | 2023-09-06 11:54:33 +09:00 |
| Woosuk Kwon | fbd80ad409 | Clean up kernel unit tests (#938) | 2023-09-05 16:57:38 -07:00 |
| Aman Gupta Karmani | 75471386de | use flash-attn via xformers (#877) | 2023-08-29 21:52:13 -07:00 |
| Woosuk Kwon | d64bf1646c | Implement approximate GELU kernels (#828) | 2023-08-23 07:43:21 +09:00 |
| Tao Peng | d7a1c6d614 | Fix paged attention testing. (#495) (Signed-off-by: Tao Peng <jiankeng.pt@alibaba-inc.com>) | 2023-07-24 21:01:56 -07:00 |
| Song | bda41c70dd | hotfix attn alibi wo head mapping (#496) (Co-authored-by: oliveryuan <oliveryuan@basemind.com>) | 2023-07-18 11:31:48 -07:00 |
| Andre Slavescu | c894836108 | [Model] Add support for GPT-J (#226) (Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>) | 2023-07-08 17:55:16 -07:00 |
| Woosuk Kwon | e41f06702c | Add support for BLOOM (#331) | 2023-07-03 13:12:35 -07:00 |
| Zhuohan Li | d6fa1be3a8 | [Quality] Add code formatter and linter (#326) | 2023-07-03 11:31:55 -07:00 |