Lily Liu
|
fe6d09ae61
|
[Minor] More fix of test_cache.py CI test failure (#2750)
|
2024-02-06 11:38:38 -08:00 |
|
Woosuk Kwon
|
f0d4e14557
|
Add fused top-K softmax kernel for MoE (#2769)
|
2024-02-05 17:38:02 -08:00 |
|
Hongxia Yang
|
56f738ae9b
|
[ROCm] Fix some kernels failed unit tests (#2498)
|
2024-02-05 14:25:36 -08:00 |
|
Kunshang Ji
|
96b6f475dd
|
Remove hardcoded device="cuda" to support more devices (#2503)
Co-authored-by: Jiang Li <jiang1.li@intel.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
|
2024-02-01 15:46:39 -08:00 |
|
Philipp Moritz
|
d0d93b92b1
|
Add unit test for Mixtral MoE layer (#2677)
|
2024-01-31 14:34:17 -08:00 |
|
Philipp Moritz
|
89efcf1ce5
|
[Minor] Fix test_cache.py CI test failure (#2684)
|
2024-01-31 10:12:11 -08:00 |
|
Vladimir
|
4f65af0e25
|
Add swap_blocks unit tests (#2616)
|
2024-01-30 09:30:50 -08:00 |
|
wangding zeng
|
5d60def02c
|
DeepseekMoE support with Fused MoE kernel (#2453)
Co-authored-by: roy <jasonailu87@gmail.com>
|
2024-01-29 21:19:48 -08:00 |
|
zhaoyang-star
|
9090bf02e7
|
Support FP8-E5M2 KV Cache (#2279)
Co-authored-by: zhaoyang <zhao.yang16@zte.com.cn>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
|
2024-01-28 16:43:54 -08:00 |
|
Hanzhi Zhou
|
380170038e
|
Implement custom all reduce kernels (#2192)
|
2024-01-27 12:46:35 -08:00 |
|
Simon Mo
|
3a7dd7e367
|
Support Batch Completion in Server (#2529)
|
2024-01-24 17:11:07 -08:00 |
|
Nikola Borisov
|
3209b49033
|
[Bugfix] fix crash if max_tokens=None (#2570)
|
2024-01-23 22:38:55 -08:00 |
|
Antoni Baum
|
9b945daaf1
|
[Experimental] Add multi-LoRA support (#1804)
Co-authored-by: Chen Shen <scv119@gmail.com>
Co-authored-by: Shreyas Krishnaswamy <shrekris@anyscale.com>
Co-authored-by: Avnish Narayan <avnish@anyscale.com>
|
2024-01-23 15:26:37 -08:00 |
|
Jason Zhu
|
7a0b011dd5
|
Add a 1-line docstring to explain why calling context_attention_fwd twice in test_prefix_prefill.py (#2553)
|
2024-01-22 14:47:25 -08:00 |
|
Cade Daniel
|
18bfcdd05c
|
[Speculative decoding 2/9] Multi-step worker for draft model (#2424)
|
2024-01-21 16:31:47 -08:00 |
|
Zhuohan Li
|
ef9b636e2d
|
Simplify broadcast logic for control messages (#2501)
|
2024-01-19 11:23:30 -08:00 |
|
Simon Mo
|
dd7e8f5f64
|
refactor completion api for readability (#2499)
|
2024-01-18 16:45:14 -08:00 |
|
shiyi.c_98
|
d10f8e1d43
|
[Experimental] Prefix Caching Support (#1669)
Co-authored-by: DouHappy <2278958187@qq.com>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
|
2024-01-17 16:32:10 -08:00 |
|
FlorianJoncour
|
14cc317ba4
|
OpenAI Server refactoring (#2360)
|
2024-01-16 21:33:14 -08:00 |
|
Hyunsung Lee
|
e1957c6ebd
|
Add StableLM3B model (#2372)
|
2024-01-16 20:32:40 -08:00 |
|
Simon Mo
|
6e01e8c1c8
|
[CI] Add Buildkite (#2355)
|
2024-01-14 12:37:58 -08:00 |
|
陈序
|
218dc2ccda
|
Aligning top_p and top_k Sampling (#1885)
* Align top_p and top_k with huggingface
* remove _get_prompt_and_output_tokens
* rename _apply_top_p_top_k
* compare top_p top_k with hf
* fix test errors
|
2024-01-12 22:51:03 +01:00 |
|
Cade Daniel
|
79d64c4954
|
[Speculative decoding 1/9] Optimized rejection sampler (#2336)
|
2024-01-09 15:38:41 -08:00 |
|
Woosuk Kwon
|
941767127c
|
Revert the changes in test_cache (#2335)
|
2024-01-03 17:32:05 -08:00 |
|
Zhuohan Li
|
fd4ea8ef5c
|
Use NCCL instead of ray for control-plane communication to remove serialization overhead (#2221)
|
2024-01-03 11:30:22 -08:00 |
|
Jee Li
|
77af974b40
|
[FIX] Support non-zero CUDA devices in custom kernels (#1959)
|
2024-01-02 19:09:59 -08:00 |
|
Zhuohan Li
|
358c328d69
|
[BUGFIX] Fix communication test (#2285)
|
2023-12-27 17:18:11 -05:00 |
|
Zhuohan Li
|
4aaafdd289
|
[BUGFIX] Fix the path of test prompts (#2273)
|
2023-12-26 10:37:21 -08:00 |
|
Zhuohan Li
|
66b108d142
|
[BUGFIX] Fix API server test (#2270)
|
2023-12-26 10:37:06 -08:00 |
|
avideci
|
de60a3fb93
|
Added DeciLM-7b and DeciLM-7b-instruct (#2062)
|
2023-12-19 02:29:33 -08:00 |
|
Woosuk Kwon
|
f8c688d746
|
[Minor] Add Phi 2 to supported models (#2159)
|
2023-12-17 02:54:57 -08:00 |
|
Woosuk Kwon
|
f1c8520146
|
[BugFix] Fix input positions for long context with sliding window (#2088)
|
2023-12-13 12:28:13 -08:00 |
|
wbn
|
dacaf5a400
|
Replace head_mapping params with num_kv_heads to attention kernel. (#1997)
Co-authored-by: wangguoya <wangguoya@baidu.com>
Co-authored-by: Yang Zhao <zhaoyangstar@foxmail.com>
|
2023-12-10 10:12:53 -08:00 |
|
Woosuk Kwon
|
cd3aa153a4
|
Fix broken worker test (#1900)
|
2023-12-02 22:17:33 -08:00 |
|
Woosuk Kwon
|
9b294976a2
|
Add PyTorch-native implementation of custom layers (#1898)
|
2023-12-02 21:18:40 -08:00 |
|
Woosuk Kwon
|
5f09cbdb63
|
Fix broken sampler tests (#1896)
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
|
2023-12-02 16:06:17 -08:00 |
|
Adam Brusselback
|
66785cc05c
|
Support chat template and echo for chat API (#1756)
|
2023-11-30 16:43:13 -08:00 |
|
Yanming W
|
e0c6f556e8
|
[Build] Avoid building too many extensions (#1624)
|
2023-11-23 16:31:19 -08:00 |
|
Simon Mo
|
5ffc0d13a2
|
Migrate linter from pylint to ruff (#1665)
|
2023-11-20 11:58:01 -08:00 |
|
Zhuohan Li
|
20d0699d49
|
[Fix] Fix comm test (#1691)
|
2023-11-16 16:28:39 -08:00 |
|
maximzubkov
|
521b35f799
|
Support Microsoft Phi 1.5 (#1664)
|
2023-11-16 14:28:39 -08:00 |
|
Simon Mo
|
cb08cd0d75
|
[Minor] Fix duplication of ignored seq group in engine step (#1666)
|
2023-11-16 13:11:41 -08:00 |
|
Yanming W
|
8efe23f150
|
Fix input_metadata.selected_token_indices in worker prepare_inputs (#1546)
|
2023-11-08 14:19:12 -08:00 |
|
Noam Gat
|
555bdcc5a3
|
Added logits processor API to sampling params (#1469)
|
2023-11-03 14:12:15 -07:00 |
|
Cade Daniel
|
e575df33b1
|
[Small] Formatter only checks lints in changed files (#1528)
|
2023-10-31 15:39:38 -07:00 |
|
Woosuk Kwon
|
0ce8647dc5
|
Fix integer overflows in attention & cache ops (#1514)
|
2023-10-31 15:19:30 -07:00 |
|
Woosuk Kwon
|
9524867701
|
Add Mistral 7B to test_models (#1366)
|
2023-10-16 17:49:54 -07:00 |
|
Woosuk Kwon
|
d3a5bd9fb7
|
Fix sampler test (#1379)
|
2023-10-16 12:57:26 -07:00 |
|
Zhuohan Li
|
9d9072a069
|
Implement prompt logprobs & Batched topk for computing logprobs (#1328)
Co-authored-by: Yunmo Chen <16273544+wanmok@users.noreply.github.com>
|
2023-10-16 10:56:50 -07:00 |
|
Woosuk Kwon
|
928de46888
|
Implement PagedAttention V2 (#1348)
|
2023-10-16 00:59:57 -07:00 |
|