Commit Graph

194 Commits

Author SHA1 Message Date
Zhuohan Li
e90fc21f2e
[Hardware][Neuron] Refactor neuron support (#3471) 2024-03-22 01:22:17 +00:00
Roy
ea5f14e6ff
[Bugfix][Model] Fix Qwen2 (#3554) 2024-03-22 00:18:58 +00:00
Roy
f1c0fc3919
Migrate logits computation and gather to model_runner (#3233) 2024-03-20 23:25:01 +00:00
SangBin Cho
6e435de766
[1/n][Chunked Prefill] Refactor input query shapes (#3236) 2024-03-20 14:46:05 -07:00
Antoni Baum
426ec4ec67
[1/n] Triton sampling kernel (#3186)
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
2024-03-20 14:45:08 -07:00
Woosuk Kwon
5ee14494e4
[Misc] Remove cache stream and cache events (#3461) 2024-03-20 00:38:53 -07:00
ElizaWszola
9474e89ba4
[PREFIX CACHING FOLLOW UP] A bunch of fixes to block allocator performance when automatic prefix caching is disabled (#3357)
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2024-03-20 00:11:11 -07:00
Robert Shaw
097aa0ea22
[CI/Build] Fix Bad Import In Test (#3473) 2024-03-18 20:28:00 +00:00
Simon Mo
120157fd2a
Support arbitrary json_object in OpenAI and Context Free Grammar (#3211) 2024-03-16 13:35:27 -07:00
simon-mo
ad50bf4b25 fix lint 2024-03-15 22:23:38 -07:00
Tao He
3123f15138
Fixes the incorrect argument in the prefix-prefill test cases (#3246) 2024-03-15 20:58:10 -07:00
Antoni Baum
fb96c1e98c
Asynchronous tokenization (#2879) 2024-03-15 23:37:01 +00:00
陈序
54be8a0be2
Fix assertion failure in Qwen 1.5 with prefix caching enabled (#3373)
Co-authored-by: Cade Daniel <edacih@gmail.com>
2024-03-14 13:56:57 -07:00
Terry
7e9bd08f60
Add batched RoPE kernel (#3095) 2024-03-13 13:45:26 -07:00
Or Sharir
ae0ccb4017
Add missing kernel for CodeLlama-34B on A/H100 (no tensor parallelism) when using Multi-LoRA. (#3350) 2024-03-13 12:18:25 -07:00
Woosuk Kwon
602358f8a8
Add kernel for GeGLU with approximate GELU (#3337) 2024-03-12 22:06:17 -07:00
Breno Faria
49a3c8662b
Fixes #1556 double free (#3347) 2024-03-13 00:30:08 +00:00
Zhuohan Li
4c922709b6
Add distributed model executor abstraction (#3191) 2024-03-11 11:03:45 -07:00
Zhuohan Li
2f8844ba08
Re-enable the 80 char line width limit (#3305) 2024-03-10 19:49:14 -07:00
Roy
9e8744a545
[BugFix] Fix get tokenizer when using ray (#3301) 2024-03-10 19:17:16 -07:00
Terry
0bba88df03
Enhance lora tests with more layer and rank variations (#3243) 2024-03-09 17:14:16 -08:00
Cade Daniel
8437bae6ef
[Speculative decoding 3/9] Worker which speculates, scores, and applies rejection sampling (#3103) 2024-03-08 23:32:46 -08:00
ElizaWszola
b35cc93420
Fix auto prefix bug (#3239) 2024-03-07 16:37:28 -08:00
jacobthebanana
8cbba4622c
Possible fix for conflict between Automated Prefix Caching (#2762) and multi-LoRA support (#1804) (#3263) 2024-03-07 23:03:22 +00:00
Woosuk Kwon
2daf23ab0c
Separate attention backends (#3005) 2024-03-07 01:45:50 -08:00
Cade Daniel
a33ce60c66
[Testing] Fix core tests (#3224) 2024-03-06 01:04:23 -08:00
SangBin Cho
24aecf421a
[Tests] Add block manager and scheduler tests (#3108) 2024-03-05 18:23:34 -08:00
Nick Hill
8999ec3c16
Store eos_token_id in Sequence for easy access (#3166) 2024-03-05 15:35:43 -08:00
Antoni Baum
ff578cae54
Add health check, make async Engine more robust (#3015)
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2024-03-04 22:01:40 +00:00
Antoni Baum
22de45235c
Push logprob generation to LLMEngine (#3065)
Co-authored-by: Avnish Narayan <avnish@anyscale.com>
2024-03-04 19:54:06 +00:00
Sage Moore
ce4f5a29fb
Add Automatic Prefix Caching (#2762)
Co-authored-by: ElizaWszola <eliza@neuralmagic.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-03-02 00:50:01 -08:00
Robert Shaw
c0c2335ce0
Integrate Marlin Kernels for Int4 GPTQ inference (#2497)
Co-authored-by: Robert Shaw <114415538+rib-2@users.noreply.github.com>
Co-authored-by: alexm <alexm@neuralmagic.com>
2024-03-01 12:47:51 -08:00
felixzhu555
703e42ee4b
Add guided decoding for OpenAI API server (#2819)
Co-authored-by: br3no <breno@veltefaria.de>
Co-authored-by: simon-mo <simon.mo@hey.com>
2024-02-29 22:13:08 +00:00
Seonghyeon
bfdcfa6a05
Support starcoder2 architecture (#3089) 2024-02-29 00:51:48 -08:00
Woosuk Kwon
929b4f2973
Add LoRA support for Gemma (#3050) 2024-02-28 13:03:28 -08:00
Liangfu Chen
3b7178cfa4
[Neuron] Support inference with transformers-neuronx (#2569) 2024-02-28 09:34:34 -08:00
Tao He
71bcaf99e2
Enable GQA support in the prefix prefill kernels (#3007)
Signed-off-by: Tao He <sighingnow@gmail.com>
2024-02-27 01:14:31 -08:00
Dylan Hawk
e0ade06d63
Support logit bias for OpenAI API (#3027) 2024-02-27 11:51:53 +08:00
Jared Moore
70f3e8e3a1
Add LogProbs for Chat Completions in OpenAI (#2918) 2024-02-26 10:39:34 +08:00
Harry Mellor
ef978fe411
Port metrics from aioprometheus to prometheus_client (#2730) 2024-02-25 11:54:00 -08:00
Ronen Schaffer
4caf7044e0
Include tokens from prompt phase in counter_generation_tokens (#2802) 2024-02-22 14:00:12 -08:00
Woosuk Kwon
fd5dcc5c81
Optimize GeGLU layer in Gemma (#2975) 2024-02-21 20:17:52 -08:00
Massimiliano Pronesti
93dc5a2870
chore(vllm): codespell for spell checking (#2820) 2024-02-21 18:56:01 -08:00
Nick Hill
7d2dcce175
Support per-request seed (#2514) 2024-02-21 11:47:00 -08:00
Antoni Baum
017d9f1515
Add metrics to RequestOutput (#2876) 2024-02-20 21:55:57 -08:00
Zhuohan Li
63e2a6419d
[FIX] Fix beam search test (#2930) 2024-02-20 14:37:39 -08:00
Ronen Schaffer
e433c115bc
Fix vllm:prompt_tokens_total metric calculation (#2869) 2024-02-18 23:55:41 -08:00
Isotr0py
ab3a5a8259
Support OLMo models. (#2832) 2024-02-18 21:05:15 -08:00
Zhuohan Li
a61f0521b8
[Test] Add basic correctness test (#2908) 2024-02-18 16:44:50 -08:00
jvmncs
8f36444c4f
multi-LoRA as extra models in OpenAI server (#2775)
how to serve the loras (mimicking the [multilora inference example](https://github.com/vllm-project/vllm/blob/main/examples/multilora_inference.py)):
```terminal
$ export LORA_PATH=~/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/
$ python -m vllm.entrypoints.api_server \
 --model meta-llama/Llama-2-7b-hf \
 --enable-lora \
 --lora-modules sql-lora=$LORA_PATH sql-lora2=$LORA_PATH
```
the above server will list 3 separate values if the user queries `/models`: one for the base served model, and one each for the specified lora modules. in this case sql-lora and sql-lora2 point to the same underlying lora, but this need not be the case. lora config values take the same values they do in EngineArgs

no work has been done here to scope client permissions to specific models
2024-02-17 12:00:48 -08:00
Woosuk Kwon
d7afab6d3a
[BugFix] Fix GC bug for LLM class (#2882) 2024-02-14 22:17:44 -08:00
Terry
2a543d6efe
Add LoRA support for Mixtral (#2831)
* add mixtral lora support

* formatting

* fix incorrectly ported logic

* polish tests

* minor fixes and refactoring

* minor fixes

* formatting

* rename and remove redundant logic

* refactoring

* refactoring

* minor fix

* minor refactoring

* fix code smell
2024-02-14 00:55:45 +01:00
Lily Liu
fe6d09ae61
[Minor] More fix of test_cache.py CI test failure (#2750) 2024-02-06 11:38:38 -08:00
Woosuk Kwon
f0d4e14557
Add fused top-K softmax kernel for MoE (#2769) 2024-02-05 17:38:02 -08:00
Hongxia Yang
56f738ae9b
[ROCm] Fix some kernels failed unit tests (#2498) 2024-02-05 14:25:36 -08:00
Kunshang Ji
96b6f475dd
Remove hardcoded device="cuda" to support more devices (#2503)
Co-authored-by: Jiang Li <jiang1.li@intel.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
2024-02-01 15:46:39 -08:00
Philipp Moritz
d0d93b92b1
Add unit test for Mixtral MoE layer (#2677) 2024-01-31 14:34:17 -08:00
Philipp Moritz
89efcf1ce5
[Minor] Fix test_cache.py CI test failure (#2684) 2024-01-31 10:12:11 -08:00
Vladimir
4f65af0e25
Add swap_blocks unit tests (#2616) 2024-01-30 09:30:50 -08:00
wangding zeng
5d60def02c
DeepseekMoE support with Fused MoE kernel (#2453)
Co-authored-by: roy <jasonailu87@gmail.com>
2024-01-29 21:19:48 -08:00
zhaoyang-star
9090bf02e7
Support FP8-E5M2 KV Cache (#2279)
Co-authored-by: zhaoyang <zhao.yang16@zte.com.cn>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2024-01-28 16:43:54 -08:00
Hanzhi Zhou
380170038e
Implement custom all reduce kernels (#2192) 2024-01-27 12:46:35 -08:00
Simon Mo
3a7dd7e367
Support Batch Completion in Server (#2529) 2024-01-24 17:11:07 -08:00
Nikola Borisov
3209b49033
[Bugfix] fix crash if max_tokens=None (#2570) 2024-01-23 22:38:55 -08:00
Antoni Baum
9b945daaf1
[Experimental] Add multi-LoRA support (#1804)
Co-authored-by: Chen Shen <scv119@gmail.com>
Co-authored-by: Shreyas Krishnaswamy <shrekris@anyscale.com>
Co-authored-by: Avnish Narayan <avnish@anyscale.com>
2024-01-23 15:26:37 -08:00
Jason Zhu
7a0b011dd5
Add a 1-line docstring to explain why calling context_attention_fwd twice in test_prefix_prefill.py (#2553) 2024-01-22 14:47:25 -08:00
Cade Daniel
18bfcdd05c
[Speculative decoding 2/9] Multi-step worker for draft model (#2424) 2024-01-21 16:31:47 -08:00
Zhuohan Li
ef9b636e2d
Simplify broadcast logic for control messages (#2501) 2024-01-19 11:23:30 -08:00
Simon Mo
dd7e8f5f64
refactor complemention api for readability (#2499) 2024-01-18 16:45:14 -08:00
shiyi.c_98
d10f8e1d43
[Experimental] Prefix Caching Support (#1669)
Co-authored-by: DouHappy <2278958187@qq.com>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2024-01-17 16:32:10 -08:00
FlorianJoncour
14cc317ba4
OpenAI Server refactoring (#2360) 2024-01-16 21:33:14 -08:00
Hyunsung Lee
e1957c6ebd
Add StableLM3B model (#2372) 2024-01-16 20:32:40 -08:00
Simon Mo
6e01e8c1c8
[CI] Add Buildkite (#2355) 2024-01-14 12:37:58 -08:00
陈序
218dc2ccda
Aligning top_p and top_k Sampling (#1885)
* Align top_p and top_k with huggingface

* remove _get_prompt_and_output_tokens

* rename _apply_top_p_top_k

* compare top_p top_k with hf

* fix test errors
2024-01-12 22:51:03 +01:00
Cade Daniel
79d64c4954
[Speculative decoding 1/9] Optimized rejection sampler (#2336) 2024-01-09 15:38:41 -08:00
Woosuk Kwon
941767127c
Revert the changes in test_cache (#2335) 2024-01-03 17:32:05 -08:00
Zhuohan Li
fd4ea8ef5c
Use NCCL instead of ray for control-plane communication to remove serialization overhead (#2221) 2024-01-03 11:30:22 -08:00
Jee Li
77af974b40
[FIX] Support non-zero CUDA devices in custom kernels (#1959) 2024-01-02 19:09:59 -08:00
Zhuohan Li
358c328d69
[BUGFIX] Fix communication test (#2285) 2023-12-27 17:18:11 -05:00
Zhuohan Li
4aaafdd289
[BUGFIX] Fix the path of test prompts (#2273) 2023-12-26 10:37:21 -08:00
Zhuohan Li
66b108d142
[BUGFIX] Fix API server test (#2270) 2023-12-26 10:37:06 -08:00
avideci
de60a3fb93
Added DeciLM-7b and DeciLM-7b-instruct (#2062) 2023-12-19 02:29:33 -08:00
Woosuk Kwon
f8c688d746
[Minor] Add Phi 2 to supported models (#2159) 2023-12-17 02:54:57 -08:00
Woosuk Kwon
f1c8520146
[BugFix] Fix input positions for long context with sliding window (#2088) 2023-12-13 12:28:13 -08:00
wbn
dacaf5a400
Replace head_mapping params with num_kv_heads to attention kernel. (#1997)
Co-authored-by: wangguoya <wangguoya@baidu.com>
Co-authored-by: Yang Zhao <zhaoyangstar@foxmail.com>
2023-12-10 10:12:53 -08:00
Woosuk Kwon
cd3aa153a4
Fix broken worker test (#1900) 2023-12-02 22:17:33 -08:00
Woosuk Kwon
9b294976a2
Add PyTorch-native implementation of custom layers (#1898) 2023-12-02 21:18:40 -08:00
Woosuk Kwon
5f09cbdb63
Fix broken sampler tests (#1896)
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2023-12-02 16:06:17 -08:00
Adam Brusselback
66785cc05c
Support chat template and echo for chat API (#1756) 2023-11-30 16:43:13 -08:00
Yanming W
e0c6f556e8
[Build] Avoid building too many extensions (#1624) 2023-11-23 16:31:19 -08:00
Simon Mo
5ffc0d13a2
Migrate linter from pylint to ruff (#1665) 2023-11-20 11:58:01 -08:00
Zhuohan Li
20d0699d49
[Fix] Fix comm test (#1691) 2023-11-16 16:28:39 -08:00
maximzubkov
521b35f799
Support Microsoft Phi 1.5 (#1664) 2023-11-16 14:28:39 -08:00
Simon Mo
cb08cd0d75
[Minor] Fix duplication of ignored seq group in engine step (#1666) 2023-11-16 13:11:41 -08:00
Yanming W
8efe23f150
Fix input_metadata.selected_token_indices in worker prepare_inputs (#1546) 2023-11-08 14:19:12 -08:00
Noam Gat
555bdcc5a3
Added logits processor API to sampling params (#1469) 2023-11-03 14:12:15 -07:00
Cade Daniel
e575df33b1
[Small] Formatter only checks lints in changed files (#1528) 2023-10-31 15:39:38 -07:00
Woosuk Kwon
0ce8647dc5
Fix integer overflows in attention & cache ops (#1514) 2023-10-31 15:19:30 -07:00
Woosuk Kwon
9524867701
Add Mistral 7B to test_models (#1366) 2023-10-16 17:49:54 -07:00
Woosuk Kwon
d3a5bd9fb7
Fix sampler test (#1379) 2023-10-16 12:57:26 -07:00