squall/vllm - vllm - Gitea: Git with a cup of tea

Author	SHA1	Message	Date
Nick Hill	e46a60aa4c	[BugFix] Fix handling of stop strings and stop token ids (#3672 )	2024-04-11 15:34:12 -07:00
Antoni Baum	1e96c3341a	Add extra punica sizes to support bigger vocabs (#4015 )	2024-04-11 22:18:57 +00:00
Dylan Hawk	95e7d4a97c	Fix echo/logprob OpenAI completion bug (#3441 ) Co-authored-by: Dylan Hawk <dylanwawk@gmail.com>	2024-04-11 22:15:50 +00:00
Antoni Baum	a10d3056da	[Core] Set `linear_weights` directly on the layer (#3977 )	2024-04-11 16:35:51 -04:00
Kunshang Ji	e9da5a40c6	[Misc] Add indirection layer for custom ops (#3913 )	2024-04-10 20:26:07 -07:00
SangBin Cho	e42df7227d	[Test] Add xformer and flash attn tests (#3961 ) Co-authored-by: Simon Mo <simon.mo@hey.com>	2024-04-11 03:09:50 +00:00
SangBin Cho	67b4221a61	[Core][5/N] Fully working chunked prefill e2e (#3884 )	2024-04-10 17:56:48 -07:00
youkaichao	63e7176f26	[Core][Refactor] move parallel_utils into vllm/distributed (#3950 ) [WIP][Core][Refactor] move vllm/model_executor/parallel_utils into vllm/distributed and vllm/device_communicators (#3950)	2024-04-10 15:33:30 -07:00
Travis Johnson	0258b7a94b	[Bugfix] handle prompt_logprobs in _apply_min_tokens_penalty (#3876 ) Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>	2024-04-10 01:39:56 -07:00
胡译文	b3104b2a10	[Bugfix] Fix logits processor when prompt_logprobs is not None (#3899 )	2024-04-10 00:09:36 -07:00
Jee Li	11dd6ebb89	[Misc] Avoid loading incorrect LoRA config (#3777 )	2024-04-09 19:47:15 -07:00
Cade Daniel	e7c7067b45	[Misc] [Core] Implement RFC "Augment BaseExecutor interfaces to enable hardware-agnostic speculative decoding" (#3837 )	2024-04-09 11:44:15 -07:00
youkaichao	95baec828f	[Core] enable out-of-tree model register (#3871 )	2024-04-06 17:11:41 -07:00
SangBin Cho	18de883489	[Chunked Prefill][4/n] Chunked prefill scheduler. (#3853 )	2024-04-05 10:17:58 -07:00
Cade Daniel	e5043a3e75	[Misc] Add pytest marker to opt-out of global test cleanup (#3863 )	2024-04-04 21:54:16 -07:00
Matthias Gerstgrasser	aabe8f40f2	[Core] [Frontend] Make detokenization optional (#3749 ) Co-authored-by: Nick Hill <nickhill@us.ibm.com>	2024-04-03 21:52:18 -07:00
Michael Feil	537ee25f43	[Core] Enable hf_transfer by default if available (#3817 )	2024-04-04 04:02:43 +00:00
Adrian Abeyta	2ff767b513	Enable scaled FP8 (e4m3fn) KV cache on ROCm (AMD GPU) (#3290 ) Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com> Co-authored-by: HaiShaw <hixiao@gmail.com> Co-authored-by: AdrianAbeyta <Adrian.Abeyta@amd.com> Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com> Co-authored-by: root <root@gt-pla-u18-08.pla.dcgpu> Co-authored-by: mawong-amd <156021403+mawong-amd@users.noreply.github.com> Co-authored-by: ttbachyinsda <ttbachyinsda@outlook.com> Co-authored-by: guofangze <guofangze@kuaishou.com> Co-authored-by: Michael Goin <mgoin64@gmail.com> Co-authored-by: jacobthebanana <50071502+jacobthebanana@users.noreply.github.com> Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>	2024-04-03 14:15:55 -07:00
SangBin Cho	3dcb3e8b98	[3/N] Refactor scheduler for chunked prefill scheduling (#3550 )	2024-04-03 14:13:49 -07:00
Cade Daniel	5757d90e26	[Speculative decoding] Adding configuration object for speculative decoding (#3706 ) Co-authored-by: Lily Liu <lilyliupku@gmail.com>	2024-04-03 00:40:57 +00:00
Cade Daniel	eb69d68804	[Misc] [CI/Build] Speed up block manager CPU-only unit tests ~10x by opting-out of GPU cleanup (#3783 )	2024-04-02 00:49:51 +00:00
Qubitium	7d4e1b85e7	[Misc] Add support for new autogptq checkpoint_format (#3689 ) Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>	2024-04-01 19:32:01 -04:00
Cade Daniel	93deb0b38f	[Speculative decoding 4/9] Lookahead scheduling for speculative decoding (#3250 )	2024-04-01 22:55:24 +00:00
Nick Hill	49782fcb76	[Misc] Some minor simplifications to detokenization logic (#3670 ) Some simplifications made for clarity. Also moves detokenization-related functions from tokenizer.py to detokenizer.py.	2024-04-01 13:22:06 -07:00
Robert Shaw	563c1d7ec5	[CI/Build] Make Marlin Tests Green (#3753 )	2024-03-30 19:18:34 -07:00
mawong-amd	b6d103542c	[Kernel] Layernorm performance optimization (#3662 )	2024-03-30 14:26:38 -07:00
Roy	f510395bbf	[BugFix][Frontend] Fix completion logprobs=0 error (#3731 )	2024-03-29 09:38:21 -07:00
Roy	6110c39dc8	[BugFix] Fix tokenizer out of vocab size (#3685 )	2024-03-29 08:18:59 -07:00
youkaichao	756b30a5f3	[Core][Test] move local_rank to the last arg with default value(#3711 ) [Core][Test] move local_rank to the last arg with default value to keep api compatible (#3711)	2024-03-28 21:19:45 -07:00
SangBin Cho	26422e477b	[Test] Make model tests run again and remove --forked from pytest (#3631 ) Co-authored-by: Simon Mo <simon.mo@hey.com>	2024-03-28 21:06:40 -07:00
Roy	515386ef3c	[Core] Support multi-node inference(eager and cuda graph) (#3686 )	2024-03-28 15:01:55 -07:00
SangBin Cho	b51c1cc9d2	[2/N] Chunked prefill data update (#3538 )	2024-03-28 10:06:01 -07:00
Cade Daniel	14ccd94c89	[Core][Bugfix]Refactor block manager for better testability (#3492 )	2024-03-27 23:59:28 -07:00
Roger Wang	45b6ef6513	feat(benchmarks): Add Prefix Caching Benchmark to Serving Benchmark (#3277 )	2024-03-27 13:39:26 -07:00
youkaichao	8f44facddd	[Core] remove cupy dependency (#3625 )	2024-03-27 00:33:26 -07:00
Jee Li	566b57c5c4	[Kernel] support non-zero cuda devices in punica kernels (#3636 )	2024-03-27 00:37:42 +00:00
Jee Li	8af890a865	Enable more models to inference based on LoRA (#3382 ) Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>	2024-03-25 18:09:31 -07:00
Nick Hill	dfeb2ecc3a	[Misc] Include matched stop string/token in responses (#2976 ) Co-authored-by: Sahil Suneja <sahilsuneja@gmail.com>	2024-03-25 17:31:32 -07:00
xwjiang2010	64172a976c	[Feature] Add vision language model support. (#3042 )	2024-03-25 14:16:30 -07:00
Simon Mo	f408d05c52	hotfix isort on logprobs ranks pr (#3622 )	2024-03-25 11:55:46 -07:00
Dylan Hawk	0b4997e05c	[Bugfix] API stream returning two stops (#3450 ) Co-authored-by: Dylan Hawk <dylanwawk@gmail.com>	2024-03-25 10:14:34 -07:00
Travis Johnson	c13ad1b7bd	feat: implement the min_tokens sampling parameter (#3124 ) Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com> Co-authored-by: Nick Hill <nickhill@us.ibm.com>	2024-03-25 10:14:26 -07:00
Swapnil Parekh	819924e749	[Core] Adding token ranks along with logprobs (#3516 ) Co-authored-by: Swapnil Parekh <swapnilp@ibm.com>	2024-03-25 10:13:10 -07:00
SangBin Cho	01bfb22b41	[CI] Try introducing isort. (#3495 )	2024-03-25 07:59:47 -07:00
Woosuk Kwon	925f3332ca	[Core] Refactor Attention Take 2 (#3462 )	2024-03-25 04:39:33 +00:00
youkaichao	837e185142	[CI/Build] fix flaky test (#3602 )	2024-03-24 17:43:05 -07:00
youkaichao	8b268a46a7	[CI] typo fix: is_hip --> is_hip() (#3595 )	2024-03-24 16:03:06 -07:00
Nick Hill	41deac4a3d	[BugFix] 1D query fix for MoE models (#3597 )	2024-03-24 16:00:16 -07:00
Antoni Baum	bfdb1ba5c3	[Core] Improve detokenization performance for prefill (#3469 ) Co-authored-by: MeloYang <meloyang05@gmail.com>	2024-03-22 13:44:12 -07:00
Thomas Parnell	cf2f084d56	Dynamic scheduler delay to improve ITL performance (#3279 ) Co-authored-by: Jan van Lunteren <jvl@zurich.ibm.com>	2024-03-22 12:28:14 -07:00
Zhuohan Li	e90fc21f2e	[Hardware][Neuron] Refactor neuron support (#3471 )	2024-03-22 01:22:17 +00:00
Roy	ea5f14e6ff	[Bugfix][Model] Fix Qwen2 (#3554 )	2024-03-22 00:18:58 +00:00
Roy	f1c0fc3919	Migrate `logits` computation and gather to `model_runner` (#3233 )	2024-03-20 23:25:01 +00:00
SangBin Cho	6e435de766	[1/n][Chunked Prefill] Refactor input query shapes (#3236 )	2024-03-20 14:46:05 -07:00
Antoni Baum	426ec4ec67	[1/n] Triton sampling kernel (#3186 ) Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>	2024-03-20 14:45:08 -07:00
Woosuk Kwon	5ee14494e4	[Misc] Remove cache stream and cache events (#3461 )	2024-03-20 00:38:53 -07:00
ElizaWszola	9474e89ba4	[PREFIX CACHING FOLLOW UP] A bunch of fixes to block allocator performance when automatic prefix caching is disabled (#3357 ) Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>	2024-03-20 00:11:11 -07:00
Robert Shaw	097aa0ea22	[CI/Build] Fix Bad Import In Test (#3473 )	2024-03-18 20:28:00 +00:00
Simon Mo	120157fd2a	Support arbitrary json_object in OpenAI and Context Free Grammar (#3211 )	2024-03-16 13:35:27 -07:00
simon-mo	ad50bf4b25	fix lint	2024-03-15 22:23:38 -07:00
Tao He	3123f15138	Fixes the incorrect argument in the prefix-prefill test cases (#3246 )	2024-03-15 20:58:10 -07:00
Antoni Baum	fb96c1e98c	Asynchronous tokenization (#2879 )	2024-03-15 23:37:01 +00:00
陈序	54be8a0be2	Fix assertion failure in Qwen 1.5 with prefix caching enabled (#3373 ) Co-authored-by: Cade Daniel <edacih@gmail.com>	2024-03-14 13:56:57 -07:00
Terry	7e9bd08f60	Add batched RoPE kernel (#3095 )	2024-03-13 13:45:26 -07:00
Or Sharir	ae0ccb4017	Add missing kernel for CodeLlama-34B on A/H100 (no tensor parallelism) when using Multi-LoRA. (#3350 )	2024-03-13 12:18:25 -07:00
Woosuk Kwon	602358f8a8	Add kernel for GeGLU with approximate GELU (#3337 )	2024-03-12 22:06:17 -07:00
Breno Faria	49a3c8662b	Fixes #1556 double free (#3347 )	2024-03-13 00:30:08 +00:00
Zhuohan Li	4c922709b6	Add distributed model executor abstraction (#3191 )	2024-03-11 11:03:45 -07:00
Zhuohan Li	2f8844ba08	Re-enable the 80 char line width limit (#3305 )	2024-03-10 19:49:14 -07:00
Roy	9e8744a545	[BugFix] Fix get tokenizer when using ray (#3301 )	2024-03-10 19:17:16 -07:00
Terry	0bba88df03	Enhance lora tests with more layer and rank variations (#3243 )	2024-03-09 17:14:16 -08:00
Cade Daniel	8437bae6ef	[Speculative decoding 3/9] Worker which speculates, scores, and applies rejection sampling (#3103 )	2024-03-08 23:32:46 -08:00
ElizaWszola	b35cc93420	Fix auto prefix bug (#3239 )	2024-03-07 16:37:28 -08:00
jacobthebanana	8cbba4622c	Possible fix for conflict between Automated Prefix Caching (#2762 ) and multi-LoRA support (#1804 ) (#3263 )	2024-03-07 23:03:22 +00:00
Woosuk Kwon	2daf23ab0c	Separate attention backends (#3005 )	2024-03-07 01:45:50 -08:00
Cade Daniel	a33ce60c66	[Testing] Fix core tests (#3224 )	2024-03-06 01:04:23 -08:00
SangBin Cho	24aecf421a	[Tests] Add block manager and scheduler tests (#3108 )	2024-03-05 18:23:34 -08:00
Nick Hill	8999ec3c16	Store `eos_token_id` in `Sequence` for easy access (#3166 )	2024-03-05 15:35:43 -08:00
Antoni Baum	ff578cae54	Add health check, make async Engine more robust (#3015 ) Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>	2024-03-04 22:01:40 +00:00
Antoni Baum	22de45235c	Push logprob generation to LLMEngine (#3065 ) Co-authored-by: Avnish Narayan <avnish@anyscale.com>	2024-03-04 19:54:06 +00:00
Sage Moore	ce4f5a29fb	Add Automatic Prefix Caching (#2762 ) Co-authored-by: ElizaWszola <eliza@neuralmagic.com> Co-authored-by: Michael Goin <michael@neuralmagic.com>	2024-03-02 00:50:01 -08:00
Robert Shaw	c0c2335ce0	Integrate Marlin Kernels for Int4 GPTQ inference (#2497 ) Co-authored-by: Robert Shaw <114415538+rib-2@users.noreply.github.com> Co-authored-by: alexm <alexm@neuralmagic.com>	2024-03-01 12:47:51 -08:00
felixzhu555	703e42ee4b	Add guided decoding for OpenAI API server (#2819 ) Co-authored-by: br3no <breno@veltefaria.de> Co-authored-by: simon-mo <simon.mo@hey.com>	2024-02-29 22:13:08 +00:00
Seonghyeon	bfdcfa6a05	Support starcoder2 architecture (#3089 )	2024-02-29 00:51:48 -08:00
Woosuk Kwon	929b4f2973	Add LoRA support for Gemma (#3050 )	2024-02-28 13:03:28 -08:00
Liangfu Chen	3b7178cfa4	[Neuron] Support inference with transformers-neuronx (#2569 )	2024-02-28 09:34:34 -08:00
Tao He	71bcaf99e2	Enable GQA support in the prefix prefill kernels (#3007 ) Signed-off-by: Tao He <sighingnow@gmail.com>	2024-02-27 01:14:31 -08:00
Dylan Hawk	e0ade06d63	Support logit bias for OpenAI API (#3027 )	2024-02-27 11:51:53 +08:00
Jared Moore	70f3e8e3a1	Add LogProbs for Chat Completions in OpenAI (#2918 )	2024-02-26 10:39:34 +08:00
Harry Mellor	ef978fe411	Port metrics from `aioprometheus` to `prometheus_client` (#2730 )	2024-02-25 11:54:00 -08:00
Ronen Schaffer	4caf7044e0	Include tokens from prompt phase in `counter_generation_tokens` (#2802 )	2024-02-22 14:00:12 -08:00
Woosuk Kwon	fd5dcc5c81	Optimize GeGLU layer in Gemma (#2975 )	2024-02-21 20:17:52 -08:00
Massimiliano Pronesti	93dc5a2870	chore(vllm): codespell for spell checking (#2820 )	2024-02-21 18:56:01 -08:00
Nick Hill	7d2dcce175	Support per-request seed (#2514 )	2024-02-21 11:47:00 -08:00
Antoni Baum	017d9f1515	Add metrics to RequestOutput (#2876 )	2024-02-20 21:55:57 -08:00
Zhuohan Li	63e2a6419d	[FIX] Fix beam search test (#2930 )	2024-02-20 14:37:39 -08:00
Ronen Schaffer	e433c115bc	Fix `vllm:prompt_tokens_total` metric calculation (#2869 )	2024-02-18 23:55:41 -08:00
Isotr0py	ab3a5a8259	Support OLMo models. (#2832 )	2024-02-18 21:05:15 -08:00
Zhuohan Li	a61f0521b8	[Test] Add basic correctness test (#2908 )	2024-02-18 16:44:50 -08:00
jvmncs	8f36444c4f	multi-LoRA as extra models in OpenAI server (#2775 ) how to serve the loras (mimicking the [multilora inference example](https://github.com/vllm-project/vllm/blob/main/examples/multilora_inference.py)): ```terminal $ export LORA_PATH=~/.cache/huggingface/hub/models--yard1--llama-2-7b-sql-lora-test/ $ python -m vllm.entrypoints.api_server \ --model meta-llama/Llama-2-7b-hf \ --enable-lora \ --lora-modules sql-lora=$LORA_PATH sql-lora2=$LORA_PATH ``` the above server will list 3 separate values if the user queries `/models`: one for the base served model, and one each for the specified lora modules. in this case sql-lora and sql-lora2 point to the same underlying lora, but this need not be the case. lora config values take the same values they do in EngineArgs no work has been done here to scope client permissions to specific models	2024-02-17 12:00:48 -08:00

1 2 3 4 5

244 Commits