squall/vllm - vllm

Author	SHA1	Message	Date
Cade Daniel	ab50275111	[Speculative decoding] Support target-model logprobs (#4378 )	2024-05-03 15:52:01 -07:00
Lily Liu	43c413ec57	[Kernel] Use flashinfer for decoding (#4353 ) Co-authored-by: LiuXiaoxuanPKU <llilyliupku@gmail.com>	2024-05-03 15:51:27 -07:00
Sebastian Schoennenbeck	f8e7adda21	Fix/async chat serving (#2727 )	2024-05-03 11:04:14 -07:00
Michael Goin	7e65477e5e	[Bugfix] Allow "None" or "" to be passed to CLI for string args that default to None (#4586 )	2024-05-03 10:32:21 -07:00
SangBin Cho	3521ba4f25	[Core][Model runner refactoring 1/N] Refactor attn metadata term (#4518 )	2024-05-03 10:20:12 -07:00
youkaichao	2d7bce9cd5	[Doc] add env vars to the doc (#4572 )	2024-05-03 05:13:49 +00:00
DefTruth	ce3f1eedf8	[Misc] remove chunk detected debug logs (#4571 )	2024-05-03 04:48:08 +00:00
Yang, Bo	808632d3b4	[BugFix] Prevent the task of `_force_log` from being garbage collected (#4567 )	2024-05-03 01:35:18 +00:00
youkaichao	344a5d0c33	[Core][Distributed] enable allreduce for multiple tp groups (#4566 )	2024-05-02 17:32:33 -07:00
SangBin Cho	0f8a91401c	[Core] Ignore infeasible swap requests. (#4557 )	2024-05-02 14:31:20 -07:00
Alexei-V-Ivanov-AMD	9b5c9f9484	[CI/Build] AMD CI pipeline with extended set of tests. (#4267 ) Co-authored-by: simon-mo <simon.mo@hey.com>	2024-05-02 12:29:07 -07:00
Michał Moskal	32881f3f31	[kernel] fix sliding window in prefix prefill Triton kernel (#4405 ) Co-authored-by: SangBin Cho <rkooo567@gmail.com>	2024-05-02 11:23:37 -07:00
youkaichao	5b8a7c1cb0	[Misc] centralize all usage of environment variables (#4548 )	2024-05-02 11:13:25 -07:00
Mark McLoughlin	1ff0c73a79	[BugFix] Include target-device specific requirements.txt in sdist (#4559 )	2024-05-02 10:52:51 -07:00
Hu Dong	5ad60b0cbd	[Misc] Exclude the `tests` directory from being packaged (#4552 )	2024-05-02 10:50:25 -07:00
SangBin Cho	fb087af52e	[mypy][7/N] Cover all directories (#4555 )	2024-05-02 10:47:41 -07:00
alexm-nm	7038e8b803	[Kernel] Support running GPTQ 8-bit models in Marlin (#4533 )	2024-05-02 12:56:22 -04:00
youkaichao	2a85f93007	[Core][Distributed] enable multiple tp group (#4512 ) Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>	2024-05-02 04:28:21 +00:00
SangBin Cho	cf8cac8c70	[mypy][6/N] Fix all the core subdirectory typing (#4450 ) Co-authored-by: Cade Daniel <edacih@gmail.com>	2024-05-02 03:01:00 +00:00
Ronen Schaffer	5e401bce17	[CI]Add regression tests to ensure the async engine generates metrics (#4524 )	2024-05-01 19:57:12 -07:00
SangBin Cho	0d62fe58db	[Bug fix][Core] assert num_new_tokens == 1 fails when SamplingParams.n is not 1 and max_tokens is large & Add tests for preemption (#4451 )	2024-05-01 19:24:13 -07:00
Danny Guinther	b8afa8b95a	[MISC] Rework logger to enable pythonic custom logging configuration to be provided (#4273 )	2024-05-01 17:34:40 -07:00
Woosuk Kwon	826b82a260	[Misc] Fix expert_ids shape in MoE (#4517 )	2024-05-01 23:47:59 +00:00
Philipp Moritz	c9d852d601	[Misc] Remove Mixtral device="cuda" declarations (#4543 ) Remove the device="cuda" declarations in mixtral as promised in #4343	2024-05-01 16:30:52 -07:00
youkaichao	6ef09b08f8	[Core][Distributed] fix pynccl del error (#4508 )	2024-05-01 15:23:06 -07:00
Roy	3a922c1e7e	[Bugfix][Core] Fix and refactor logging stats (#4336 )	2024-05-01 20:08:14 +00:00
sasha0552	c47ba4aaa9	[Bugfix] Add validation for seed (#4529 )	2024-05-01 19:31:22 +00:00
Philipp Moritz	24bb4fe432	[Kernel] Update fused_moe tuning script for FP8 (#4457 ) This PR updates the tuning script for the fused_moe kernel to support FP8 and also adds configurations for TP4. Note that for the configuration I removed num_warps and num_stages for small batch sizes since that improved performance and brought the benchmarks on par with the numbers before in that regime to make sure this is a strict improvement over the status quo. All the numbers below are for mistralai/Mixtral-8x7B-Instruct-v0.1, 1000 input and 50 output tokens. Before this PR (with static activation scaling): qps = 1: 9.8 ms ITL, 0.49s e2e latency; qps = 2: 9.7 ms ITL, 0.49s e2e latency; qps = 4: 10.1 ms ITL, 0.52s e2e latency; qps = 6: 11.9 ms ITL, 0.59s e2e latency; qps = 8: 14.0 ms ITL, 0.70s e2e latency; qps = 10: 15.7 ms ITL, 0.79s e2e latency. After this PR (with static activation scaling): qps = 1: 9.8 ms ITL, 0.49s e2e latency; qps = 2: 9.7 ms ITL, 0.49s e2e latency; qps = 4: 10.2 ms ITL, 0.53s e2e latency; qps = 6: 11.9 ms ITL, 0.59s e2e latency; qps = 8: 11.9 ms ITL, 0.59s e2e latency; qps = 10: 12.1 ms ITL, 0.61s e2e latency.	2024-05-01 11:47:38 -07:00
Nick Hill	a657bfc48a	[Core] Add `multiproc_worker_utils` for multiprocessing-based workers (#4357 )	2024-05-01 18:41:59 +00:00
leiwen83	24750f4cad	[Core] Enable prefix caching with block manager v2 enabled (#4142 ) Co-authored-by: Lei Wen <wenlei03@qiyi.com> Co-authored-by: Sage Moore <sagemoore@utexas.edu>	2024-05-01 11:20:32 -07:00
leiwen83	b38e42fbca	[Speculative decoding] Add ngram prompt lookup decoding (#4237 ) Co-authored-by: Lei Wen <wenlei03@qiyi.com>	2024-05-01 11:13:03 -07:00
Travis Johnson	8b798eec75	[CI/Build][Bugfix] VLLM_USE_PRECOMPILED should skip compilation (#4534 ) Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>	2024-05-01 18:01:50 +00:00
sasha0552	69909126a7	[Bugfix] Use random seed if seed is -1 (#4531 )	2024-05-01 10:41:17 -07:00
Frαnçois	e491c7e053	[Doc] update(example model): for OpenAI compatible serving (#4503 )	2024-05-01 10:14:16 -07:00
Robert Shaw	4dc8026d86	[Bugfix] Fix 307 Redirect for `/metrics` (#4523 )	2024-05-01 09:14:13 -07:00
AnyISalIn	a88bb9b032	[Bugfix] Fix the fp8 kv_cache check error that occurs when failing to obtain the CUDA version. (#4173 ) Signed-off-by: AnyISalIn <anyisalin@gmail.com>	2024-05-01 09:11:03 -07:00
SangBin Cho	6f1df80436	[Test] Add ignore_eos test (#4519 )	2024-05-01 08:45:42 -04:00
Jee Li	d6f4bd7cdd	[Misc]Add customized information for models (#4132 )	2024-04-30 21:18:14 -07:00
Robert Caulk	c3845d82dc	Allow user to define whitespace pattern for outlines (#4305 )	2024-04-30 20:48:39 -07:00
Pastel！	a822eb3413	[Misc] fix typo in block manager (#4453 )	2024-04-30 20:41:32 -07:00
harrywu	f458112e8a	[Misc][Typo] type annotation fix (#4495 )	2024-04-30 20:21:39 -07:00
Nick Hill	2e240c69a9	[Core] Centralize GPU Worker construction (#4419 )	2024-05-01 01:06:34 +00:00
fuchen.ljl	ee37328da0	Unable to find Punica extension issue during source code installation (#4494 ) Co-authored-by: Simon Mo <simon.mo@hey.com>	2024-05-01 00:42:09 +00:00
fuchen.ljl	6ad58f42c5	fix_tokenizer_snapshot_download_bug (#4493 )	2024-04-30 16:38:50 -07:00
Li, Jiang	dd1a50a8bc	[Bugfix][Minor] Make ignore_eos effective (#4468 )	2024-04-30 16:33:33 -07:00
Alpay Ariyak	715c2d854d	[Frontend] [Core] Tensorizer: support dynamic `num_readers`, update version (#4467 )	2024-04-30 16:32:13 -07:00
Florian Greinacher	a494140433	[Frontend] Support complex message content for chat completions endpoint (#3467 ) Co-authored-by: Lily Liu <lilyliupku@gmail.com> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>	2024-04-30 16:28:46 -07:00
Robert Shaw	111815d482	[Kernel] Support Fp8 Checkpoints (Dynamic + Static) (#4332 ) Co-authored-by: Philipp Moritz <pcmoritz@gmail.com> Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> Co-authored-by: mgoin <michael@neuralmagic.com> Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com> Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>	2024-04-30 21:46:12 +00:00
Prashant Gupta	b31a1fb63c	[Doc] add visualization for multi-stage dockerfile (#4456 ) Signed-off-by: Prashant Gupta <prashantgupta@us.ibm.com> Co-authored-by: Roger Wang <ywang@roblox.com>	2024-04-30 17:41:59 +00:00
leiwen83	4bb53e2dde	[BugFix] fix num_lookahead_slots missing in async executor (#4165 ) Co-authored-by: Lei Wen <wenlei03@qiyi.com>	2024-04-30 10:12:59 -07:00

1267 Commits