squall/vllm - vllm - Gitea: Git with a cup of tea

Author	SHA1	Message	Date
Yangcheng Li	36fc439de0	[Doc] fix doc string typo in block_manager `swap_out` function (#10212)	2024-11-11 08:53:07 -08:00
Nicolò Lucchesi	9d43afcc53	[Feature] [Spec decode]: Combine chunked prefill with speculative decoding (#9291) Signed-off-by: NickLucche <nlucches@redhat.com>	2024-11-07 08:15:14 -08:00
Konrad Zawora	a02a50e6e5	[Hardware][Intel-Gaudi] Add Intel Gaudi (HPU) inference backend (#6143) Signed-off-by: yuwenzho <yuwen.zhou@intel.com> Signed-off-by: Chendi.Xue <chendi.xue@intel.com> Signed-off-by: Bob Zhu <bob.zhu@intel.com> Signed-off-by: zehao-intel <zehao.huang@intel.com> Signed-off-by: Konrad Zawora <kzawora@habana.ai> Co-authored-by: Kunshang Ji <kunshang.ji@intel.com> Co-authored-by: Sanju C Sudhakaran <scsudhakaran@habana.ai> Co-authored-by: Michal Adamczyk <madamczyk@habana.ai> Co-authored-by: Marceli Fylcek <mfylcek@habana.ai> Co-authored-by: Himangshu Lahkar <49579433+hlahkar@users.noreply.github.com> Co-authored-by: Vivek Goel <vgoel@habana.ai> Co-authored-by: yuwenzho <yuwen.zhou@intel.com> Co-authored-by: Dominika Olszewska <dolszewska@habana.ai> Co-authored-by: barak goldberg <149692267+bgoldberg-habana@users.noreply.github.com> Co-authored-by: Michal Szutenberg <37601244+szutenberg@users.noreply.github.com> Co-authored-by: Jan Kaniecki <jkaniecki@habana.ai> Co-authored-by: Agata Dobrzyniewicz <160237065+adobrzyniewicz-habana@users.noreply.github.com> Co-authored-by: Krzysztof Wisniewski <kwisniewski@habana.ai> Co-authored-by: Dudi Lester <160421192+dudilester@users.noreply.github.com> Co-authored-by: Ilia Taraban <tarabanil@gmail.com> Co-authored-by: Chendi.Xue <chendi.xue@intel.com> Co-authored-by: Michał Kuligowski <mkuligowski@habana.ai> Co-authored-by: Jakub Maksymczuk <jmaksymczuk@habana.ai> Co-authored-by: Tomasz Zielinski <85164140+tzielinski-habana@users.noreply.github.com> Co-authored-by: Sun Choi <schoi@habana.ai> Co-authored-by: Iryna Boiko <iboiko@habana.ai> Co-authored-by: Bob Zhu <41610754+czhu15@users.noreply.github.com> Co-authored-by: hlin99 <73271530+hlin99@users.noreply.github.com> Co-authored-by: Zehao Huang <zehao.huang@intel.com> Co-authored-by: Andrzej Kotłowski <Andrzej.Kotlowski@intel.com> Co-authored-by: Yan Tomsinsky <73292515+Yantom1@users.noreply.github.com> Co-authored-by: Nir David <ndavid@habana.ai> Co-authored-by: Yu-Zhou <yu.zhou@intel.com> Co-authored-by: Ruheena Suhani Shaik <rsshaik@habana.ai> Co-authored-by: Karol Damaszke <kdamaszke@habana.ai> Co-authored-by: Marcin Swiniarski <mswiniarski@habana.ai> Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> Co-authored-by: Jacek Czaja <jacek.czaja@intel.com> Co-authored-by: Jacek Czaja <jczaja@habana.ai> Co-authored-by: Yuan <yuan.zhou@outlook.com>	2024-11-06 01:09:10 -08:00
Aaron Pham	21063c11c7	[CI/Build] drop support for Python 3.8 EOL (#8464) Signed-off-by: Aaron Pham <contact@aarnphm.xyz>	2024-11-06 07:11:55 +00:00
Peter Salas	6c0b7f548d	[Core][VLM] Add precise multi-modal placeholder tracking (#8346) Signed-off-by: Peter Salas <peter@fixie.ai>	2024-11-01 16:21:10 -07:00
André Jonasson	4581d2cc02	[Core] Refactor: Clean up unused argument in Scheduler._preempt (#9696) Signed-off-by: André Jonasson <andre.jonasson@gmail.com>	2024-11-01 11:41:38 -07:00
youkaichao	4fdc581f9e	[core] simplify seq group code (#9569) Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>	2024-10-24 00:16:44 -07:00
Kuntai Du	ca30c3c84b	[Core] Remove evictor_v1 (#9572)	2024-10-22 04:55:49 +00:00
Cyrus Leung	051eaf6db3	[Model] Add user-configurable task for models that support both generation and embedding (#9424)	2024-10-18 11:31:58 -07:00
Kuntai Du	81ede99ca4	[Core] Deprecating block manager v1 and make block manager v2 default (#8704) Removing the block manager v1. This is the initial piece of prefix-caching-centric design. In order to achieve prefix-caching-centric design, we need to simplify the code path so that we only use v2 block manager (which has much higher performance on prefix caching).	2024-10-17 11:38:15 -05:00
Russell Bryant	776dbd74f1	[CI/Build] mypy: Resolve some errors from checking vllm/engine (#9267) Signed-off-by: Russell Bryant <rbryant@redhat.com>	2024-10-16 22:55:59 +00:00
homeffjy	1a1823871d	[Doc] Remove outdated comment to avoid misunderstanding (#9287)	2024-10-11 18:02:03 +00:00
Tyler Michael Smith	7342a7d7f8	[Model] Support Mamba (#6484)	2024-10-11 15:40:06 +00:00
youkaichao	cbc2ef5529	[misc] hide best_of from engine (#9261) Co-authored-by: Brendan Wong <bjwpokemon@gmail.com>	2024-10-10 21:30:44 -07:00
Alex Brooks	a3691b6b5e	[Core][Frontend] Add Support for Inference Time mm_processor_kwargs (#9131) Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>	2024-10-08 14:12:56 +00:00
youkaichao	fa45513a51	[misc] fix comment and variable name (#9139)	2024-10-07 16:07:05 -07:00
youkaichao	18b296fdb2	[core] remove beam search from the core (#9105)	2024-10-07 05:47:04 +00:00
sroy745	c8f26bb636	[BugFix][Core] Fix BlockManagerV2 when Encoder Input is None (#9103)	2024-10-07 03:52:42 +00:00
afeldman-nm	563649aafe	[Core] Combined support for multi-step scheduling, chunked prefill & prefix caching (#8804) Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com> Co-authored-by: Andrew Feldman <afeld2012@gmail.com>	2024-10-02 07:52:20 +00:00
juncheoll	1fb9c1b0bf	[Misc] Fix typo in BlockSpaceManagerV1 (#8944)	2024-09-29 15:05:54 +00:00
sroy745	5bf8789b2a	[Bugfix] Block manager v2 with preemption and lookahead slots (#8824)	2024-09-29 09:17:45 +08:00
Varun Sundar Rabindranath	c2ec430ab5	[Core] Multi-Step + Single Step Prefills via Chunked Prefill code path (#8378) Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>	2024-09-27 13:32:07 -07:00
Woo-Yeon Lee	8fae5ed7f6	[Misc] Fix minor typo in scheduler (#8765)	2024-09-25 00:53:03 -07:00
Archit Patke	6da1ab6b41	[Core] Adding Priority Scheduling (#5958)	2024-09-24 19:50:50 -07:00
Aaron Pham	9d104b5beb	[CI/Build] Update Ruff version (#8469) Signed-off-by: Aaron Pham <contact@aarnphm.xyz> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>	2024-09-18 11:00:56 +00:00
Alexander Matveev	4ef41b8476	[Bugfix] Fix async postprocessor in case of preemption (#8267)	2024-09-07 21:01:51 -07:00
wang.yuqi	6e36f4fa6c	improve chunked prefill performance [Bugfix] Fix #7592 vllm 0.5.4 enable_chunked_prefill throughput is slightly lower than 0.5.3~0.5.0. (#7874)	2024-09-02 14:20:12 -07:00
Alexander Matveev	3f60f2244e	[Core] Combine async postprocessor and multi-step (#7921)	2024-08-29 11:18:26 -07:00
Cody Yu	e3580537a4	[Performance] Enable chunked prefill and prefix caching together (#7753)	2024-08-28 00:36:31 -07:00
Alexander Matveev	f508e03e7f	[Core] Async_output_proc: Add virtual engine support (towards pipeline parallel) (#7911)	2024-08-28 00:02:30 -07:00
youkaichao	bc6e42a9b1	[hardware][rocm] allow rocm to override default env var (#7926)	2024-08-27 19:50:06 -07:00
Jonathan Berkhahn	9c71c97ae2	[mypy] Enable mypy type checking for `vllm/core` (#7229)	2024-08-28 07:11:14 +08:00
Megha Agarwal	2eedede875	[Core] Asynchronous Output Processor (#7049) Co-authored-by: Alexander Matveev <alexm@neuralmagic.com>	2024-08-26 20:53:20 -07:00
Cody Yu	2deb029d11	[Performance][BlockManagerV2] Mark prefix cache block as computed after schedule (#7822)	2024-08-26 11:24:53 -07:00
Cody Yu	3ac50b47d0	[MISC] Add prefix cache hit rate to metrics (#7606)	2024-08-19 11:52:07 -07:00
SangBin Cho	ff7ec82c4d	[Core] Optimize SPMD architecture with delta + serialization optimization (#7109)	2024-08-18 17:57:20 -07:00
Mahesh Keralapura	93478b63d2	[Core] Fix tracking of model forward time in case of PP>1 (#7440) [Core] Fix tracking of model forward time to the span traces in case of PP>1 (#7440)	2024-08-16 13:46:01 -07:00
William Lin	2ecf7b1757	[core] [3/N] multi-step args and sequence.py (#7452)	2024-08-14 12:32:45 -07:00
Cade Daniel	baa240252e	[Core] Fix edge case in chunked prefill + block manager v2 (#7380)	2024-08-09 23:48:49 +00:00
Mahesh Keralapura	933790c209	[Core] Add span metrics for model_forward, scheduler and sampler time (#7089)	2024-08-09 13:55:13 -07:00
Alexander Matveev	fc7b8d1eef	[Performance] e2e overheads reduction: Small followup diff (#7364)	2024-08-09 15:49:36 +00:00
Alexander Matveev	e02ac55617	[Performance] Optimize e2e overheads: Reduce python allocations (#7162)	2024-08-08 21:34:28 -07:00
Zach Zheng	782e53ab59	[Bugfix][fast] Fix the get_num_blocks_touched logic (#6849)	2024-08-08 10:43:30 -07:00
Rui Qiao	746709642c	[Misc] Fix typos in scheduler.py (#7285) Signed-off-by: Rui Qiao <ruisearch42@gmail.com>	2024-08-07 17:06:01 -07:00
afeldman-nm	fd95e026e0	[Core] Subclass ModelRunner to support cross-attention & encoder sequences (towards eventual encoder/decoder model support) (#4942) Co-authored-by: Andrew Feldman <afeld2012@gmail.com> Co-authored-by: Nick Hill <nickhill@us.ibm.com>	2024-08-06 16:51:47 -04:00
xiaobochen123	660470e5a3	[Core] Optimize evictor-v2 performance (#7193)	2024-08-06 12:34:25 -07:00
Woosuk Kwon	6ce01f3066	[Performance] Optimize `get_seqs` (#7051)	2024-08-01 18:29:52 -07:00
youkaichao	c8a7e93273	[core][scheduler] simplify and improve scheduler (#6867)	2024-07-31 23:51:09 -07:00
youkaichao	6ca8031e71	[core][misc] improve free_finished_seq_groups (#6865) Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>	2024-07-30 14:32:12 -07:00
Nick Hill	5cf9254a9c	[BugFix] Fix use of per-request seed with pipeline parallel (#6698)	2024-07-30 10:40:08 -07:00

151 Commits