Commit Graph

1423 Commits

Author SHA1 Message Date
Cyrus Leung
eecd864388
[Bugfix][CI/Build] Fix test and improve code for merge_async_iterators (#5096) 2024-05-29 16:02:25 -07:00
Ronen Schaffer
ae495c74ea
[Doc] Replace deprecated flag in readme (#4526) 2024-05-29 22:26:33 +00:00
afeldman-nm
4238bc82f2
[Core] Cross-attention KV caching and memory-management (towards eventual encoder/decoder model support) (#4837) 2024-05-29 16:09:13 +00:00
youkaichao
594392d27a
[Core][Distributed] improve p2p access check (#4992) 2024-05-29 11:29:07 +00:00
Cyrus Leung
18c1f16d86
[Bugfix] Fix arguments passed to Sequence in stop checker test (#5092) 2024-05-29 07:16:41 +00:00
youkaichao
5bd3c65072
[Core][Optimization] remove vllm-nccl (#5091) 2024-05-29 05:13:52 +00:00
Marut Pandya
616e600e0b
[Misc] add gpu_memory_utilization arg (#5079)
Signed-off-by: pandyamarut <pandyamarut@gmail.com>
2024-05-28 17:16:18 -07:00
Junichi Sato
dfba529b40
[Bugfix] Remove the last EOS token unless explicitly specified (#5077) 2024-05-28 17:15:35 -07:00
Cyrus Leung
5ae5ed1e60
[Core] Consolidate prompt arguments to LLM engines (#4328)
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-05-28 13:29:31 -07:00
Simon Mo
290f4ada2b
[Docs] Add Dropbox as sponsors (#5089) 2024-05-28 10:29:09 -07:00
Divakar Verma
dd8de11f0a
[Kernel][ROCm][AMD] Add fused_moe Triton configs for MI300X (#4951)
This PR adds Triton kernel configs for the MoE kernel on MI300X.
2024-05-28 16:03:23 +00:00
Robert Shaw
9ba415588a
[BugFix] Fix Embedding Models with TP>1 (#5075) 2024-05-28 08:32:42 -07:00
Michał Moskal
d4f3985907
[Core] Sliding window for block manager v2 (#4545)
Co-authored-by: Ruth Evans <ruthevans@Ruths-MacBook-Pro.local>
2024-05-28 11:07:07 +09:00
Isotr0py
890aa93d27
[Model] Add support for falcon-11B (#5069) 2024-05-27 16:41:43 -07:00
sasha0552
fbdb7b3ee2
[Core] Allow AQLM on Pascal (#5058) 2024-05-27 15:26:14 -07:00
Zhuohan Li
1102bef219
[Bugfix / Core] Prefix Caching Guards (merged with main) (#4846)
Co-authored-by: rsnm2 <rshaw@neuralmagic.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
2024-05-27 15:18:17 -07:00
Roger Wang
f17a1a8f96
[Misc] Make Serving Benchmark More User-friendly (#5044) 2024-05-25 17:28:16 +00:00
Lily Liu
d5a1697772
[Dynamic Spec Decoding] Minor fix for disabling speculative decoding (#5000) 2024-05-25 10:00:14 -07:00
youkaichao
325c119961
[Misc] add logging level env var (#5045) 2024-05-24 23:49:49 -07:00
Eric Xihui Lin
8e192ff967
[Kernel][Backend][Model] Blocksparse flash attention kernel and Phi-3-Small model (#4799)
Co-authored-by: beagleski <yunanzhang@microsoft.com>
Co-authored-by: bapatra <bapatra@microsoft.com>
Co-authored-by: Barun Patra <codedecde@users.noreply.github.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-05-24 22:00:52 -07:00
leiwen83
e64fde4b01
[Core][Bugfix]: fix prefix caching for blockv2 (#4764)
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
2024-05-24 10:07:09 -07:00
Robert Shaw
919770957f
[Bugfix] Fix Mistral v0.3 Weight Loading (#5005)
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2024-05-24 12:28:27 +00:00
youkaichao
6a50f4cafa
[Doc] add ccache guide in doc (#5012)
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-05-23 23:21:54 +00:00
Elisei Smirnov
e3470f8753
[Core]: Option To Use Prompt Token Ids Inside Logits Processor (#4985)
Co-authored-by: Elisei Smirnov <el.smirnov@innopolis.university>
2024-05-23 22:04:24 +00:00
Dipika Sikka
a1242324c9
[Kernel] Initial Activation Quantization Support (#4525)
Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
2024-05-23 21:29:18 +00:00
Murali Andoorveedu
5eda2ea02a
[Core][1/N] Support send/recv in PyNCCL Groups (#4988)
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
2024-05-23 09:54:48 -07:00
Letian Li
2ba80bed27
[Bugfix] Update Dockerfile.cpu to fix NameError: name 'vllm_ops' is not defined (#5009) 2024-05-23 09:08:58 -07:00
Alexander Matveev
6066253296
Marlin 24 prefill performance improvement (about 25% better on average) (#4983) 2024-05-23 02:39:27 -04:00
Cody Yu
ee3eea0a1b
[Misc] Take user preference in attention selector (#4960) 2024-05-23 07:55:56 +09:00
Philipp Moritz
a36de682d4
[Minor] Fix small typo in llama.py: QKVParallelLinear -> QuantizationConfig (#4991) 2024-05-22 22:26:56 +00:00
Nick Hill
eb6d3c264d
[Core] Eliminate parallel worker per-step task scheduling overhead (#4894) 2024-05-23 06:17:27 +09:00
raywanb
97b030005c
[Model] LoRA gptbigcode implementation (#3949) 2024-05-22 13:58:59 -07:00
Cody Yu
a3a73ab069
[Misc] Load FP8 kv-cache scaling factors from checkpoints (#4893)
The 2nd PR for #4532.

This PR supports loading FP8 kv-cache scaling factors from an FP8 checkpoint (with the .kv_scale parameter).
2024-05-22 13:28:20 -07:00
Tyler Michael Smith
8674f9880e
[Kernel] Fixup for CUTLASS kernels in CUDA graphs (#4954)
Pass the CUDA stream into the CUTLASS GEMMs to avoid future issues with CUDA graphs.
2024-05-22 14:10:43 +00:00
SangBin Cho
c74c913bfb
[misc] remove comments that were supposed to be removed (#4977) 2024-05-22 09:02:58 -04:00
Michael Goin
5f6d10c14c
[CI/Build] Enforce style for C++ and CUDA code with clang-format (#4722) 2024-05-22 07:18:41 +00:00
sasha0552
9b9a10d6cb
[Frontend] Dynamic RoPE scaling (#4638) 2024-05-22 01:32:35 -04:00
Isotr0py
99eff67ba9
[Bugfix][Kernel] Add head size check for attention backend selection (#4944) 2024-05-21 15:33:25 -04:00
Kante Yin
14772eeb8e
[Bugfix] Fix flag name for max_seq_len_to_capture (#4935)
Signed-off-by: kerthcet <kerthcet@gmail.com>
2024-05-21 09:30:52 -07:00
Michael Goin
757b62c495
[CI/Build] Codespell ignore build/ directory (#4945) 2024-05-21 09:06:10 -07:00
Simon Mo
e941f88584
[Docs] Add acknowledgment for sponsors (#4925) 2024-05-21 00:17:25 -07:00
Isotr0py
f12c3b5b3d
[Model] Add Phi-2 LoRA support (#4886) 2024-05-21 14:24:17 +09:00
HUANG Fei
d130b573a0
[Model] add rope_scaling support for qwen2 (#4930) 2024-05-21 05:22:22 +00:00
Antoni Baum
65ae8c2c8f
[Core] Fix scheduler considering "no LoRA" as "LoRA" (#4897) 2024-05-20 17:48:32 -07:00
Kuntai Du
c3af44722c
[Doc] Add documentation to benchmarking script when running TGI (#4920) 2024-05-20 20:16:57 +00:00
Aurick Qiao
1937e29848
[Core] Sharded State Loader download from HF (#4889) 2024-05-20 11:46:12 -07:00
Mor Zusman
f0eecee610
[Bugfix] Fix dummy weight for fp8 (#4916)
Allow the dummy load format for FP8, since torch.uniform_ doesn't support FP8 at the moment.

Co-authored-by: Mor Zusman <morz@ai21.com>
2024-05-20 18:44:25 +00:00
Alexei-V-Ivanov-AMD
943e72ca56
[Build/CI] Enabling AMD Entrypoints Test (#4834)
Co-authored-by: Alexey Kondratiev <alexey.kondratiev@amd.com>
2024-05-20 11:29:28 -07:00
Wenwei Zhang
546a97ef69
[Misc]: allow user to specify port in distributed setting (#4914) 2024-05-20 17:45:06 +00:00
Alexander Matveev
da5a0b539d
Remove marlin warning (#4918) 2024-05-20 14:55:34 +00:00