Thomas Parnell
8a1415cf77
[Bugfix] GPTBigCodeForCausalLM: Remove lm_head from supported_lora_modules. ( #6326 )
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Co-authored-by: Travis Johnson <tsjohnso@us.ibm.com>
2024-07-11 07:05:59 -07:00
pushan
546b101fa0
[BugFix]: fix engine timeout due to request abort ( #6255 )
Signed-off-by: yatta zhang <ytzhang01@foxmail.com>
Signed-off-by: zhangyuntao.dev <zhangyuntao.dev@bytedance.com>
Co-authored-by: zhangyuntao.dev <zhangyuntao.dev@bytedance.com>
2024-07-11 06:46:31 -07:00
aniaan
3963a5335b
[Misc] refactor(config): clean up unused code ( #6320 )
2024-07-11 09:39:07 +00:00
daquexian
99ded1e1c4
[Doc] Remove comments incorrectly copied from another project ( #6286 )
2024-07-10 17:05:26 -07:00
Woosuk Kwon
997df46a32
[Bugfix][Neuron] Fix soft prompt method error in NeuronExecutor ( #6313 )
2024-07-10 16:39:02 -07:00
sroy745
ae151d73be
[Speculative Decoding] Enabling bonus token in speculative decoding for KV cache based models ( #5765 )
2024-07-10 16:02:47 -07:00
sangjune.park
44cc76610d
[Bugfix] Fix OpenVINOExecutor abstractmethod error ( #6296 )
Signed-off-by: sangjune.park <sangjune.park@navercorp.com>
2024-07-10 10:03:32 -07:00
Benjamin Muskalla
b422d4961a
[CI/Build] Enable mypy typing for remaining folders ( #6268 )
2024-07-10 22:15:55 +08:00
Thomas Parnell
c38eba3046
[Bugfix] MLPSpeculator: Use ParallelLMHead in tie_weights=False case. ( #6303 )
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2024-07-10 09:04:07 -04:00
Woosuk Kwon
e72ae80b06
[Bugfix] Support 2D input shape in MoE layer ( #6287 )
2024-07-10 09:03:16 -04:00
Cyrus Leung
8a924d2248
[Doc] Guide for adding multi-modal plugins ( #6205 )
2024-07-10 14:55:34 +08:00
Woosuk Kwon
5ed3505d82
[Bugfix][TPU] Add prompt adapter methods to TPUExecutor ( #6279 )
2024-07-09 19:30:56 -07:00
youkaichao
da78caecfa
[core][distributed] zmq fallback for broadcasting large objects ( #6183 )
2024-07-09 18:49:11 -07:00
Abhinav Goyal
2416b26e11
[Speculative Decoding] Medusa Implementation with Top-1 proposer ( #4978 )
2024-07-09 18:34:02 -07:00
Baoyuan Qi
d3a245138a
[Bugfix] Fix needs_scalar_to_array logic check ( #6238 )
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
2024-07-09 23:43:24 +00:00
Swapnil Parekh
4d6ada947c
[CORE] Adding support for insertion of soft-tuned prompts ( #4645 )
Co-authored-by: Swapnil Parekh <swapnilp@ibm.com>
Co-authored-by: Joe G <joseph.granados@h2o.ai>
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2024-07-09 13:26:36 -07:00
Woosuk Kwon
5d5b4c5fe5
[Bugfix][TPU] Add missing None to model input ( #6245 )
2024-07-09 00:21:37 -07:00
youkaichao
70c232f85a
[core][distributed] fix ray worker rank assignment ( #6235 )
2024-07-08 21:31:44 -07:00
youkaichao
a3c9435d93
[hardware][cuda] use device id under CUDA_VISIBLE_DEVICES for get_device_capability ( #6216 )
2024-07-08 20:02:15 -07:00
tomeras91
ddc369fba1
[Bugfix] Mamba cache Cuda Graph padding ( #6214 )
2024-07-08 11:25:51 -07:00
Eric
185ad31f37
[Bugfix] use diskcache in outlines _get_guide #5436 ( #6203 )
2024-07-08 11:23:24 -07:00
afeldman-nm
543aa48573
[Kernel] Correctly invoke prefill & decode kernels for cross-attention (towards eventual encoder/decoder model support) ( #4888 )
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-07-08 17:12:15 +00:00
Avshalom Manevich
f7a8fa39d8
[Kernel] reloading fused_moe config on the last chunk ( #6210 )
2024-07-08 08:00:38 -07:00
kczimm
16620f439d
do not exclude object field in CompletionStreamResponse ( #6196 )
2024-07-08 10:32:57 +08:00
youkaichao
3b08fe2b13
[misc][frontend] log all available endpoints ( #6195 )
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2024-07-07 15:11:12 -07:00
Robert Shaw
abfe705a02
[ Misc ] Support Fp8 via llm-compressor ( #6110 )
Co-authored-by: Robert Shaw <rshaw@neuralmagic>
2024-07-07 20:42:11 +00:00
Roger Wang
6206dcb29e
[Model] Add PaliGemma ( #5189 )
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-07-07 09:25:50 +08:00
Cyrus Leung
9389380015
[Doc] Move guide for multimodal model and other improvements ( #6168 )
2024-07-06 17:18:59 +08:00
Simon Mo
abad5746a7
bump version to v0.5.1 ( #6157 )
2024-07-05 12:04:51 -07:00
JGSweets
e58294ddf2
[Bugfix] Add verbose error if scipy is missing for blocksparse attention ( #5695 )
2024-07-05 10:41:01 -07:00
jvlunteren
f1e15da6fe
[Frontend] Continuous usage stats in OpenAI completion API ( #5742 )
2024-07-05 10:37:09 -07:00
Cyrus Leung
ea4b570483
[VLM] Cleanup validation and update docs ( #6149 )
2024-07-05 05:49:38 +00:00
Roger Wang
a41357e941
[VLM] Improve consistency between feature size calculation and dummy data for profiling ( #6146 )
2024-07-05 09:29:47 +08:00
Cyrus Leung
ae96ef8fbd
[VLM] Calculate maximum number of multi-modal tokens by model ( #6121 )
2024-07-04 16:37:23 -07:00
Lily Liu
69ec3ca14c
[Kernel][Model] logits_soft_cap for Gemma2 with flashinfer ( #6051 )
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-07-04 16:35:51 -07:00
Yuan
81d7a50f24
[Hardware][Intel CPU] Adding intel openmp tunings in Docker file ( #6008 )
Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>
2024-07-04 15:22:12 -07:00
Gregory Shtrasberg
56b325e977
[ROCm][AMD][Model]Adding alibi slopes support in ROCm triton flash attention and naive flash attention ( #6043 )
Co-authored-by: Hongxia Yang <62075498+hongxiayang@users.noreply.github.com>
2024-07-03 22:19:38 -07:00
Cyrus Leung
3dd507083f
[CI/Build] Cleanup VLM tests ( #6107 )
2024-07-03 18:58:18 -07:00
Murali Andoorveedu
0ed646b7aa
[Distributed][Core] Support Py39 and Py38 for PP ( #6120 )
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
2024-07-03 17:52:29 -07:00
Travis Johnson
1dab9bc8a9
[Bugfix] set OMP_NUM_THREADS to 1 by default for multiprocessing ( #6109 )
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2024-07-03 16:56:59 -07:00
youkaichao
3de6e6a30e
[core][distributed] support n layers % pp size != 0 ( #6115 )
2024-07-03 16:40:31 -07:00
Robert Shaw
62963d129e
[ Misc ] Clean Up CompressedTensorsW8A8 ( #6113 )
2024-07-03 22:50:08 +00:00
xwjiang2010
d9e98f42e4
[vlm] Remove vision language config. ( #6089 )
Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-07-03 22:14:16 +00:00
youkaichao
3c6325f0fc
[core][distributed] custom allreduce when pp size > 1 ( #6117 )
2024-07-03 14:41:32 -07:00
Michael Goin
47f0954af0
[Kernel] Expand FP8 support to Ampere GPUs using FP8 Marlin ( #5975 )
2024-07-03 17:38:00 +00:00
Roger Wang
7cd2ebb025
[Bugfix] Fix compute_logits in Jamba ( #6093 )
2024-07-03 00:32:35 -07:00
Roger Wang
3a86b54fb0
[VLM][Frontend] Proper Image Prompt Formatting from OpenAI API ( #6091 )
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-07-02 23:41:23 -07:00
youkaichao
f666207161
[misc][distributed] error on invalid state ( #6092 )
2024-07-02 23:37:29 -07:00
Nick Hill
d830656a97
[BugFix] Avoid unnecessary Ray import warnings ( #6079 )
2024-07-03 14:09:40 +08:00
Cyrus Leung
9831aec49f
[Core] Dynamic image size support for VLMs ( #5276 )
Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com>
Co-authored-by: Xiaowei Jiang <xwjiang2010@gmail.com>
Co-authored-by: ywang96 <ywang@roblox.com>
Co-authored-by: xwjiang2010 <87673679+xwjiang2010@users.noreply.github.com>
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
2024-07-02 20:34:00 -07:00