squall/vllm - vllm

Author	SHA1	Message	Date
Joe Runde	cfe712bf1a	[CI/Build] Use python 3.12 in cuda image (#8133) Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>	2024-09-07 13:03:16 -07:00
Isotr0py	e807125936	[Model][VLM] Support multi-images inputs for InternVL2 models (#8201)	2024-09-07 16:38:23 +08:00
Cyrus Leung	9f68e00d27	[Bugfix] Fix broken OpenAI tensorizer test (#8258)	2024-09-07 08:02:39 +00:00
youkaichao	ce2702a923	[tpu][misc] fix typo (#8260)	2024-09-06 22:40:46 -07:00
Cyrus Leung	2f707fcb35	[Model] Multi-input support for LLaVA (#8238)	2024-09-07 02:57:24 +00:00
Patrick von Platen	29f49cd6e3	[Model] Allow loading from original Mistral format (#8168) Co-authored-by: Michael Goin <michael@neuralmagic.com>	2024-09-06 17:02:05 -06:00
Alexey Kondratiev(AMD)	1447c97e75	[CI/Build] Increasing timeout for multiproc worker tests (#8203)	2024-09-06 11:51:03 -07:00
afeldman-nm	e5cab71531	[Frontend] Add --logprobs argument to `benchmark_serving.py` (#8191)	2024-09-06 09:01:14 -07:00
Jiaxin Shan	db3bf7c991	[Core] Support load and unload LoRA in api server (#6566) Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>	2024-09-05 18:10:33 -07:00
Alex Brooks	9da25a88aa	[MODEL] Qwen Multimodal Support (Qwen-VL / Qwen-VL-Chat) (#8029) Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com> Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>	2024-09-05 12:48:10 +00:00
manikandan.tm@zucisystems.com	8685ba1a1e	Inclusion of InternVLChatModel In PP_SUPPORTED_MODELS(Pipeline Parallelism) (#7860)	2024-09-05 11:33:37 +00:00
Elfie Guo	e39ebf5cf5	[Core/Bugfix] Add query dtype as per FlashInfer API requirements. (#8173)	2024-09-05 05:12:26 +00:00
Kyle Mistele	e02ce498be	[Feature] OpenAI-Compatible Tools API + Streaming for Hermes & Mistral models (#5649) Co-authored-by: constellate <constellate@1-ai-appserver-staging.codereach.com> Co-authored-by: Kyle Mistele <kyle@constellate.ai>	2024-09-04 13:18:13 -07:00
Woosuk Kwon	561d6f8077	[CI] Change test input in Gemma LoRA test (#8163)	2024-09-04 13:05:50 -07:00
alexeykondrat	d1dec64243	[CI/Build][ROCm] Enabling LoRA tests on ROCm (#7369) Co-authored-by: Simon Mo <simon.mo@hey.com>	2024-09-04 11:57:54 -07:00
Cody Yu	2ad2e5608e	[MISC] Consolidate FP8 kv-cache tests (#8131)	2024-09-04 18:53:25 +00:00
Cyrus Leung	855c262a6b	[Frontend] Multimodal support in offline chat (#8098)	2024-09-04 05:22:17 +00:00
Peter Salas	2be8ec6e71	[Model] Add Ultravox support for multiple audio chunks (#7963)	2024-09-04 04:38:21 +00:00
Dipika Sikka	2188a60c7e	[Misc] Update `GPTQ` to use `vLLMParameters` (#7976)	2024-09-03 17:21:44 -04:00
Alexander Matveev	6d646d08a2	[Core] Optimize Async + Multi-step (#8050)	2024-09-03 18:50:29 +00:00
wang.yuqi	6e36f4fa6c	improve chunked prefill performance [Bugfix] Fix #7592 vllm 0.5.4 enable_chunked_prefill throughput is slightly lower than 0.5.3~0.5.0. (#7874)	2024-09-02 14:20:12 -07:00
Lily Liu	e6a26ed037	[SpecDecode][Kernel] Flashinfer Rejection Sampling (#7244)	2024-09-01 21:23:29 -07:00
Shawn Tan	f8d60145b4	[Model] Add Granite model (#7436) Co-authored-by: Nick Hill <nickhill@us.ibm.com>	2024-09-01 18:37:18 -07:00
Roger Wang	5b86b19954	[Misc] Optional installation of audio related packages (#8063)	2024-09-01 14:46:57 -07:00
Roger Wang	5231f0898e	[Frontend][VLM] Add support for multiple multi-modal items (#8049)	2024-08-31 16:35:53 -07:00
Pavani Majety	622f8abff8	[Bugfix] bugfix and add model test for flashinfer fp8 kv cache. (#8013)	2024-08-30 22:18:50 -07:00
Wenxiang	1248e8506a	[Model] Adding support for MSFT Phi-3.5-MoE (#7729) Co-authored-by: Your Name <you@example.com> Co-authored-by: Zeqi Lin <zelin@microsoft.com> Co-authored-by: Zeqi Lin <Zeqi.Lin@microsoft.com>	2024-08-30 13:42:57 -06:00
Kaunil Dhruv	058344f89a	[Frontend]-config-cli-args (#7737) Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com> Co-authored-by: Kaunil Dhruv <kaunil_dhruv@intuit.com>	2024-08-30 08:21:02 -07:00
Jungho Christopher Cho	f97be32d1d	[VLM][Model] TP support for ViTs (#7186) Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com> Co-authored-by: Roger Wang <ywang@roblox.com>	2024-08-30 08:19:27 -07:00
afeldman-nm	428dd1445e	[Core] Logprobs support in Multi-step (#7652)	2024-08-29 19:19:08 -07:00
Cyrus Leung	4abed65c58	[VLM] Disallow overflowing `max_model_len` for multimodal models (#7998)	2024-08-29 17:49:04 -07:00
chenqianfzh	4664ceaad6	support bitsandbytes 8-bit and FP4 quantized models (#7445)	2024-08-29 19:09:08 -04:00
Pavani Majety	6b3421567d	[Core][Kernels] Enable FP8 KV Cache with Flashinfer backend. + BugFix for kv_cache_dtype=auto (#7985) Co-authored-by: Simon Mo <simon.mo@hey.com> Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>	2024-08-29 14:53:11 -04:00
Alexander Matveev	3f60f2244e	[Core] Combine async postprocessor and multi-step (#7921)	2024-08-29 11:18:26 -07:00
Jonas M. Kübler	f205c09854	[Bugfix] Unify rank computation across regular decoding and speculative decoding (#7899)	2024-08-28 22:18:13 -07:00
youkaichao	ef99a78760	Revert "[Core][Kernels] Use FlashInfer backend for FP8 KV Cache when available." (#7982)	2024-08-28 21:27:06 -07:00
Peter Salas	74d5543ec5	[VLM][Core] Fix exceptions on ragged NestedTensors (#7974)	2024-08-29 03:24:31 +00:00
youkaichao	a7f65c2be9	[torch.compile] remove reset (#7975)	2024-08-28 17:32:26 -07:00
youkaichao	ce6bf3a2cf	[torch.compile] avoid Dynamo guard evaluation overhead (#7898) Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>	2024-08-28 16:10:12 -07:00
Mor Zusman	fdd9daafa3	[Kernel/Model] Migrate mamba_ssm and causal_conv1d kernels to vLLM (#7651)	2024-08-28 15:06:52 -07:00
rasmith	e5697d161c	[Kernel] [Triton] [AMD] Adding Triton implementations awq_dequantize and awq_gemm to support AWQ (#7386)	2024-08-28 15:37:47 -04:00
Pavani Majety	b98cc28f91	[Core][Kernels] Use FlashInfer backend for FP8 KV Cache when available. (#7798) Co-authored-by: Simon Mo <simon.mo@hey.com>	2024-08-28 10:01:22 -07:00
Cody Yu	e3580537a4	[Performance] Enable chunked prefill and prefix caching together (#7753)	2024-08-28 00:36:31 -07:00
Cyrus Leung	51f86bf487	[mypy][CI/Build] Fix mypy errors (#7929)	2024-08-27 23:47:44 -07:00
Peter Salas	fab5f53e2d	[Core][VLM] Stack multimodal tensors to represent multiple images within each prompt (#7902)	2024-08-28 01:53:56 +00:00
zifeitong	5340a2dccf	[Model] Add multi-image input support for LLaVA-Next offline inference (#7230)	2024-08-28 07:09:02 +08:00
Dipika Sikka	fc911880cc	[Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel (#7766) Co-authored-by: ElizaWszola <eliza@neuralmagic.com>	2024-08-27 15:07:09 -07:00
Isotr0py	9db642138b	[CI/Build][VLM] Cleanup multiple images inputs model test (#7897)	2024-08-27 15:28:30 +00:00
Patrick von Platen	6fc4e6e07a	[Model] Add Mistral Tokenization to improve robustness and chat encoding (#7739)	2024-08-27 12:40:02 +00:00
youkaichao	64cc644425	[core][torch.compile] discard the compile for profiling (#7796)	2024-08-26 21:33:58 -07:00

732 Commits