squall/vllm - vllm - Gitea: Git with a cup of tea

Author	SHA1	Message	Date
Cody Yu	f7dac83d95	[Kernel] Raise an exception in MoE kernel if the batch size is larger than 65k (#5939 )	2024-06-29 21:04:20 +08:00
Antoni Baum	7c01f70641	[Core] Optimize `SequenceStatus.is_finished` by switching to IntEnum (#5974 )	2024-06-29 12:47:53 +00:00
Cyrus Leung	51e971d39e	[Bugfix] Support `eos_token_id` from `config.json` (#5954 )	2024-06-29 11:19:02 +00:00
Roger Wang	329df38f1a	[Misc] Update Phi-3-Vision Example (#5981 ) Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>	2024-06-29 14:34:29 +08:00
Woosuk Kwon	580353da93	[Bugfix] Fix precisions in Gemma 1 (#5913 )	2024-06-29 03:10:21 +00:00
Joe Runde	ba4994443a	[Kernel] Add punica dimensions for Granite 3b and 8b (#5930 ) Signed-off-by: Joe Runde <joe@joerun.de>	2024-06-29 10:48:25 +08:00
William Lin	906a19cdb0	[Misc] Extend vLLM Metrics logging API (#5925 ) Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>	2024-06-29 10:36:06 +08:00
mcalman	c4bca740e8	[Bugfix] fix missing last itl in openai completions benchmark (#5926 )	2024-06-29 10:34:42 +08:00
Woosuk Kwon	7f83f40dee	[Bugfix][TPU] Fix pad slot id (#5977 )	2024-06-28 18:55:17 -07:00
Woosuk Kwon	54814fd85b	[Bugfix][TPU] Fix TPU sampler output (#5978 )	2024-06-28 18:14:16 -07:00
Lily Liu	7041de4384	[Kernel] Flashinfer for prefill & decode, with Cudagraph support for decode (#4628 ) Co-authored-by: LiuXiaoxuanPKU <llilyliupku@gmail.com>, bong-furiosa <bongwon.jang@furiosa.ai>	2024-06-28 15:28:49 -07:00
Robert Shaw	6a62cb82cc	[Bugfix] Fix Engine Failing After Invalid Request - AsyncEngineDeadError (#5963 ) Co-authored-by: Robert Shaw <rshaw@neuralmagic>	2024-06-28 17:46:30 -04:00
Tyler Michael Smith	5d2a1a9cf0	Unmark more files as executable (#5962 )	2024-06-28 17:34:56 -04:00
Michael Goin	4bf35ed9ae	[Bugfix] Only add `Attention.kv_scale` if kv cache quantization is enabled (#5936 )	2024-06-28 21:12:40 +00:00
wangding zeng	be0b3af9e0	Support Deepseek-V2 (#4650 ) Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>	2024-06-28 13:24:57 -07:00
Robert Shaw	2cd402e169	[ Bugfix ] Enabling Loading Models With Fused QKV/MLP on Disk with FP8 (#5921 ) Co-authored-by: Robert Shaw <rshaw@neuralmagic>	2024-06-28 18:43:49 +00:00
Robert Shaw	b185230744	[ Misc ] Remove `fp8_shard_indexer` from Col/Row Parallel Linear (Simplify Weight Loading) (#5928 ) Co-authored-by: Robert Shaw <rshaw@neuralmagic>	2024-06-28 13:49:57 -04:00
Tyler Michael Smith	6a2d659d28	[Bugfix] Fix compute datatype for cutlass 3.x epilogues (#5931 )	2024-06-28 17:10:34 +00:00
Cody Yu	b2c620230a	[Spec Decode] Introduce DraftModelRunner (#5799 )	2024-06-28 09:17:51 -07:00
xwjiang2010	b90d8cd832	[Distributed] Make it clear that % should not be in tensor dict keys. (#5927 ) Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com>	2024-06-28 15:20:22 +00:00
Cyrus Leung	3b752a6555	[CI/Build] [2/3] Reorganize entrypoints tests (#5904 )	2024-06-28 07:59:18 -07:00
Thomas Parnell	ec1ad0046c	[Bugfix] Better error message for MLPSpeculator when `num_speculative_tokens` is set too high (#5894 ) Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>	2024-06-28 07:42:17 -07:00
Ilya Lavrenov	57f09a419c	[Hardware][Intel] OpenVINO vLLM backend (#5379 )	2024-06-28 13:50:16 +00:00
Tyler Michael Smith	5932634409	Unmark fused_moe config json file as executable (#5960 )	2024-06-28 06:36:12 -07:00
Cyrus Leung	5cbe8d155c	[Core] Registry for processing model inputs (#5214 ) Co-authored-by: ywang96 <ywang@roblox.com>	2024-06-28 12:09:56 +00:00
Isotr0py	0d0e3a42ac	[Bugfix][Hardware][Intel CPU] Fix unpassed multi_modal_kwargs for CPU runner (#5956 )	2024-06-28 12:03:41 +00:00
xwjiang2010	74d55c065b	[VLM][BugFix] Make sure that `multi_modal_kwargs` can broadcast properly with ring buffer. (#5905 ) Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com> Co-authored-by: Roger Wang <ywang@roblox.com>	2024-06-28 07:29:13 +00:00
Woosuk Kwon	f136da15e1	[Hardware][TPU] Optimize KV cache swapping (#5878 )	2024-06-27 21:12:13 -07:00
Divakar Verma	c3dde367f1	[Kernel][ROCm][AMD] fused_moe Triton configs v2 for mi300X (#5932 )	2024-06-27 13:41:08 -07:00
youkaichao	64e8d2a783	[core][misc] remove logical block (#5882 )	2024-06-27 13:34:55 -07:00
Woosuk Kwon	79c92c7c8a	[Model] Add Gemma 2 (#5908 )	2024-06-27 13:33:56 -07:00
Roger Wang	736ed38849	[CI/Build] Fix Args for `_get_logits_warper` in Sampler Test (#5922 )	2024-06-27 11:43:04 -07:00
Nick Hill	365791ff81	[BugFix] Fix `min_tokens` behaviour for multiple eos tokens (#5849 )	2024-06-27 11:31:11 -07:00
Nick Hill	691e29ecf3	[BugFix] Fix `MLPSpeculator` handling of `num_speculative_tokens` (#5876 )	2024-06-27 10:59:33 -07:00
youkaichao	3fd02bda51	[doc][misc] add note for Kubernetes users (#5916 )	2024-06-27 10:07:07 -07:00
Cyrus Leung	98cf2ed678	[Model][Bugfix] Implicit model flags and reenable Phi-3-Vision (#5896 )	2024-06-27 09:08:10 -07:00
Cyrus Leung	e9d32d077d	[CI/Build] [1/3] Reorganize entrypoints tests (#5526 )	2024-06-27 12:43:17 +00:00
Roger Wang	2061f0b8a7	[Bugfix] Fix img_sizes Parsing in Phi3-Vision (#5888 )	2024-06-27 08:29:24 +00:00
Cyrus Leung	96354d6a29	[Model] Add base class for LoRA-supported models (#5018 )	2024-06-27 16:03:04 +08:00
xwjiang2010	d12af207d2	[VLM][Bugfix] Make sure that `multi_modal_kwargs` is broadcasted properly (#5880 ) Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com>	2024-06-27 15:15:24 +08:00
Cyrus Leung	6eabc6cb0e	[Doc] Add note about context length in Phi-3-Vision example (#5887 )	2024-06-26 23:20:01 -07:00
Nick Hill	2110557dab	[BugFix] Fix cuda graph for MLPSpeculator (#5875 ) Co-authored-by: Abhinav Goyal <abhinav.goyal@flipkart.com>	2024-06-27 04:12:10 +00:00
Roger Wang	b9e84259e9	[Misc] Add example for LLaVA-NeXT (#5879 )	2024-06-26 17:57:16 -07:00
youkaichao	294104c3f9	[doc] update usage of env var to avoid conflict (#5873 )	2024-06-26 17:57:12 -04:00
Chip Kerchner	38a1674abb	Support CPU inference with VSX PowerPC ISA (#5652 )	2024-06-26 21:53:04 +00:00
Woosuk Kwon	f5c8628fdc	[Bugfix][TPU] Fix CPU cache allocation (#5869 )	2024-06-26 13:42:40 -07:00
Woosuk Kwon	cbc53b6b8d	[Hardware][TPU] Support parallel sampling & Swapping (#5855 )	2024-06-26 11:07:49 -07:00
sasha0552	c54269d967	[Frontend] Add tokenize/detokenize endpoints (#5054 )	2024-06-26 16:54:22 +00:00
Luka Govedič	5bfd1bbc98	[Kernel] Adding bias epilogue support for `cutlass_scaled_mm` (#5560 ) Co-authored-by: Chih-Chieh-Yang <7364402+cyang49@users.noreply.github.com> Co-authored-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>	2024-06-26 15:16:00 +00:00
Cyrus Leung	6984c02a27	[CI/Build] Refactor image test assets (#5821 )	2024-06-26 01:02:34 -07:00

1910 Commits