squall/vllm - vllm

Author	SHA1	Message	Date
Robert Shaw	4d26d806e1	Update conftest.py (#6076 )	2024-07-02 20:14:22 +00:00
Murali Andoorveedu	c5832d2ae9	[Core] Pipeline Parallel Support (#4412 ) Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>	2024-07-02 10:58:08 -07:00
Sirej Dua	15aba081f3	[Speculative Decoding] MLPSpeculator Tensor Parallel support (1/2) (#6050 ) Co-authored-by: Sirej Dua <sirej.dua@databricks.com> Co-authored-by: Sirej Dua <Sirej Dua>	2024-07-02 07:20:29 -07:00
Cyrus Leung	31354e563f	[Doc] Reinstate doc dependencies (#6061 )	2024-07-02 10:53:16 +00:00
xwjiang2010	98d6682cd1	[VLM] Remove `image_input_type` from VLM config (#5852 ) Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com> Co-authored-by: Roger Wang <ywang@roblox.com>	2024-07-02 07:57:09 +00:00
danieljannai21	2c37540aa6	[Frontend] Add template related params to request (#5709 )	2024-07-01 23:01:57 -07:00
Alexander Matveev	3476ed0809	[Core] Optimize block_manager_v2 vs block_manager_v1 (to make V2 default) (#5602 )	2024-07-01 20:10:37 -07:00
Thomas Parnell	54600709b6	[Model] Changes to MLPSpeculator to support tie_weights and input_scale (#5965 ) Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com> Co-authored-by: Joshua Rosenkranz <jmrosenk@us.ibm.com>	2024-07-01 16:40:02 -07:00
James Whedbee	e373853e12	[Frontend] Relax api url assertion for openai benchmarking (#6046 )	2024-07-01 23:39:10 +00:00
Nick Hill	c87ebc3ef9	[BugFix] Ensure worker model loop is always stopped at the right time (#5987 )	2024-07-01 16:17:58 -07:00
Antoni Baum	c4059ea54f	[Bugfix] Add explicit `end_forward` calls to flashinfer (#6044 )	2024-07-01 23:08:58 +00:00
Roger Wang	8e0817c262	[Bugfix][Doc] Fix Doc Formatting (#6048 )	2024-07-01 15:09:11 -07:00
ning.zhang	83bdcb6ac3	add FAQ doc under 'serving' (#5946 )	2024-07-01 14:11:36 -07:00
Avshalom Manevich	12a59959ed	[Bugfix] adding chunking mechanism to fused_moe to handle large inputs (#6029 )	2024-07-01 21:08:29 +00:00
Antoni Baum	dec6fc6f3b	[Bugfix] Use RayActorError for older versions of Ray in RayTokenizerGroupPool (#6039 )	2024-07-01 20:12:40 +00:00
youkaichao	8893130b63	[doc][misc] further lower visibility of simple api server (#6041 ) Co-authored-by: Simon Mo <simon.mo@hey.com>	2024-07-01 10:50:56 -07:00
zhyncs	bb60326836	[Misc] update benchmark backend for scalellm (#6018 )	2024-07-01 10:20:33 -07:00
youkaichao	4050d646e5	[doc][misc] remove deprecated api server in doc (#6037 )	2024-07-01 12:52:43 -04:00
Robert Shaw	d76084c12f	[ CI ] Re-enable Large Model LM Eval (#6031 )	2024-07-01 12:40:45 -04:00
sroy745	80ca1e6a3a	[Speculative Decoding 2/2 ] Integrate typical acceptance sampler into Spec Decode Worker (#5348 )	2024-07-01 00:33:05 -07:00
youkaichao	614aa51203	[misc][cuda] use nvml to avoid accidentally cuda initialization (#6007 )	2024-06-30 20:07:34 -07:00
Robert Shaw	af9ad46fca	[ Misc ] Refactor w8a8 to use `process_weights_after_load` (Simplify Weight Loading) (#5940 ) Co-authored-by: Robert Shaw <rshaw@neuralmagic>	2024-06-30 23:06:27 +00:00
Dipika Sikka	7836fdcc11	[Misc] Fix `get_min_capability` (#5971 )	2024-06-30 20:15:16 +00:00
Robert Shaw	deacb7ec44	[ CI ] Temporarily Disable Large LM-Eval Tests (#6005 ) Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic>	2024-06-30 11:56:56 -07:00
SangBin Cho	f5e73c9f1b	[Lora] Use safetensor keys instead of adapter_config.json to find unexpected modules. (#5909 ) Co-authored-by: sang <sangcho@anyscale.com>	2024-06-30 17:11:15 +00:00
llmpros	c6c240aa0a	[Frontend]: Support base64 embedding (#5935 ) Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>	2024-06-30 23:53:00 +08:00
youkaichao	2be6955a3f	[ci][distributed] fix device count call [ci][distributed] fix some cuda init that makes it necessary to use spawn (#5991)	2024-06-30 08:06:13 +00:00
Cyrus Leung	9d47f64eb6	[CI/Build] [3/3] Reorganize entrypoints tests (#5966 )	2024-06-30 12:58:49 +08:00
Cyrus Leung	cff6a1fec1	[CI/Build] Reuse code for checking output consistency (#5988 )	2024-06-30 11:44:25 +08:00
Roger Wang	bcc6a09b63	[CI/Build] Temporarily Remove Phi3-Vision from TP Test (#5989 )	2024-06-30 09:18:31 +08:00
Matt Wong	9def10664e	[Bugfix][CI/Build][Hardware][AMD] Install matching torchvision to fix AMD tests (#5949 )	2024-06-29 12:47:58 -07:00
Robert Shaw	75aa1442db	[ CI/Build ] LM Eval Harness Based CI Testing (#5838 ) Co-authored-by: Robert Shaw <rshaw@neuralmagic>	2024-06-29 13:04:30 -04:00
Cyrus Leung	99397da534	[CI/Build] Add TP test for vision models (#5892 )	2024-06-29 15:45:54 +00:00
Robert Shaw	8dbfcd35bf	[ CI/Build ] Added E2E Test For Compressed Tensors (#5839 ) Co-authored-by: Michael Goin <michael@neuralmagic.com> Co-authored-by: Robert Shaw <rshaw@neuralmagic>	2024-06-29 21:12:58 +08:00
Cody Yu	f7dac83d95	[Kernel] Raise an exception in MoE kernel if the batch size is larger then 65k (#5939 )	2024-06-29 21:04:20 +08:00
Antoni Baum	7c01f70641	[Core] Optimize `SequenceStatus.is_finished` by switching to IntEnum (#5974 )	2024-06-29 12:47:53 +00:00
Cyrus Leung	51e971d39e	[Bugfix] Support `eos_token_id` from `config.json` (#5954 )	2024-06-29 11:19:02 +00:00
Roger Wang	329df38f1a	[Misc] Update Phi-3-Vision Example (#5981 ) Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>	2024-06-29 14:34:29 +08:00
Woosuk Kwon	580353da93	[Bugfix] Fix precisions in Gemma 1 (#5913 )	2024-06-29 03:10:21 +00:00
Joe Runde	ba4994443a	[Kernel] Add punica dimensions for Granite 3b and 8b (#5930 ) Signed-off-by: Joe Runde <joe@joerun.de>	2024-06-29 10:48:25 +08:00
William Lin	906a19cdb0	[Misc] Extend vLLM Metrics logging API (#5925 ) Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>	2024-06-29 10:36:06 +08:00
mcalman	c4bca740e8	[Bugfix] fix missing last itl in openai completions benchmark (#5926 )	2024-06-29 10:34:42 +08:00
Woosuk Kwon	7f83f40dee	[Bugfix][TPU] Fix pad slot id (#5977 )	2024-06-28 18:55:17 -07:00
Woosuk Kwon	54814fd85b	[Bugfix][TPU] Fix TPU sampler output (#5978 )	2024-06-28 18:14:16 -07:00
Lily Liu	7041de4384	[Kernel] Flashinfer for prefill & decode, with Cudagraph support for decode (#4628 ) Co-authored-by: LiuXiaoxuanPKU <llilyliupku@gmail.com>, bong-furiosa <bongwon.jang@furiosa.ai>	2024-06-28 15:28:49 -07:00
Robert Shaw	6a62cb82cc	[Bugfix] Fix Engine Failing After Invalid Request - AsyncEngineDeadError (#5963 ) Co-authored-by: Robert Shaw <rshaw@neuralmagic>	2024-06-28 17:46:30 -04:00
Tyler Michael Smith	5d2a1a9cf0	Unmark more files as executable (#5962 )	2024-06-28 17:34:56 -04:00
Michael Goin	4bf35ed9ae	[Bugfix] Only add `Attention.kv_scale` if kv cache quantization is enabled (#5936 )	2024-06-28 21:12:40 +00:00
wangding zeng	be0b3af9e0	Support Deepseek-V2 (#4650 ) Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>	2024-06-28 13:24:57 -07:00
Robert Shaw	2cd402e169	[ Bugfix ] Enabling Loading Models With Fused QKV/MLP on Disk with FP8 (#5921 ) Co-authored-by: Robert Shaw <rshaw@neuralmagic>	2024-06-28 18:43:49 +00:00

1794 Commits