Robert Shaw
4d26d806e1
Update conftest.py ( #6076 )
2024-07-02 20:14:22 +00:00
Murali Andoorveedu
c5832d2ae9
[Core] Pipeline Parallel Support ( #4412 )
...
Signed-off-by: Muralidhar Andoorveedu <muralidhar.andoorveedu@centml.ai>
2024-07-02 10:58:08 -07:00
Sirej Dua
15aba081f3
[Speculative Decoding] MLPSpeculator Tensor Parallel support (1/2) ( #6050 )
...
Co-authored-by: Sirej Dua <sirej.dua@databricks.com>
Co-authored-by: Sirej Dua <Sirej Dua>
2024-07-02 07:20:29 -07:00
Cyrus Leung
31354e563f
[Doc] Reinstate doc dependencies ( #6061 )
2024-07-02 10:53:16 +00:00
xwjiang2010
98d6682cd1
[VLM] Remove image_input_type from VLM config ( #5852 )
...
Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-07-02 07:57:09 +00:00
danieljannai21
2c37540aa6
[Frontend] Add template related params to request ( #5709 )
2024-07-01 23:01:57 -07:00
Alexander Matveev
3476ed0809
[Core] Optimize block_manager_v2 vs block_manager_v1 (to make V2 default) ( #5602 )
2024-07-01 20:10:37 -07:00
Thomas Parnell
54600709b6
[Model] Changes to MLPSpeculator to support tie_weights and input_scale ( #5965 )
...
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Co-authored-by: Joshua Rosenkranz <jmrosenk@us.ibm.com>
2024-07-01 16:40:02 -07:00
James Whedbee
e373853e12
[Frontend] Relax api url assertion for openai benchmarking ( #6046 )
2024-07-01 23:39:10 +00:00
Nick Hill
c87ebc3ef9
[BugFix] Ensure worker model loop is always stopped at the right time ( #5987 )
2024-07-01 16:17:58 -07:00
Antoni Baum
c4059ea54f
[Bugfix] Add explicit end_forward calls to flashinfer ( #6044 )
2024-07-01 23:08:58 +00:00
Roger Wang
8e0817c262
[Bugfix][Doc] Fix Doc Formatting ( #6048 )
2024-07-01 15:09:11 -07:00
ning.zhang
83bdcb6ac3
add FAQ doc under 'serving' ( #5946 )
2024-07-01 14:11:36 -07:00
Avshalom Manevich
12a59959ed
[Bugfix] adding chunking mechanism to fused_moe to handle large inputs ( #6029 )
2024-07-01 21:08:29 +00:00
Antoni Baum
dec6fc6f3b
[Bugfix] Use RayActorError for older versions of Ray in RayTokenizerGroupPool ( #6039 )
2024-07-01 20:12:40 +00:00
youkaichao
8893130b63
[doc][misc] further lower visibility of simple api server ( #6041 )
...
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-07-01 10:50:56 -07:00
zhyncs
bb60326836
[Misc] update benchmark backend for scalellm ( #6018 )
2024-07-01 10:20:33 -07:00
youkaichao
4050d646e5
[doc][misc] remove deprecated api server in doc ( #6037 )
2024-07-01 12:52:43 -04:00
Robert Shaw
d76084c12f
[ CI ] Re-enable Large Model LM Eval ( #6031 )
2024-07-01 12:40:45 -04:00
sroy745
80ca1e6a3a
[Speculative Decoding 2/2 ] Integrate typical acceptance sampler into Spec Decode Worker ( #5348 )
2024-07-01 00:33:05 -07:00
youkaichao
614aa51203
[misc][cuda] use nvml to avoid accidentally cuda initialization ( #6007 )
2024-06-30 20:07:34 -07:00
Robert Shaw
af9ad46fca
[ Misc ] Refactor w8a8 to use process_weights_after_load (Simplify Weight Loading) ( #5940 )
...
Co-authored-by: Robert Shaw <rshaw@neuralmagic>
2024-06-30 23:06:27 +00:00
Dipika Sikka
7836fdcc11
[Misc] Fix get_min_capability ( #5971 )
2024-06-30 20:15:16 +00:00
Robert Shaw
deacb7ec44
[ CI ] Temporarily Disable Large LM-Eval Tests ( #6005 )
...
Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic>
2024-06-30 11:56:56 -07:00
SangBin Cho
f5e73c9f1b
[Lora] Use safetensor keys instead of adapter_config.json to find unexpected modules. ( #5909 )
...
Co-authored-by: sang <sangcho@anyscale.com>
2024-06-30 17:11:15 +00:00
llmpros
c6c240aa0a
[Frontend]: Support base64 embedding ( #5935 )
...
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-06-30 23:53:00 +08:00
youkaichao
2be6955a3f
[ci][distributed] fix device count call
...
[ci][distributed] fix some cuda init that makes it necessary to use spawn (#5991 )
2024-06-30 08:06:13 +00:00
Cyrus Leung
9d47f64eb6
[CI/Build] [3/3] Reorganize entrypoints tests ( #5966 )
2024-06-30 12:58:49 +08:00
Cyrus Leung
cff6a1fec1
[CI/Build] Reuse code for checking output consistency ( #5988 )
2024-06-30 11:44:25 +08:00
Roger Wang
bcc6a09b63
[CI/Build] Temporarily Remove Phi3-Vision from TP Test ( #5989 )
2024-06-30 09:18:31 +08:00
Matt Wong
9def10664e
[Bugfix][CI/Build][Hardware][AMD] Install matching torchvision to fix AMD tests ( #5949 )
2024-06-29 12:47:58 -07:00
Robert Shaw
75aa1442db
[ CI/Build ] LM Eval Harness Based CI Testing ( #5838 )
...
Co-authored-by: Robert Shaw <rshaw@neuralmagic>
2024-06-29 13:04:30 -04:00
Cyrus Leung
99397da534
[CI/Build] Add TP test for vision models ( #5892 )
2024-06-29 15:45:54 +00:00
Robert Shaw
8dbfcd35bf
[ CI/Build ] Added E2E Test For Compressed Tensors ( #5839 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com>
Co-authored-by: Robert Shaw <rshaw@neuralmagic>
2024-06-29 21:12:58 +08:00
Cody Yu
f7dac83d95
[Kernel] Raise an exception in MoE kernel if the batch size is larger then 65k ( #5939 )
2024-06-29 21:04:20 +08:00
Antoni Baum
7c01f70641
[Core] Optimize SequenceStatus.is_finished by switching to IntEnum ( #5974 )
2024-06-29 12:47:53 +00:00
Cyrus Leung
51e971d39e
[Bugfix] Support eos_token_id from config.json ( #5954 )
2024-06-29 11:19:02 +00:00
Roger Wang
329df38f1a
[Misc] Update Phi-3-Vision Example ( #5981 )
...
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-06-29 14:34:29 +08:00
Woosuk Kwon
580353da93
[Bugfix] Fix precisions in Gemma 1 ( #5913 )
2024-06-29 03:10:21 +00:00
Joe Runde
ba4994443a
[Kernel] Add punica dimensions for Granite 3b and 8b ( #5930 )
...
Signed-off-by: Joe Runde <joe@joerun.de>
2024-06-29 10:48:25 +08:00
William Lin
906a19cdb0
[Misc] Extend vLLM Metrics logging API ( #5925 )
...
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2024-06-29 10:36:06 +08:00
mcalman
c4bca740e8
[Bugfix] fix missing last itl in openai completions benchmark ( #5926 )
2024-06-29 10:34:42 +08:00
Woosuk Kwon
7f83f40dee
[Bugfix][TPU] Fix pad slot id ( #5977 )
2024-06-28 18:55:17 -07:00
Woosuk Kwon
54814fd85b
[Bugfix][TPU] Fix TPU sampler output ( #5978 )
2024-06-28 18:14:16 -07:00
Lily Liu
7041de4384
[Kernel] Flashinfer for prefill & decode, with Cudagraph support for decode ( #4628 )
...
Co-authored-by: LiuXiaoxuanPKU <llilyliupku@gmail.com>, bong-furiosa <bongwon.jang@furiosa.ai>
2024-06-28 15:28:49 -07:00
Robert Shaw
6a62cb82cc
[Bugfix] Fix Engine Failing After Invalid Request - AsyncEngineDeadError ( #5963 )
...
Co-authored-by: Robert Shaw <rshaw@neuralmagic>
2024-06-28 17:46:30 -04:00
Tyler Michael Smith
5d2a1a9cf0
Unmark more files as executable ( #5962 )
2024-06-28 17:34:56 -04:00
Michael Goin
4bf35ed9ae
[Bugfix] Only add Attention.kv_scale if kv cache quantization is enabled ( #5936 )
2024-06-28 21:12:40 +00:00
wangding zeng
be0b3af9e0
Support Deepseek-V2 ( #4650 )
...
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
2024-06-28 13:24:57 -07:00
Robert Shaw
2cd402e169
[ Bugfix ] Enabling Loading Models With Fused QKV/MLP on Disk with FP8 ( #5921 )
...
Co-authored-by: Robert Shaw <rshaw@neuralmagic>
2024-06-28 18:43:49 +00:00