Author | Commit | Message | Date
Avshalom Manevich | 12a59959ed | [Bugfix] adding chunking mechanism to fused_moe to handle large inputs (#6029) | 2024-07-01 21:08:29 +00:00
Antoni Baum | dec6fc6f3b | [Bugfix] Use RayActorError for older versions of Ray in RayTokenizerGroupPool (#6039) | 2024-07-01 20:12:40 +00:00
youkaichao | 8893130b63 | [doc][misc] further lower visibility of simple api server (#6041) | 2024-07-01 10:50:56 -07:00
    Co-authored-by: Simon Mo <simon.mo@hey.com>
zhyncs | bb60326836 | [Misc] update benchmark backend for scalellm (#6018) | 2024-07-01 10:20:33 -07:00
youkaichao | 4050d646e5 | [doc][misc] remove deprecated api server in doc (#6037) | 2024-07-01 12:52:43 -04:00
Robert Shaw | d76084c12f | [ CI ] Re-enable Large Model LM Eval (#6031) | 2024-07-01 12:40:45 -04:00
sroy745 | 80ca1e6a3a | [Speculative Decoding 2/2 ] Integrate typical acceptance sampler into Spec Decode Worker (#5348) | 2024-07-01 00:33:05 -07:00
youkaichao | 614aa51203 | [misc][cuda] use nvml to avoid accidentally cuda initialization (#6007) | 2024-06-30 20:07:34 -07:00
Robert Shaw | af9ad46fca | [ Misc ] Refactor w8a8 to use process_weights_after_load (Simplify Weight Loading) (#5940) | 2024-06-30 23:06:27 +00:00
    Co-authored-by: Robert Shaw <rshaw@neuralmagic>
Dipika Sikka | 7836fdcc11 | [Misc] Fix get_min_capability (#5971) | 2024-06-30 20:15:16 +00:00
Robert Shaw | deacb7ec44 | [ CI ] Temporarily Disable Large LM-Eval Tests (#6005) | 2024-06-30 11:56:56 -07:00
    Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic>
SangBin Cho | f5e73c9f1b | [Lora] Use safetensor keys instead of adapter_config.json to find unexpected modules. (#5909) | 2024-06-30 17:11:15 +00:00
    Co-authored-by: sang <sangcho@anyscale.com>
llmpros | c6c240aa0a | [Frontend]: Support base64 embedding (#5935) | 2024-06-30 23:53:00 +08:00
    Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
youkaichao | 2be6955a3f | [ci][distributed] fix device count call | 2024-06-30 08:06:13 +00:00
    [ci][distributed] fix some cuda init that makes it necessary to use spawn (#5991)
Cyrus Leung | 9d47f64eb6 | [CI/Build] [3/3] Reorganize entrypoints tests (#5966) | 2024-06-30 12:58:49 +08:00
Cyrus Leung | cff6a1fec1 | [CI/Build] Reuse code for checking output consistency (#5988) | 2024-06-30 11:44:25 +08:00
Roger Wang | bcc6a09b63 | [CI/Build] Temporarily Remove Phi3-Vision from TP Test (#5989) | 2024-06-30 09:18:31 +08:00
Matt Wong | 9def10664e | [Bugfix][CI/Build][Hardware][AMD] Install matching torchvision to fix AMD tests (#5949) | 2024-06-29 12:47:58 -07:00
Robert Shaw | 75aa1442db | [ CI/Build ] LM Eval Harness Based CI Testing (#5838) | 2024-06-29 13:04:30 -04:00
    Co-authored-by: Robert Shaw <rshaw@neuralmagic>
Cyrus Leung | 99397da534 | [CI/Build] Add TP test for vision models (#5892) | 2024-06-29 15:45:54 +00:00
Robert Shaw | 8dbfcd35bf | [ CI/Build ] Added E2E Test For Compressed Tensors (#5839) | 2024-06-29 21:12:58 +08:00
    Co-authored-by: Michael Goin <michael@neuralmagic.com>
    Co-authored-by: Robert Shaw <rshaw@neuralmagic>
Cody Yu | f7dac83d95 | [Kernel] Raise an exception in MoE kernel if the batch size is larger then 65k (#5939) | 2024-06-29 21:04:20 +08:00
Antoni Baum | 7c01f70641 | [Core] Optimize SequenceStatus.is_finished by switching to IntEnum (#5974) | 2024-06-29 12:47:53 +00:00
Cyrus Leung | 51e971d39e | [Bugfix] Support eos_token_id from config.json (#5954) | 2024-06-29 11:19:02 +00:00
Roger Wang | 329df38f1a | [Misc] Update Phi-3-Vision Example (#5981) | 2024-06-29 14:34:29 +08:00
    Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Woosuk Kwon | 580353da93 | [Bugfix] Fix precisions in Gemma 1 (#5913) | 2024-06-29 03:10:21 +00:00
Joe Runde | ba4994443a | [Kernel] Add punica dimensions for Granite 3b and 8b (#5930) | 2024-06-29 10:48:25 +08:00
    Signed-off-by: Joe Runde <joe@joerun.de>
William Lin | 906a19cdb0 | [Misc] Extend vLLM Metrics logging API (#5925) | 2024-06-29 10:36:06 +08:00
    Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
mcalman | c4bca740e8 | [Bugfix] fix missing last itl in openai completions benchmark (#5926) | 2024-06-29 10:34:42 +08:00
Woosuk Kwon | 7f83f40dee | [Bugfix][TPU] Fix pad slot id (#5977) | 2024-06-28 18:55:17 -07:00
Woosuk Kwon | 54814fd85b | [Bugfix][TPU] Fix TPU sampler output (#5978) | 2024-06-28 18:14:16 -07:00
Lily Liu | 7041de4384 | [Kernel] Flashinfer for prefill & decode, with Cudagraph support for decode (#4628) | 2024-06-28 15:28:49 -07:00
    Co-authored-by: LiuXiaoxuanPKU <llilyliupku@gmail.com>, bong-furiosa <bongwon.jang@furiosa.ai>
Robert Shaw | 6a62cb82cc | [Bugfix] Fix Engine Failing After Invalid Request - AsyncEngineDeadError (#5963) | 2024-06-28 17:46:30 -04:00
    Co-authored-by: Robert Shaw <rshaw@neuralmagic>
Tyler Michael Smith | 5d2a1a9cf0 | Unmark more files as executable (#5962) | 2024-06-28 17:34:56 -04:00
Michael Goin | 4bf35ed9ae | [Bugfix] Only add Attention.kv_scale if kv cache quantization is enabled (#5936) | 2024-06-28 21:12:40 +00:00
wangding zeng | be0b3af9e0 | Support Deepseek-V2 (#4650) | 2024-06-28 13:24:57 -07:00
    Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
Robert Shaw | 2cd402e169 | [ Bugfix ] Enabling Loading Models With Fused QKV/MLP on Disk with FP8 (#5921) | 2024-06-28 18:43:49 +00:00
    Co-authored-by: Robert Shaw <rshaw@neuralmagic>
Robert Shaw | b185230744 | [ Misc ] Remove fp8_shard_indexer from Col/Row Parallel Linear (Simplify Weight Loading) (#5928) | 2024-06-28 13:49:57 -04:00
    Co-authored-by: Robert Shaw <rshaw@neuralmagic>
Tyler Michael Smith | 6a2d659d28 | [Bugfix] Fix compute datatype for cutlass 3.x epilogues (#5931) | 2024-06-28 17:10:34 +00:00
Cody Yu | b2c620230a | [Spec Decode] Introduce DraftModelRunner (#5799) | 2024-06-28 09:17:51 -07:00
xwjiang2010 | b90d8cd832 | [Distributed] Make it clear that % should not be in tensor dict keys. (#5927) | 2024-06-28 15:20:22 +00:00
    Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com>
Cyrus Leung | 3b752a6555 | [CI/Build] [2/3] Reorganize entrypoints tests (#5904) | 2024-06-28 07:59:18 -07:00
Thomas Parnell | ec1ad0046c | [Bugfix] Better error message for MLPSpeculator when num_speculative_tokens is set too high (#5894) | 2024-06-28 07:42:17 -07:00
    Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Ilya Lavrenov | 57f09a419c | [Hardware][Intel] OpenVINO vLLM backend (#5379) | 2024-06-28 13:50:16 +00:00
Tyler Michael Smith | 5932634409 | Unmark fused_moe config json file as executable (#5960) | 2024-06-28 06:36:12 -07:00
Cyrus Leung | 5cbe8d155c | [Core] Registry for processing model inputs (#5214) | 2024-06-28 12:09:56 +00:00
    Co-authored-by: ywang96 <ywang@roblox.com>
Isotr0py | 0d0e3a42ac | [Bugfix][Hardware][Intel CPU] Fix unpassed multi_modal_kwargs for CPU runner (#5956) | 2024-06-28 12:03:41 +00:00
xwjiang2010 | 74d55c065b | [VLM][BugFix] Make sure that multi_modal_kwargs can broadcast properly with ring buffer. (#5905) | 2024-06-28 07:29:13 +00:00
    Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com>
    Co-authored-by: Roger Wang <ywang@roblox.com>
Woosuk Kwon | f136da15e1 | [Hardware][TPU] Optimize KV cache swapping (#5878) | 2024-06-27 21:12:13 -07:00
Divakar Verma | c3dde367f1 | [Kernel][ROCm][AMD] fused_moe Triton configs v2 for mi300X (#5932) | 2024-06-27 13:41:08 -07:00