Cyrus Leung
7015417fd4
[Bugfix] Add missing attributes in mistral tokenizer ( #8364 )
2024-09-11 11:36:54 -07:00
Alexey Kondratiev(AMD)
aea02f30de
[CI/Build] Excluding test_moe.py from AMD Kernels tests for investigation ( #8373 )
2024-09-11 18:31:41 +00:00
Li, Jiang
0b952af458
[Hardware][Intel] Support compressed-tensor W8A8 for CPU backend ( #7257 )
2024-09-11 09:46:46 -07:00
Yang Fan
3b7fea770f
[Model][VLM] Add Qwen2-VL model support ( #7905 )
...
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-09-11 09:31:19 -07:00
Pooya Davoodi
cea95dfb94
[Frontend] Create ErrorResponse instead of raising exceptions in run_batch ( #8347 )
2024-09-11 05:30:11 +00:00
Yangshen⚡Deng
6a512a00df
[model] Support for Llava-Next-Video model ( #7559 )
...
Co-authored-by: Roger Wang <ywang@roblox.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
2024-09-10 22:21:36 -07:00
Pavani Majety
efcf946a15
[Hardware][NV] Add support for ModelOpt static scaling checkpoints. ( #6112 )
2024-09-11 00:38:40 -04:00
Isotr0py
1230263e16
[Bugfix] Fix InternVL2 vision embeddings process with pipeline parallel ( #8299 )
2024-09-11 10:11:01 +08:00
Jee Jee Li
e497b8aeff
[Misc] Skip loading extra bias for Qwen2-MOE GPTQ models ( #8329 )
2024-09-10 20:59:19 -04:00
Tyler Michael Smith
94144e726c
[CI/Build][Kernel] Update CUTLASS to 3.5.1 tag ( #8043 )
2024-09-10 23:51:58 +00:00
William Lin
1d5e397aa4
[Core/Bugfix] pass VLLM_ATTENTION_BACKEND to ray workers ( #8172 )
2024-09-10 23:46:08 +00:00
Alexander Matveev
22f3a4bc6c
[Bugfix] lookahead block table with cuda graph max capture ( #8340 )
...
[Bugfix] Ensure multistep lookahead allocation is compatible with cuda graph max capture ( #8340 )
2024-09-10 16:00:35 -07:00
Cody Yu
b1f3e18958
[MISC] Keep chunked prefill enabled by default with long context when prefix caching is enabled ( #8342 )
2024-09-10 22:28:28 +00:00
Prashant Gupta
04e7c4e771
[Misc] remove peft as dependency for prompt models ( #8162 )
2024-09-10 17:21:56 -04:00
Kevin Lin
5faedf1b62
[Spec Decode] Move ops.advance_step to flash attn advance_step ( #8224 )
2024-09-10 13:18:14 -07:00
sumitd2
02751a7a42
Fix ppc64le buildkite job ( #8309 )
2024-09-10 12:58:34 -07:00
Alexey Kondratiev(AMD)
f421f3cefb
[CI/Build] Enabling kernels tests for AMD, ignoring some of them that fail ( #8130 )
2024-09-10 11:51:15 -07:00
Cyrus Leung
8c054b7a62
[Frontend] Clean up type annotations for mistral tokenizer ( #8314 )
2024-09-10 16:49:11 +00:00
Daniele
6234385f4a
[CI/Build] enable ccache/sccache for HIP builds ( #8327 )
2024-09-10 08:55:08 -07:00
Cyrus Leung
da1a844e61
[Bugfix] Fix missing post_layernorm in CLIP ( #8155 )
2024-09-10 08:22:50 +00:00
Simon Mo
a1d874224d
Add NVIDIA Meetup slides, announce AMD meetup, and add contact info ( #8319 )
2024-09-09 23:21:00 -07:00
Dipika Sikka
6cd5e5b07e
[Misc] Fused MoE Marlin support for GPTQ ( #8217 )
2024-09-09 23:02:52 -04:00
Kyle Sayers
c7cb5c3335
[Misc] GPTQ Activation Ordering ( #8135 )
2024-09-09 16:27:26 -04:00
Vladislav Kruglikov
f9b4a2d415
[Bugfix] Correct adapter usage for cohere and jamba ( #8292 )
2024-09-09 11:20:46 -07:00
Adam Lugowski
58fcc8545a
[Frontend] Add progress reporting to run_batch.py ( #8060 )
...
Co-authored-by: Adam Lugowski <adam.lugowski@parasail.io>
2024-09-09 11:16:37 -07:00
Kyle Mistele
08287ef675
[Bugfix] Streamed tool calls now more strictly follow OpenAI's format; ensures Vercel AI SDK compatibility ( #8272 )
2024-09-09 10:45:11 -04:00
Alexander Matveev
4ef41b8476
[Bugfix] Fix async postprocessor in case of preemption ( #8267 )
2024-09-07 21:01:51 -07:00
Joe Runde
cfe712bf1a
[CI/Build] Use python 3.12 in cuda image ( #8133 )
...
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-09-07 13:03:16 -07:00
sumitd2
b962ee1470
ppc64le: Dockerfile fixed, and a script for buildkite ( #8026 )
2024-09-07 11:18:40 -07:00
Isotr0py
36bf8150cc
[Model][VLM] Decouple weight loading logic for Paligemma ( #8269 )
2024-09-07 17:45:44 +00:00
Isotr0py
e807125936
[Model][VLM] Support multi-images inputs for InternVL2 models ( #8201 )
2024-09-07 16:38:23 +08:00
Cyrus Leung
9f68e00d27
[Bugfix] Fix broken OpenAI tensorizer test ( #8258 )
2024-09-07 08:02:39 +00:00
youkaichao
ce2702a923
[tpu][misc] fix typo ( #8260 )
2024-09-06 22:40:46 -07:00
Wei-Sheng Chin
795b662cff
Enable Random Prefix Caching in Serving Profiling Tool (benchmark_serving.py) ( #8241 )
2024-09-06 20:18:16 -07:00
Cyrus Leung
2f707fcb35
[Model] Multi-input support for LLaVA ( #8238 )
2024-09-07 02:57:24 +00:00
Kyle Mistele
41e95c5247
[Bugfix] Fix Hermes tool call chat template bug ( #8256 )
...
Co-authored-by: Kyle Mistele <kyle@constellate.ai>
2024-09-07 10:49:01 +08:00
William Lin
12dd715807
[misc] [doc] [frontend] LLM torch profiler support ( #7943 )
2024-09-06 17:48:48 -07:00
Patrick von Platen
29f49cd6e3
[Model] Allow loading from original Mistral format ( #8168 )
...
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-09-06 17:02:05 -06:00
Dipika Sikka
23f322297f
[Misc] Remove SqueezeLLM ( #8220 )
2024-09-06 16:29:03 -06:00
rasmith
9db52eab3d
[Kernel] [Triton] Memory optimization for awq_gemm and awq_dequantize, 2x throughput ( #8248 )
2024-09-06 16:26:09 -06:00
Alexey Kondratiev(AMD)
1447c97e75
[CI/Build] Increasing timeout for multiproc worker tests ( #8203 )
2024-09-06 11:51:03 -07:00
Rui Qiao
de80783b69
[Misc] Use ray[adag] dependency instead of cuda ( #7938 )
2024-09-06 09:18:35 -07:00
afeldman-nm
e5cab71531
[Frontend] Add --logprobs argument to benchmark_serving.py ( #8191 )
2024-09-06 09:01:14 -07:00
Nick Hill
baa5467547
[BugFix] Fix Granite model configuration ( #8216 )
2024-09-06 11:39:29 +08:00
Jiaxin Shan
db3bf7c991
[Core] Support load and unload LoRA in api server ( #6566 )
...
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
2024-09-05 18:10:33 -07:00
sroy745
2febcf2777
[Documentation][Spec Decode] Add documentation about lossless guarantees in Speculative Decoding in vLLM ( #7962 )
2024-09-05 16:25:29 -04:00
Michael Goin
2ee45281a5
Move verify_marlin_supported to GPTQMarlinLinearMethod ( #8165 )
2024-09-05 11:09:46 -04:00
Alex Brooks
9da25a88aa
[MODEL] Qwen Multimodal Support (Qwen-VL / Qwen-VL-Chat) ( #8029 )
...
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
2024-09-05 12:48:10 +00:00
manikandan.tm@zucisystems.com
8685ba1a1e
Inclusion of InternVLChatModel In PP_SUPPORTED_MODELS(Pipeline Parallelism) ( #7860 )
2024-09-05 11:33:37 +00:00
Cyrus Leung
288a938872
[Doc] Indicate more information about supported modalities ( #8181 )
2024-09-05 10:51:53 +00:00