Commit Graph

2666 Commits

Author SHA1 Message Date
Cody Yu
9606c7197d
Revert #7509 (#7887) 2024-08-27 00:16:31 -07:00
youkaichao
64cc644425
[core][torch.compile] discard the compile for profiling (#7796) 2024-08-26 21:33:58 -07:00
Nick Hill
39178c7fbc
[Tests] Disable retries and use context manager for openai client (#7565) 2024-08-26 21:33:17 -07:00
Megha Agarwal
2eedede875
[Core] Asynchronous Output Processor (#7049)
Co-authored-by: Alexander Matveev <alexm@neuralmagic.com>
2024-08-26 20:53:20 -07:00
Dipika Sikka
015e6cc252
[Misc] Update compressed tensors lifecycle to remove prefix from create_weights (#7825) 2024-08-26 18:09:34 -06:00
omrishiv
760e9f71a8
[Bugfix] neuron: enable tensor parallelism (#7562)
Signed-off-by: omrishiv <327609+omrishiv@users.noreply.github.com>
2024-08-26 15:13:13 -07:00
youkaichao
05826c887b
[misc] fix custom allreduce p2p cache file generation (#7853) 2024-08-26 15:02:25 -07:00
Dipika Sikka
dd9857f5fa
[Misc] Update gptq_marlin_24 to use vLLMParameters (#7762)
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-08-26 17:44:54 -04:00
Dipika Sikka
665304092d
[Misc] Update qqq to use vLLMParameters (#7805) 2024-08-26 13:16:15 -06:00
Cody Yu
2deb029d11
[Performance][BlockManagerV2] Mark prefix cache block as computed after schedule (#7822) 2024-08-26 11:24:53 -07:00
Cyrus Leung
029c71de11
[CI/Build] Avoid downloading all HF files in RemoteOpenAIServer (#7836) 2024-08-26 05:31:10 +00:00
ℍ𝕠𝕝𝕝𝕠𝕨 𝕄𝕒𝕟
0b769992ec
[Bugfix]: Use float32 for base64 embedding (#7855)
Signed-off-by: Hollow Man <hollowman@opensuse.org>
2024-08-26 03:16:38 +00:00
Nick Hill
1856aff4d6
[Spec Decoding] Streamline batch expansion tensor manipulation (#7851) 2024-08-25 15:45:14 -07:00
youkaichao
70c094ade6
[misc][cuda] improve pynvml warning (#7852) 2024-08-25 14:30:09 -07:00
Isotr0py
2059b8d9ca
[Misc] Remove snapshot_download usage in InternVL2 test (#7835) 2024-08-25 15:53:09 +00:00
Isotr0py
8aaf3d5347
[Model][VLM] Support multi-images inputs for Phi-3-vision models (#7783) 2024-08-25 11:51:20 +00:00
zifeitong
80162c44b1
[Bugfix] Fix Phi-3v crash when input images are of certain sizes (#7840) 2024-08-24 18:16:24 -07:00
youkaichao
aab0fcdb63
[ci][test] fix RemoteOpenAIServer (#7838) 2024-08-24 17:31:28 +00:00
youkaichao
ea9fa160e3
[ci][test] exclude model download time in server start time (#7834) 2024-08-24 01:03:27 -07:00
youkaichao
7d9ffa2ae1
[misc][core] lazy import outlines (#7831) 2024-08-24 00:51:38 -07:00
Tyler Rockwood
d81abefd2e
[Frontend] add json_schema support from OpenAI protocol (#7654) 2024-08-23 23:07:24 -07:00
Pooya Davoodi
8da48e4d95
[Frontend] Publish Prometheus metrics in run_batch API (#7641) 2024-08-23 23:04:22 -07:00
Pooya Davoodi
6885fde317
[Bugfix] Fix run_batch logger (#7640) 2024-08-23 13:58:26 -07:00
Alexander Matveev
9db93de20c
[Core] Add multi-step support to LLMEngine (#7789) 2024-08-23 12:45:53 -07:00
Simon Mo
09c7792610
Bump version to v0.5.5 (#7823) 2024-08-23 11:35:33 -07:00
Dipika Sikka
f1df5dbfd6
[Misc] Update marlin to use vLLMParameters (#7803) 2024-08-23 14:30:52 -04:00
youkaichao
35ee2ad6b9
[github][misc] promote asking llm first (#7809) 2024-08-23 09:38:50 -07:00
Maximilien de Bayser
e25fee57c2
[BugFix] Fix server crash on empty prompt (#7746)
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
2024-08-23 13:12:44 +00:00
Jie Fu (傅杰)
faeddb565d
[misc] Add Torch profiler support for CPU-only devices (#7806) 2024-08-23 05:46:25 +00:00
Kunshang Ji
fc5ebbd1d3
[Hardware][Intel GPU] refactor xpu_model_runner for tp (#7712) 2024-08-22 20:06:54 -07:00
SangBin Cho
c01a6cb231
[Ray backend] Better error when pg topology is bad. (#7584)
Co-authored-by: youkaichao <youkaichao@126.com>
2024-08-22 17:44:25 -07:00
Joe Runde
b903e1ba7f
[Frontend] error suppression cleanup (#7786)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-08-22 21:50:21 +00:00
Siyuan Liu
a152246428
[Misc] fix typo in triton import warning (#7794) 2024-08-22 13:51:23 -07:00
Kevin H. Luu
666ad0aa16
[ci] Cleanup & refactor Dockerfile to pass different Python versions and sccache bucket via build args (#7705)
Signed-off-by: kevin <kevin@anyscale.com>
2024-08-22 20:10:55 +00:00
Michael Goin
15310b5101
[Bugfix] Use LoadFormat values for vllm serve --load-format (#7784) 2024-08-22 11:37:08 -07:00
Peter Salas
57792ed469
[Doc] Fix incorrect docs from #7615 (#7788) 2024-08-22 10:02:06 -07:00
Jiaxin Shan
d3b5b98021
[Misc] Enhance prefix-caching benchmark tool (#6568) 2024-08-22 09:32:02 -07:00
Travis Johnson
cc0eaf12b1
[Bugfix] spec decode handle None entries in topk args in create_sequence_group_output (#7232)
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
2024-08-22 09:33:48 -04:00
Dipika Sikka
955b5191c9
[Misc] update fp8 to use vLLMParameter (#7437) 2024-08-22 08:36:18 -04:00
Lucas Wilkinson
55d63b1211
[Bugfix] Don't build machete on cuda <12.0 (#7757) 2024-08-22 08:28:52 -04:00
Flex Wang
4f419c00a6
Fix ShardedStateLoader for vllm fp8 quantization (#7708) 2024-08-22 08:25:04 -04:00
Abhinav Goyal
a3fce56b88
[Speculative Decoding] EAGLE Implementation with Top-1 proposer (#6830) 2024-08-22 02:42:24 -07:00
Woosuk Kwon
b3856bef7d
[Misc] Use torch.compile for GemmaRMSNorm (#7642) 2024-08-22 01:14:13 -07:00
youkaichao
8c6f694a79
[ci] refine dependency for distributed tests (#7776) 2024-08-22 00:54:15 -07:00
Woosuk Kwon
eeee1c3b1a
[TPU] Avoid initializing TPU runtime in is_tpu (#7763) 2024-08-21 21:31:49 -07:00
Michael Goin
aae74ef95c
Revert "[Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel (#7527)" (#7764) 2024-08-22 03:42:14 +00:00
Joe Runde
cde9183b40
[Bug][Frontend] Improve ZMQ client robustness (#7443)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
2024-08-22 02:18:11 +00:00
zifeitong
df1a21131d
[Model] Fix Phi-3.5-vision-instruct 'num_crops' issue (#7710) 2024-08-22 09:36:24 +08:00
Luka Govedič
7937009a7e
[Kernel] Replaced blockReduce[...] functions with cub::BlockReduce (#7233)
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-08-21 20:18:00 -04:00
Gregory Shtrasberg
9984605412
[AMD][CI/Build] Disambiguation of the function call for ROCm 6.2 headers compatibility (#7477)
Co-authored-by: Charlie Fu <Charlie.Fu@amd.com>
2024-08-21 16:47:36 -07:00