Commit Graph

2303 Commits

Author SHA1 Message Date
Dipika Sikka
b1e5afc3e7
[Misc] Update awq and awq_marlin to use vLLMParameters (#7422) 2024-08-13 17:08:20 -04:00
Dipika Sikka
d3bdfd3ab9
[Misc] Update Fused MoE weight loading (#7334) 2024-08-13 14:57:45 -04:00
Dipika Sikka
fb377d7e74
[Misc] Update gptq_marlin to use new vLLMParameters (#7281) 2024-08-13 14:30:11 -04:00
Dipika Sikka
181abbc27d
[Misc] Update LM Eval Tolerance (#7473) 2024-08-13 14:28:14 -04:00
Peter Salas
00c3d68e45
[Frontend][Core] Add plumbing to support audio language models (#7446) 2024-08-13 17:39:33 +00:00
Woosuk Kwon
e20233d361
Revert "[Doc] Update supported_hardware.rst (#7276)" (#7467) 2024-08-13 01:37:08 -07:00
Woosuk Kwon
d6e634f3d7
[TPU] Suppress import custom_ops warning (#7458) 2024-08-13 00:30:30 -07:00
youkaichao
4d2dc5072b
[hardware] unify usage of is_tpu to current_platform.is_tpu() (#7102) 2024-08-13 00:16:42 -07:00
Cyrus Leung
7025b11d94
[Bugfix] Fix weight loading for Chameleon when TP>1 (#7410) 2024-08-13 05:33:41 +00:00
Kevin H. Luu
5469146bcc
[ci] Remove fast check cancel workflow (#7455) 2024-08-12 21:19:51 -07:00
Andrew Wang
97a6be95ba
[Misc] improve logits processors logging message (#7435) 2024-08-13 02:29:34 +00:00
Cyrus Leung
9ba85bc152
[mypy] Misc. typing improvements (#7417) 2024-08-13 09:20:20 +08:00
Rui Qiao
198d6a2898
[Core] Shut down aDAG workers with clean async llm engine exit (#7224)
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
2024-08-12 17:57:16 -07:00
Daniele
774cd1d3bf
[CI/Build] bump minimum cmake version (#6999) 2024-08-12 16:29:20 -07:00
sasha0552
91294d56e1
[Bugfix] Handle PackageNotFoundError when checking for xpu version (#7398) 2024-08-12 16:07:20 -07:00
jon-chuang
a046f86397
[Core/Bugfix] Add FP8 K/V Scale and dtype conversion for prefix/prefill Triton Kernel (#7208)
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2024-08-12 22:47:41 +00:00
Cyrus Leung
4ddc4743d7
[Core] Consolidate GB constant and enable float GB arguments (#7416) 2024-08-12 14:14:14 -07:00
Lucas Wilkinson
6aa33cb2dd
[Misc] Use scalar type to dispatch to different gptq_marlin kernels (#7323) 2024-08-12 14:40:13 -04:00
Kevin H. Luu
1137f343aa
[ci] Cancel fastcheck when PR is ready (#7433)
Signed-off-by: kevin <kevin@anyscale.com>
2024-08-12 10:59:14 -07:00
Kevin H. Luu
9b3e2edd30
[ci] Cancel fastcheck run when PR is marked ready (#7427)
Signed-off-by: kevin <kevin@anyscale.com>
2024-08-12 10:56:52 -07:00
Kevin H. Luu
65950e8f58
[ci] Entrypoints run upon changes in vllm/ (#7423)
Signed-off-by: kevin <kevin@anyscale.com>
2024-08-12 10:18:03 -07:00
Woosuk Kwon
cfba4def5d
[Bugfix] Fix logit soft cap in flash-attn backend (#7425) 2024-08-12 09:58:28 -07:00
Daniele
d2bc4510a4
[CI/Build] bump Dockerfile.neuron image base, use public ECR (#6832) 2024-08-12 09:53:35 -07:00
Cyrus Leung
24154f8618
[Frontend] Disallow passing model as both argument and option (#7347) 2024-08-12 12:58:34 +00:00
Roger Wang
e6e42e4b17
[Core][VLM] Support image embeddings as input (#6613) 2024-08-12 16:16:06 +08:00
Lily Liu
ec2affa8ae
[Kernel] Flashinfer correctness fix for v0.1.3 (#7319) 2024-08-12 07:59:17 +00:00
Roger Wang
86ab567bae
[CI/Build] Minor refactoring for vLLM assets (#7407) 2024-08-12 02:41:52 +00:00
Simon Mo
f020a6297e
[Docs] Update readme (#7316) 2024-08-11 17:13:37 -07:00
youkaichao
6c8e595710
[misc] add commit id in collect env (#7405) 2024-08-11 15:40:48 -07:00
tomeras91
02b1988b9f
[Doc] building vLLM with VLLM_TARGET_DEVICE=empty (#7403) 2024-08-11 14:38:17 -07:00
tomeras91
386087970a
[CI/Build] build on empty device for better dev experience (#4773) 2024-08-11 13:09:44 -07:00
William Lin
c08e2b3086
[core] [2/N] refactor worker_base input preparation for multi-step (#7387) 2024-08-11 08:50:08 -07:00
Noam Gat
4fb7b52a2c
Updating LM Format Enforcer version to v0.10.6 (#7189) 2024-08-11 08:11:50 -04:00
Woosuk Kwon
90bab18f24
[TPU] Use mark_dynamic to reduce compilation time (#7340) 2024-08-10 18:12:22 -07:00
Isotr0py
4c5d8e8ea9
[Bugfix] Fix phi3v batch inference when images have different aspect ratio (#7392) 2024-08-10 16:19:33 +00:00
Cade Daniel
baa240252e
[Core] Fix edge case in chunked prefill + block manager v2 (#7380) 2024-08-09 23:48:49 +00:00
Antoni Baum
999ef0b917
[Misc] Add numpy implementation of compute_slot_mapping (#7377) 2024-08-09 22:52:29 +00:00
Dipika Sikka
5c6c54d67a
[Bugfix] Fix PerTensorScaleParameter weight loading for fused models (#7376) 2024-08-09 21:23:46 +00:00
Mahesh Keralapura
933790c209
[Core] Add span metrics for model_forward, scheduler and sampler time (#7089) 2024-08-09 13:55:13 -07:00
Roger Wang
70d268a399
[Bugfix] Fix ITL recording in serving benchmark (#7372) 2024-08-09 10:00:00 -07:00
Pooya Davoodi
249b88228d
[Frontend] Support embeddings in the run_batch API (#7132)
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-08-09 09:48:21 -07:00
Alexander Matveev
74af2bbd90
[Bugfix] Fix reinit procedure in ModelInputForGPUBuilder (#7360) 2024-08-09 16:35:49 +00:00
Alexander Matveev
fc7b8d1eef
[Performance] e2e overheads reduction: Small followup diff (#7364) 2024-08-09 15:49:36 +00:00
Isotr0py
67abdbb42f
[VLM][Doc] Add stop_token_ids to InternVL example (#7354) 2024-08-09 14:51:04 +00:00
Mor Zusman
07ab160741
[Model][Jamba] Mamba cache single buffer (#6739)
Co-authored-by: Mor Zusman <morz@ai21.com>
2024-08-09 10:07:06 -04:00
Nick Hill
b4e9528f95
[Core] Streamline stream termination in AsyncLLMEngine (#7336) 2024-08-09 07:06:36 +00:00
William Lin
57b7be0e1c
[Speculative decoding] [Multi-Step] decouple should_modify_greedy_probs_inplace (#6971) 2024-08-09 05:42:45 +00:00
Travis Johnson
99b4cf5f23
[Bugfix] Fix speculative decoding with MLPSpeculator with padded vocabulary (#7218)
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
2024-08-08 22:08:46 -07:00
Alexander Matveev
e02ac55617
[Performance] Optimize e2e overheads: Reduce python allocations (#7162) 2024-08-08 21:34:28 -07:00
Woosuk Kwon
73388c07a4
[TPU] Fix dockerfile.tpu (#7331) 2024-08-08 20:24:58 -07:00