Commit Graph

2388 Commits

Author | SHA1 | Message | Date
jon-chuang | a046f86397 | [Core/Bugfix] Add FP8 K/V Scale and dtype conversion for prefix/prefill Triton Kernel (#7208) | 2024-08-12 22:47:41 +00:00
    Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
Cyrus Leung | 4ddc4743d7 | [Core] Consolidate GB constant and enable float GB arguments (#7416) | 2024-08-12 14:14:14 -07:00
Lucas Wilkinson | 6aa33cb2dd | [Misc] Use scalar type to dispatch to different gptq_marlin kernels (#7323) | 2024-08-12 14:40:13 -04:00
Kevin H. Luu | 1137f343aa | [ci] Cancel fastcheck when PR is ready (#7433) | 2024-08-12 10:59:14 -07:00
    Signed-off-by: kevin <kevin@anyscale.com>
Kevin H. Luu | 9b3e2edd30 | [ci] Cancel fastcheck run when PR is marked ready (#7427) | 2024-08-12 10:56:52 -07:00
    Signed-off-by: kevin <kevin@anyscale.com>
Kevin H. Luu | 65950e8f58 | [ci] Entrypoints run upon changes in vllm/ (#7423) | 2024-08-12 10:18:03 -07:00
    Signed-off-by: kevin <kevin@anyscale.com>
Woosuk Kwon | cfba4def5d | [Bugfix] Fix logit soft cap in flash-attn backend (#7425) | 2024-08-12 09:58:28 -07:00
Daniele | d2bc4510a4 | [CI/Build] bump Dockerfile.neuron image base, use public ECR (#6832) | 2024-08-12 09:53:35 -07:00
Cyrus Leung | 24154f8618 | [Frontend] Disallow passing model as both argument and option (#7347) | 2024-08-12 12:58:34 +00:00
Roger Wang | e6e42e4b17 | [Core][VLM] Support image embeddings as input (#6613) | 2024-08-12 16:16:06 +08:00
Lily Liu | ec2affa8ae | [Kernel] Flashinfer correctness fix for v0.1.3 (#7319) | 2024-08-12 07:59:17 +00:00
Roger Wang | 86ab567bae | [CI/Build] Minor refactoring for vLLM assets (#7407) | 2024-08-12 02:41:52 +00:00
Simon Mo | f020a6297e | [Docs] Update readme (#7316) | 2024-08-11 17:13:37 -07:00
youkaichao | 6c8e595710 | [misc] add commit id in collect env (#7405) | 2024-08-11 15:40:48 -07:00
tomeras91 | 02b1988b9f | [Doc] building vLLM with VLLM_TARGET_DEVICE=empty (#7403) | 2024-08-11 14:38:17 -07:00
tomeras91 | 386087970a | [CI/Build] build on empty device for better dev experience (#4773) | 2024-08-11 13:09:44 -07:00
William Lin | c08e2b3086 | [core] [2/N] refactor worker_base input preparation for multi-step (#7387) | 2024-08-11 08:50:08 -07:00
Noam Gat | 4fb7b52a2c | Updating LM Format Enforcer version to v0.10.6 (#7189) | 2024-08-11 08:11:50 -04:00
Woosuk Kwon | 90bab18f24 | [TPU] Use mark_dynamic to reduce compilation time (#7340) | 2024-08-10 18:12:22 -07:00
Isotr0py | 4c5d8e8ea9 | [Bugfix] Fix phi3v batch inference when images have different aspect ratio (#7392) | 2024-08-10 16:19:33 +00:00
Cade Daniel | baa240252e | [Core] Fix edge case in chunked prefill + block manager v2 (#7380) | 2024-08-09 23:48:49 +00:00
Antoni Baum | 999ef0b917 | [Misc] Add numpy implementation of compute_slot_mapping (#7377) | 2024-08-09 22:52:29 +00:00
Dipika Sikka | 5c6c54d67a | [Bugfix] Fix PerTensorScaleParameter weight loading for fused models (#7376) | 2024-08-09 21:23:46 +00:00
Mahesh Keralapura | 933790c209 | [Core] Add span metrics for model_forward, scheduler and sampler time (#7089) | 2024-08-09 13:55:13 -07:00
Roger Wang | 70d268a399 | [Bugfix] Fix ITL recording in serving benchmark (#7372) | 2024-08-09 10:00:00 -07:00
Pooya Davoodi | 249b88228d | [Frontend] Support embeddings in the run_batch API (#7132) | 2024-08-09 09:48:21 -07:00
    Co-authored-by: Simon Mo <simon.mo@hey.com>
Alexander Matveev | 74af2bbd90 | [Bugfix] Fix reinit procedure in ModelInputForGPUBuilder (#7360) | 2024-08-09 16:35:49 +00:00
Alexander Matveev | fc7b8d1eef | [Performance] e2e overheads reduction: Small followup diff (#7364) | 2024-08-09 15:49:36 +00:00
Isotr0py | 67abdbb42f | [VLM][Doc] Add stop_token_ids to InternVL example (#7354) | 2024-08-09 14:51:04 +00:00
Mor Zusman | 07ab160741 | [Model][Jamba] Mamba cache single buffer (#6739) | 2024-08-09 10:07:06 -04:00
    Co-authored-by: Mor Zusman <morz@ai21.com>
Nick Hill | b4e9528f95 | [Core] Streamline stream termination in AsyncLLMEngine (#7336) | 2024-08-09 07:06:36 +00:00
William Lin | 57b7be0e1c | [Speculative decoding] [Multi-Step] decouple should_modify_greedy_probs_inplace (#6971) | 2024-08-09 05:42:45 +00:00
Travis Johnson | 99b4cf5f23 | [Bugfix] Fix speculative decoding with MLPSpeculator with padded vocabulary (#7218) | 2024-08-08 22:08:46 -07:00
    Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
Alexander Matveev | e02ac55617 | [Performance] Optimize e2e overheads: Reduce python allocations (#7162) | 2024-08-08 21:34:28 -07:00
Woosuk Kwon | 73388c07a4 | [TPU] Fix dockerfile.tpu (#7331) | 2024-08-08 20:24:58 -07:00
Cyrus Leung | 7eb4a51c5f | [Core] Support serving encoder/decoder models (#7258) | 2024-08-09 10:39:41 +08:00
Siyuan Liu | 0fa14907da | [TPU] Add Load-time W8A16 quantization for TPU Backend (#7005) | 2024-08-08 18:35:49 -07:00
Simon Mo | 5923532e15 | Add Skywork AI as Sponsor (#7314) | 2024-08-08 13:59:57 -07:00
Jee Jee Li | a049b107e2 | [Misc] Temporarily resolve the error of BitAndBytes (#7308) | 2024-08-08 13:42:58 -07:00
Isotr0py | 8334c39f37 | [Bugfix] Fix new Llama3.1 GGUF model loading (#7269) | 2024-08-08 13:42:44 -07:00
Daniele | e904576743 | [CI/Build] Dockerfile.cpu improvements (#7298) | 2024-08-08 15:24:52 -04:00
Michael Goin | e14fb22e59 | [Doc] Put collect_env issue output in a <detail> block (#7310) | 2024-08-08 11:22:49 -07:00
Zach Zheng | 782e53ab59 | [Bugfix][fast] Fix the get_num_blocks_touched logic (#6849) | 2024-08-08 10:43:30 -07:00
Joe Runde | 21b9c49aa3 | [Frontend] Kill the server on engine death (#6594) | 2024-08-08 09:47:48 -07:00
    Signed-off-by: Joe Runde <joe@joerun.de>
    Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
Luka Govedič | 5fb4a3f678 | [Bugfix][Kernel] Increased atol to fix failing tests (#7305) | 2024-08-08 12:16:13 -04:00
Jee Jee Li | 757ac70a64 | [Model] Rename MiniCPMVQwen2 to MiniCPMV2.6 (#7273) | 2024-08-08 14:02:41 +00:00
Murali Andoorveedu | 6dffa4b0a6 | [Bugfix] Fix LoRA with PP (#7292) | 2024-08-08 00:02:27 -07:00
Cherilyn Buren | 48abee9e54 | [Frontend] remove max_num_batched_tokens limit for lora (#7288) | 2024-08-08 06:17:29 +00:00
Rui Qiao | 746709642c | [Misc] Fix typos in scheduler.py (#7285) | 2024-08-07 17:06:01 -07:00
    Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
Lily Liu | e53dfd3eaf | [Kernel] Fix Flashinfer Correctness (#7284) | 2024-08-07 16:26:52 -07:00