Commit Graph

1218 Commits

Author SHA1 Message Date
Daniel E Marasco
e4c4072c94
[Bugfix] Remove key sorting for guided_json parameter in OpenAI-compatible server (#3945) 2024-04-10 10:15:51 -07:00
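For context on the guided_json fix above: vLLM's OpenAI-compatible server accepts a guided_json field (passed through the client's extra_body) that constrains generation to a JSON schema, and the bugfix preserves the schema's original key order instead of sorting it. A minimal sketch, assuming a server already running on localhost:8000 and the openai>=1.x client; the model name is a placeholder:

```python
# Sketch: constrain generation to a JSON schema via vLLM's guided_json
# extension field. Assumes a vLLM OpenAI-compatible server on
# localhost:8000; the model name below is a placeholder.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
    },
    "required": ["name", "age"],
}

resp = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",  # placeholder
    messages=[{"role": "user", "content": "Invent a person and answer in JSON."}],
    extra_body={"guided_json": schema},  # vLLM-specific field; key order now preserved
)
print(resp.choices[0].message.content)
```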
youkaichao
e35397468f
[Doc] Add doc to state our model support policy (#3948)
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
2024-04-10 17:03:02 +00:00
James Whedbee
8b317c6dd0
[Model][AMD] ROCm support for 256 head dims for Gemma (#3972) 2024-04-10 08:12:00 -07:00
Woosuk Kwon
bd3c144e0b
[Bugfix][ROCm] Add numba to Dockerfile.rocm (#3962) 2024-04-10 07:37:17 -07:00
Travis Johnson
0258b7a94b
[Bugfix] handle prompt_logprobs in _apply_min_tokens_penalty (#3876)
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
2024-04-10 01:39:56 -07:00
胡译文
b3104b2a10
[Bugfix] Fix logits processor when prompt_logprobs is not None (#3899) 2024-04-10 00:09:36 -07:00
zhaotyer
c2e00af523
[Bugfix] Fix TypeError: 'type' object is not subscriptable in utils.py merge_dict (#3955)
Co-authored-by: tianyi_zhao <tianyi.zhao@transwarp.io>
2024-04-10 04:49:11 +00:00
Zedong Peng
c013d32c75
[Benchmark] Add cpu options to bench scripts (#3915) 2024-04-09 21:30:03 -07:00
Jee Li
11dd6ebb89
[Misc] Avoid loading incorrect LoRA config (#3777) 2024-04-09 19:47:15 -07:00
Juan Villamizar
6c0b04515f
[ROCm][Hardware][AMD] Use Triton Kernel for default FA on ROCm (#3643)
Co-authored-by: jpvillam <jpvillam@amd.com>
Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-04-09 15:10:47 -07:00
Junichi Sato
e23a43aef8
[Bugfix] Fix KeyError on loading GPT-NeoX (#3925) 2024-04-09 12:11:31 -07:00
Cade Daniel
e7c7067b45
[Misc] [Core] Implement RFC "Augment BaseExecutor interfaces to enable hardware-agnostic speculative decoding" (#3837) 2024-04-09 11:44:15 -07:00
youkaichao
6d592eb430
[Core] separate distributed_init from worker (#3904) 2024-04-09 08:49:02 +00:00
Roy
d036198e23
[BugFix][Model] Fix commandr RoPE max_position_embeddings (#3919) 2024-04-09 06:17:21 +08:00
Matt Wong
59a6abf3c9
[Hotfix][CI/Build][Kernel] CUDA 11.8 does not support layernorm optimizations (#3782) 2024-04-08 14:31:02 -07:00
Kiran R
bc0c0192d1
[Bugfix] Enable Proper attention_bias Usage in Llama Model Configuration (#3767)
Co-authored-by: roy <jasonailu87@gmail.com>
2024-04-08 19:42:35 +00:00
egortolmachev
f46864d68d
[Bugfix] Added Command-R GPTQ support (#3849)
Co-authored-by: Egor Tolmachev <t333ga@gmail.com>
2024-04-08 14:59:38 +00:00
ywfang
b4543c8f6b
[Model] Add MiniCPM (#3893) 2024-04-08 18:28:36 +08:00
Isotr0py
0ce0539d47
[Bugfix] Fix Llava inference with Tensor Parallelism. (#3883) 2024-04-07 22:54:13 +08:00
youkaichao
2f19283549
[Core] latency optimization (#3890) 2024-04-06 19:14:06 -07:00
youkaichao
95baec828f
[Core] enable out-of-tree model register (#3871) 2024-04-06 17:11:41 -07:00
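The out-of-tree model registration commit above (#3871) exposes vLLM's model registry so downstream code can plug in new architectures without patching the package itself. A minimal sketch, assuming a user-defined MyLlamaForCausalLM class that implements vLLM's model interface (the class name and module are hypothetical):

```python
# Sketch: registering an out-of-tree model architecture (PR #3871).
# MyLlamaForCausalLM is a hypothetical class implementing vLLM's model
# interface; the registered name must match the architecture string in
# the checkpoint's config.json.
from vllm import ModelRegistry
from my_package.modeling import MyLlamaForCausalLM  # hypothetical module

ModelRegistry.register_model("MyLlamaForCausalLM", MyLlamaForCausalLM)

# After registration, the model loads like any built-in one:
# from vllm import LLM
# llm = LLM(model="/path/to/my-llama-checkpoint")
```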
youkaichao
e4be7d70bb
[CI/Benchmark] add more iterations and use median for a robust latency benchmark (#3889) 2024-04-06 21:32:30 +00:00
Isotr0py
54951ac4bf
[Bugfix] Fix incorrect output on OLMo models in Tensor Parallelism (#3869) 2024-04-05 12:02:09 -07:00
SangBin Cho
18de883489
[Chunked Prefill][4/n] Chunked prefill scheduler. (#3853) 2024-04-05 10:17:58 -07:00
Thomas Parnell
1d7c940d74
Add option to completion API to truncate prompt tokens (#3144) 2024-04-05 10:15:42 -07:00
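The completion-API commit above (#3144) adds a truncate_prompt_tokens option: when set to an integer k, only the last k tokens of the prompt are used. A minimal sketch via the OpenAI client's extra_body, assuming a running vLLM server; the model name is a placeholder:

```python
# Sketch: truncating a long prompt to its last k tokens (PR #3144).
# Assumes a vLLM OpenAI-compatible server; model name is a placeholder.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.completions.create(
    model="meta-llama/Llama-2-7b-hf",  # placeholder
    prompt="some filler text " * 5000,  # an over-long prompt
    max_tokens=32,
    extra_body={"truncate_prompt_tokens": 1024},  # keep only the last 1024 prompt tokens
)
print(resp.choices[0].text)
```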
Woosuk Kwon
cfaf49a167
[Misc] Define common requirements (#3841) 2024-04-05 00:39:17 -07:00
Noam Gat
9edec652e2
[Bugfix] Fixing requirements.txt (#3865) 2024-04-04 23:46:01 -07:00
Cade Daniel
e0dd4d3589
[Misc] Fix linter issues in examples/fp8/quantizer/quantize.py (#3864) 2024-04-04 21:57:33 -07:00
Cade Daniel
e5043a3e75
[Misc] Add pytest marker to opt-out of global test cleanup (#3863) 2024-04-04 21:54:16 -07:00
youkaichao
d03d64fd2e
[CI/Build] fix pip cache with vllm_nccl & refactor dockerfile to build wheels (#3859)
2024-04-04 21:53:16 -07:00
Sean Gallen
78107fa091
[Doc] Add asynchronous engine arguments to documentation (#3810)
Co-authored-by: Simon Mo <simon.mo@hey.com>
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
2024-04-04 21:52:01 -07:00
youkaichao
c391e4b68e
[Core] improve robustness of pynccl (#3860) 2024-04-04 16:52:12 -07:00
Saurabh Dash
9117f892f0
[Model] Cohere CommandR+ (#3829) 2024-04-04 13:31:49 -07:00
Michael Goin
db2a6a41e2
[Hardware][CPU] Update cpu torch to match default of 2.2.1 (#3854) 2024-04-04 19:49:49 +00:00
youkaichao
ca81ff5196
[Core] manage nccl via a pypi package & upgrade to pt 2.2.1 (#3805) 2024-04-04 10:26:19 -07:00
TianYu GUO
b7782002e1
[Benchmark] Refactor sample_requests in benchmark_throughput (#3613)
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-04-04 09:56:22 +00:00
Chang Su
819a309c0f
[Bugfix] Fix args in benchmark_serving (#3836)
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-04-04 07:41:05 +00:00
Matthias Gerstgrasser
aabe8f40f2
[Core] [Frontend] Make detokenization optional (#3749)
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2024-04-03 21:52:18 -07:00
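The detokenization commit above (#3749) makes the decode step skippable for callers that only need token ids, for example when output is piped into another tokenizer-aware component. A minimal sketch, assuming the detokenize flag on SamplingParams that the PR title describes; the model is the small OPT checkpoint commonly used in vLLM examples:

```python
# Sketch: skipping detokenization when only token ids are needed (PR #3749).
# With detokenize=False, outputs carry token_ids while the text field is
# left unpopulated, saving the decode cost.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")
params = SamplingParams(max_tokens=16, detokenize=False)

out = llm.generate(["Hello, my name is"], params)
print(out[0].outputs[0].token_ids)  # raw ids; .text is not filled in
```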
Woosuk Kwon
498eb5cfa3
[Bugfix] Add kv_scale input parameter to CPU backend (#3840) 2024-04-04 04:33:08 +00:00
Michael Feil
537ee25f43
[Core] Enable hf_transfer by default if available (#3817) 2024-04-04 04:02:43 +00:00
Tao He
294f8f6665
[BugFix] Pass tokenizer_config to local_tokenizer_group (#3754)
Signed-off-by: Tao He <sighingnow@gmail.com>
2024-04-03 20:31:46 -07:00
Woosuk Kwon
b95047f2da
[Misc] Publish 3rd meetup slides (#3835) 2024-04-03 15:46:10 -07:00
Adrian Abeyta
2ff767b513
Enable scaled FP8 (e4m3fn) KV cache on ROCm (AMD GPU) (#3290)
Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Co-authored-by: HaiShaw <hixiao@gmail.com>
Co-authored-by: AdrianAbeyta <Adrian.Abeyta@amd.com>
Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com>
Co-authored-by: root <root@gt-pla-u18-08.pla.dcgpu>
Co-authored-by: mawong-amd <156021403+mawong-amd@users.noreply.github.com>
Co-authored-by: ttbachyinsda <ttbachyinsda@outlook.com>
Co-authored-by: guofangze <guofangze@kuaishou.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: jacobthebanana <50071502+jacobthebanana@users.noreply.github.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-04-03 14:15:55 -07:00
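For the scaled FP8 KV-cache commit above (#3290): the e4m3fn format needs per-tensor scaling factors, which vLLM reads from a JSON file supplied by the user. A minimal sketch, assuming the kv_cache_dtype and quantization_param_path engine arguments that this line of work introduced; the model name and scales-file path are placeholders:

```python
# Sketch: enabling an fp8 (e4m3fn) KV cache with externally supplied
# scaling factors on ROCm (PR #3290). Model name and scales-file path
# are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-7b-chat-hf",          # placeholder
    kv_cache_dtype="fp8",                            # e4m3fn on ROCm per this PR
    quantization_param_path="kv_cache_scales.json",  # per-tensor KV scales
)
print(llm.generate(["Hello"], SamplingParams(max_tokens=8))[0].outputs[0].text)
```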
SangBin Cho
3dcb3e8b98
[3/N] Refactor scheduler for chunked prefill scheduling (#3550) 2024-04-03 14:13:49 -07:00
Michael Feil
c64cf38673
[Doc] Update contribution guidelines for better onboarding (#3819) 2024-04-03 07:31:43 +00:00
Robert Shaw
76b889bf1d
[Doc] Update README.md (#3806) 2024-04-02 23:11:10 -07:00
Nick Hill
c9b506dad4
[BugFix] Use different mechanism to get vllm version in is_cpu() (#3804) 2024-04-02 23:06:25 -07:00
Cade Daniel
5757d90e26
[Speculative decoding] Adding configuration object for speculative decoding (#3706)
Co-authored-by: Lily Liu <lilyliupku@gmail.com>
2024-04-03 00:40:57 +00:00
youkaichao
a3c226e7eb
[CI/Build] 0.4.0.post1, fix sm 7.0/7.5 binary (#3803) 2024-04-02 12:57:04 -07:00
Michael Goin
b321d4881b
[Bugfix] Add __init__.py files for vllm/core/block/ and vllm/spec_decode/ (#3798) 2024-04-02 12:35:31 -07:00