e4c4072c94 | 2024-04-10 10:15:51 -07:00 | Daniel E Marasco | [Bugfix] Remove key sorting for guided_json parameter in OpenAI-compatible server (#3945)
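The key-sorting bugfix above matters because serializing a JSON schema with sorted keys reorders `properties`, changing the field order a guided decoder enforces. A minimal illustration in plain Python (no vLLM required; the schema is hypothetical):

```python
import json

# Hypothetical JSON schema a client might pass as guided_json.
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
    },
}

plain_dump = json.dumps(schema)                   # preserves insertion order
sorted_dump = json.dumps(schema, sort_keys=True)  # silently reorders keys

# With sort_keys=True, "age" jumps ahead of "name" inside properties,
# which changes the field order a guided decoder would enforce.
assert plain_dump.index('"name"') < plain_dump.index('"age"')
assert sorted_dump.index('"age"') < sorted_dump.index('"name"')
```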
e35397468f | 2024-04-10 17:03:02 +00:00 | youkaichao | [Doc] Add doc to state our model support policy (#3948)
    Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
8b317c6dd0 | 2024-04-10 08:12:00 -07:00 | James Whedbee | [Model][AMD] ROCm support for 256 head dims for Gemma (#3972)
bd3c144e0b | 2024-04-10 07:37:17 -07:00 | Woosuk Kwon | [Bugfix][ROCm] Add numba to Dockerfile.rocm (#3962)
0258b7a94b | 2024-04-10 01:39:56 -07:00 | Travis Johnson | [Bugfix] Handle prompt_logprobs in _apply_min_tokens_penalty (#3876)
    Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
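For context on #3876: a min-tokens penalty blocks the stop token until a sequence has produced a minimum number of tokens; the bug concerned indexing logits correctly when prompt_logprobs rows are present. A simplified, self-contained sketch of the penalty itself (plain Python lists standing in for a logits tensor; not vLLM's actual implementation):

```python
def apply_min_tokens_penalty(logits, eos_token_id, num_generated, min_tokens):
    """Mask the EOS logit until at least `min_tokens` tokens were generated."""
    if num_generated < min_tokens:
        logits = list(logits)
        logits[eos_token_id] = float("-inf")  # EOS cannot be sampled yet
    return logits

logits = [0.1, 2.0, 0.5, 1.5]  # toy vocabulary of 4 tokens; token 2 is EOS
masked = apply_min_tokens_penalty(logits, eos_token_id=2, num_generated=1, min_tokens=4)
unmasked = apply_min_tokens_penalty(logits, eos_token_id=2, num_generated=5, min_tokens=4)
```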
b3104b2a10 | 2024-04-10 00:09:36 -07:00 | 胡译文 | [Bugfix] Fix logits processor when prompt_logprobs is not None (#3899)
c2e00af523 | 2024-04-10 04:49:11 +00:00 | zhaotyer | [Bugfix] Fix `TypeError: 'type' object is not subscriptable` in utils.py merge_dict (#3955)
    Co-authored-by: tianyi_zhao <tianyi.zhao@transwarp.io>
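That `TypeError` typically comes from annotating with `dict[...]`/`list[...]` on Python versions before 3.9, where builtin types are not subscriptable; `typing.Dict`/`typing.List` work on all supported versions. A sketch of a merge_dict helper with portable annotations (the signature is illustrative, not vLLM's exact one):

```python
from typing import Dict, List

def merge_dict(dict1: Dict[str, List[int]], dict2: Dict[str, List[int]]) -> Dict[str, List[int]]:
    """Merge dict2 into dict1, concatenating list values on key collisions."""
    merged = {k: list(v) for k, v in dict1.items()}
    for k, v in dict2.items():
        merged[k] = merged.get(k, []) + list(v)
    return merged

merged = merge_dict({"a": [1], "b": [2]}, {"b": [3], "c": [4]})
```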
c013d32c75 | 2024-04-09 21:30:03 -07:00 | Zedong Peng | [Benchmark] Add CPU options to bench scripts (#3915)
11dd6ebb89 | 2024-04-09 19:47:15 -07:00 | Jee Li | [Misc] Avoid loading incorrect LoRA config (#3777)
6c0b04515f | 2024-04-09 15:10:47 -07:00 | Juan Villamizar | [ROCm][Hardware][AMD] Use Triton kernel for default FA on ROCm (#3643)
    Co-authored-by: jpvillam <jpvillam@amd.com>
    Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
    Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
e23a43aef8 | 2024-04-09 12:11:31 -07:00 | Junichi Sato | [Bugfix] Fix KeyError on loading GPT-NeoX (#3925)
e7c7067b45 | 2024-04-09 11:44:15 -07:00 | Cade Daniel | [Misc][Core] Implement RFC "Augment BaseExecutor interfaces to enable hardware-agnostic speculative decoding" (#3837)
6d592eb430 | 2024-04-09 08:49:02 +00:00 | youkaichao | [Core] Separate distributed_init from worker (#3904)
d036198e23 | 2024-04-09 06:17:21 +08:00 | Roy | [BugFix][Model] Fix commandr RoPE max_position_embeddings (#3919)
59a6abf3c9 | 2024-04-08 14:31:02 -07:00 | Matt Wong | [Hotfix][CI/Build][Kernel] CUDA 11.8 does not support layernorm optimizations (#3782)
bc0c0192d1 | 2024-04-08 19:42:35 +00:00 | Kiran R | [Bugfix] Enable proper attention_bias usage in Llama model configuration (#3767)
    Co-authored-by: roy <jasonailu87@gmail.com>
f46864d68d | 2024-04-08 14:59:38 +00:00 | egortolmachev | [Bugfix] Add Command-R GPTQ support (#3849)
    Co-authored-by: Egor Tolmachev <t333ga@gmail.com>
b4543c8f6b | 2024-04-08 18:28:36 +08:00 | ywfang | [Model] Add MiniCPM (#3893)
0ce0539d47 | 2024-04-07 22:54:13 +08:00 | Isotr0py | [Bugfix] Fix LLaVA inference with tensor parallelism (#3883)
2f19283549 | 2024-04-06 19:14:06 -07:00 | youkaichao | [Core] Latency optimization (#3890)
95baec828f | 2024-04-06 17:11:41 -07:00 | youkaichao | [Core] Enable out-of-tree model registration (#3871)
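The general shape of out-of-tree registration (#3871) is a registry the core resolves by architecture name, so an external package can add a model without patching the core. A generic sketch of the pattern in plain Python (this is the pattern, not vLLM's actual `ModelRegistry` API):

```python
class ModelRegistry:
    """Maps architecture names to model classes; external code can extend it."""
    _models = {}

    @classmethod
    def register_model(cls, arch, model_cls):
        cls._models[arch] = model_cls

    @classmethod
    def resolve(cls, arch):
        return cls._models[arch]

# An out-of-tree package registers its model class at import time...
class MyCustomForCausalLM:
    pass

ModelRegistry.register_model("MyCustomForCausalLM", MyCustomForCausalLM)
# ...and the core later resolves it by the architecture name alone.
resolved = ModelRegistry.resolve("MyCustomForCausalLM")
```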
e4be7d70bb | 2024-04-06 21:32:30 +00:00 | youkaichao | [CI/Benchmark] Add more iterations and use median for robust latency benchmark (#3889)
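Using the median (#3889) guards the reported latency against warm-up spikes and stragglers that skew a mean. A tiny illustration with Python's statistics module (the latency values are made up):

```python
import statistics

# Hypothetical per-iteration latencies in seconds; one cold-start outlier.
latencies = [0.051, 0.049, 0.050, 0.052, 0.048, 0.950]

mean = statistics.mean(latencies)
median = statistics.median(latencies)
# The single outlier drags the mean far above the typical latency,
# while the median stays near the steady-state value.
```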
54951ac4bf | 2024-04-05 12:02:09 -07:00 | Isotr0py | [Bugfix] Fix incorrect output on OLMo models in tensor parallelism (#3869)
18de883489 | 2024-04-05 10:17:58 -07:00 | SangBin Cho | [Chunked Prefill][4/n] Chunked prefill scheduler (#3853)
1d7c940d74 | 2024-04-05 10:15:42 -07:00 | Thomas Parnell | Add option to completion API to truncate prompt tokens (#3144)
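The completion-API option from #3144 keeps only the final N prompt tokens, so long prompts fit the context window while the most recent text survives. The core behavior reduces to a slice; a minimal sketch (the function name and token IDs are illustrative):

```python
def truncate_prompt(token_ids, truncate_prompt_tokens=None):
    """Keep only the last `truncate_prompt_tokens` tokens; None disables truncation."""
    if truncate_prompt_tokens is None:
        return list(token_ids)
    return list(token_ids[-truncate_prompt_tokens:])

prompt = [101, 7592, 2088, 2003, 1037, 2307, 2173, 102]
truncated = truncate_prompt(prompt, truncate_prompt_tokens=3)
untouched = truncate_prompt(prompt)
```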
cfaf49a167 | 2024-04-05 00:39:17 -07:00 | Woosuk Kwon | [Misc] Define common requirements (#3841)
9edec652e2 | 2024-04-04 23:46:01 -07:00 | Noam Gat | [Bugfix] Fix requirements.txt (#3865)
e0dd4d3589 | 2024-04-04 21:57:33 -07:00 | Cade Daniel | [Misc] Fix linter issues in examples/fp8/quantizer/quantize.py (#3864)
e5043a3e75 | 2024-04-04 21:54:16 -07:00 | Cade Daniel | [Misc] Add pytest marker to opt out of global test cleanup (#3863)
d03d64fd2e | 2024-04-04 21:53:16 -07:00 | youkaichao | [CI/Build] Fix pip cache with vllm_nccl & refactor Dockerfile to build wheels (#3859)
78107fa091 | 2024-04-04 21:52:01 -07:00 | Sean Gallen | [Doc] Add asynchronous engine arguments to documentation (#3810)
    Co-authored-by: Simon Mo <simon.mo@hey.com>
    Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
c391e4b68e | 2024-04-04 16:52:12 -07:00 | youkaichao | [Core] Improve robustness of pynccl (#3860)
9117f892f0 | 2024-04-04 13:31:49 -07:00 | Saurabh Dash | [Model] Cohere CommandR+ (#3829)
db2a6a41e2 | 2024-04-04 19:49:49 +00:00 | Michael Goin | [Hardware][CPU] Update CPU torch to match default of 2.2.1 (#3854)
ca81ff5196 | 2024-04-04 10:26:19 -07:00 | youkaichao | [Core] Manage NCCL via a PyPI package & upgrade to PyTorch 2.2.1 (#3805)
b7782002e1 | 2024-04-04 09:56:22 +00:00 | TianYu GUO | [Benchmark] Refactor sample_requests in benchmark_throughput (#3613)
    Co-authored-by: Roger Wang <ywang@roblox.com>
819a309c0f | 2024-04-04 07:41:05 +00:00 | Chang Su | [Bugfix] Fix args in benchmark_serving (#3836)
    Co-authored-by: Roger Wang <ywang@roblox.com>
aabe8f40f2 | 2024-04-03 21:52:18 -07:00 | Matthias Gerstgrasser | [Core][Frontend] Make detokenization optional (#3749)
    Co-authored-by: Nick Hill <nickhill@us.ibm.com>
498eb5cfa3 | 2024-04-04 04:33:08 +00:00 | Woosuk Kwon | [Bugfix] Add kv_scale input parameter to CPU backend (#3840)
537ee25f43 | 2024-04-04 04:02:43 +00:00 | Michael Feil | [Core] Enable hf_transfer by default if available (#3817)
294f8f6665 | 2024-04-03 20:31:46 -07:00 | Tao He | [BugFix] Pass tokenizer_config to local_tokenizer_group (#3754)
    Signed-off-by: Tao He <sighingnow@gmail.com>
b95047f2da | 2024-04-03 15:46:10 -07:00 | Woosuk Kwon | [Misc] Publish 3rd meetup slides (#3835)
2ff767b513 | 2024-04-03 14:15:55 -07:00 | Adrian Abeyta | Enable scaled FP8 (e4m3fn) KV cache on ROCm (AMD GPU) (#3290)
    Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
    Co-authored-by: HaiShaw <hixiao@gmail.com>
    Co-authored-by: AdrianAbeyta <Adrian.Abeyta@amd.com>
    Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com>
    Co-authored-by: root <root@gt-pla-u18-08.pla.dcgpu>
    Co-authored-by: mawong-amd <156021403+mawong-amd@users.noreply.github.com>
    Co-authored-by: ttbachyinsda <ttbachyinsda@outlook.com>
    Co-authored-by: guofangze <guofangze@kuaishou.com>
    Co-authored-by: Michael Goin <mgoin64@gmail.com>
    Co-authored-by: jacobthebanana <50071502+jacobthebanana@users.noreply.github.com>
    Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
3dcb3e8b98 | 2024-04-03 14:13:49 -07:00 | SangBin Cho | [3/N] Refactor scheduler for chunked prefill scheduling (#3550)
c64cf38673 | 2024-04-03 07:31:43 +00:00 | Michael Feil | [Doc] Update contribution guidelines for better onboarding (#3819)
76b889bf1d | 2024-04-02 23:11:10 -07:00 | Robert Shaw | [Doc] Update README.md (#3806)
c9b506dad4 | 2024-04-02 23:06:25 -07:00 | Nick Hill | [BugFix] Use different mechanism to get vllm version in is_cpu() (#3804)
5757d90e26 | 2024-04-03 00:40:57 +00:00 | Cade Daniel | [Speculative decoding] Add configuration object for speculative decoding (#3706)
    Co-authored-by: Lily Liu <lilyliupku@gmail.com>
a3c226e7eb | 2024-04-02 12:57:04 -07:00 | youkaichao | [CI/Build] 0.4.0.post1, fix sm 7.0/7.5 binary (#3803)
b321d4881b | 2024-04-02 12:35:31 -07:00 | Michael Goin | [Bugfix] Add __init__.py files for vllm/core/block/ and vllm/spec_decode/ (#3798)