Author | Commit | Subject | Date
zhaotyer | c2e00af523 | [Bugfix] fix utils.py/merge_dict func TypeError: 'type' object is not subscriptable (#3955) (Co-authored-by: tianyi_zhao <tianyi.zhao@transwarp.io>) | 2024-04-10 04:49:11 +00:00
Zedong Peng | c013d32c75 | [Benchmark] Add cpu options to bench scripts (#3915) | 2024-04-09 21:30:03 -07:00
Jee Li | 11dd6ebb89 | [Misc] Avoid loading incorrect LoRA config (#3777) | 2024-04-09 19:47:15 -07:00
Juan Villamizar | 6c0b04515f | [ROCm][Hardware][AMD] Use Triton Kernel for default FA on ROCm (#3643) (Co-authored-by: jpvillam <jpvillam@amd.com>, Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>, Woosuk Kwon <woosuk.kwon@berkeley.edu>) | 2024-04-09 15:10:47 -07:00
Junichi Sato | e23a43aef8 | [Bugfix] Fix KeyError on loading GPT-NeoX (#3925) | 2024-04-09 12:11:31 -07:00
Cade Daniel | e7c7067b45 | [Misc] [Core] Implement RFC "Augment BaseExecutor interfaces to enable hardware-agnostic speculative decoding" (#3837) | 2024-04-09 11:44:15 -07:00
youkaichao | 6d592eb430 | [Core] separate distributed_init from worker (#3904) | 2024-04-09 08:49:02 +00:00
Roy | d036198e23 | [BugFix][Model] Fix commandr RoPE max_position_embeddings (#3919) | 2024-04-09 06:17:21 +08:00
Matt Wong | 59a6abf3c9 | [Hotfix][CI/Build][Kernel] CUDA 11.8 does not support layernorm optimizations (#3782) | 2024-04-08 14:31:02 -07:00
Kiran R | bc0c0192d1 | [Bugfix] Enable Proper attention_bias Usage in Llama Model Configuration (#3767) (Co-authored-by: roy <jasonailu87@gmail.com>) | 2024-04-08 19:42:35 +00:00
egortolmachev | f46864d68d | [Bugfix] Added Command-R GPTQ support (#3849) (Co-authored-by: Egor Tolmachev <t333ga@gmail.com>) | 2024-04-08 14:59:38 +00:00
ywfang | b4543c8f6b | [Model] add minicpm (#3893) | 2024-04-08 18:28:36 +08:00
Isotr0py | 0ce0539d47 | [Bugfix] Fix Llava inference with Tensor Parallelism. (#3883) | 2024-04-07 22:54:13 +08:00
youkaichao | 2f19283549 | [Core] latency optimization (#3890) | 2024-04-06 19:14:06 -07:00
youkaichao | 95baec828f | [Core] enable out-of-tree model register (#3871) | 2024-04-06 17:11:41 -07:00
youkaichao | e4be7d70bb | [CI/Benchmark] add more iteration and use median for robust latency benchmark (#3889) | 2024-04-06 21:32:30 +00:00
Isotr0py | 54951ac4bf | [Bugfix] Fix incorrect output on OLMo models in Tensor Parallelism (#3869) | 2024-04-05 12:02:09 -07:00
SangBin Cho | 18de883489 | [Chunked Prefill][4/n] Chunked prefill scheduler. (#3853) | 2024-04-05 10:17:58 -07:00
Thomas Parnell | 1d7c940d74 | Add option to completion API to truncate prompt tokens (#3144) | 2024-04-05 10:15:42 -07:00
Woosuk Kwon | cfaf49a167 | [Misc] Define common requirements (#3841) | 2024-04-05 00:39:17 -07:00
Noam Gat | 9edec652e2 | [Bugfix] Fixing requirements.txt (#3865) | 2024-04-04 23:46:01 -07:00
Cade Daniel | e0dd4d3589 | [Misc] Fix linter issues in examples/fp8/quantizer/quantize.py (#3864) | 2024-04-04 21:57:33 -07:00
Cade Daniel | e5043a3e75 | [Misc] Add pytest marker to opt-out of global test cleanup (#3863) | 2024-04-04 21:54:16 -07:00
youkaichao | d03d64fd2e | [CI/Build] fix pip cache with vllm_nccl & refactor dockerfile to build wheels (#3859) | 2024-04-04 21:53:16 -07:00
Sean Gallen | 78107fa091 | [Doc]Add asynchronous engine arguments to documentation. (#3810) (Co-authored-by: Simon Mo <simon.mo@hey.com>, Roger Wang <136131678+ywang96@users.noreply.github.com>) | 2024-04-04 21:52:01 -07:00
youkaichao | c391e4b68e | [Core] improve robustness of pynccl (#3860) | 2024-04-04 16:52:12 -07:00
Saurabh Dash | 9117f892f0 | [Model] Cohere CommandR+ (#3829) | 2024-04-04 13:31:49 -07:00
Michael Goin | db2a6a41e2 | [Hardware][CPU] Update cpu torch to match default of 2.2.1 (#3854) | 2024-04-04 19:49:49 +00:00
youkaichao | ca81ff5196 | [Core] manage nccl via a pypi package & upgrade to pt 2.2.1 (#3805) | 2024-04-04 10:26:19 -07:00
TianYu GUO | b7782002e1 | [Benchmark] Refactor sample_requests in benchmark_throughput (#3613) (Co-authored-by: Roger Wang <ywang@roblox.com>) | 2024-04-04 09:56:22 +00:00
Chang Su | 819a309c0f | [Bugfix] Fix args in benchmark_serving (#3836) (Co-authored-by: Roger Wang <ywang@roblox.com>) | 2024-04-04 07:41:05 +00:00
Matthias Gerstgrasser | aabe8f40f2 | [Core] [Frontend] Make detokenization optional (#3749) (Co-authored-by: Nick Hill <nickhill@us.ibm.com>) | 2024-04-03 21:52:18 -07:00
Woosuk Kwon | 498eb5cfa3 | [Bugfix] Add kv_scale input parameter to CPU backend (#3840) | 2024-04-04 04:33:08 +00:00
Michael Feil | 537ee25f43 | [Core] Enable hf_transfer by default if available (#3817) | 2024-04-04 04:02:43 +00:00
Tao He | 294f8f6665 | [BugFix] Pass tokenizer_config to local_tokenizer_group (#3754) (Signed-off-by: Tao He <sighingnow@gmail.com>) | 2024-04-03 20:31:46 -07:00
Woosuk Kwon | b95047f2da | [Misc] Publish 3rd meetup slides (#3835) | 2024-04-03 15:46:10 -07:00
Adrian Abeyta | 2ff767b513 | Enable scaled FP8 (e4m3fn) KV cache on ROCm (AMD GPU) (#3290) (Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>, HaiShaw <hixiao@gmail.com>, AdrianAbeyta <Adrian.Abeyta@amd.com>, Matthew Wong <Matthew.Wong2@amd.com>, root <root@gt-pla-u18-08.pla.dcgpu>, mawong-amd <156021403+mawong-amd@users.noreply.github.com>, ttbachyinsda <ttbachyinsda@outlook.com>, guofangze <guofangze@kuaishou.com>, Michael Goin <mgoin64@gmail.com>, jacobthebanana <50071502+jacobthebanana@users.noreply.github.com>, Woosuk Kwon <woosuk.kwon@berkeley.edu>) | 2024-04-03 14:15:55 -07:00
SangBin Cho | 3dcb3e8b98 | [3/N] Refactor scheduler for chunked prefill scheduling (#3550) | 2024-04-03 14:13:49 -07:00
Michael Feil | c64cf38673 | [Doc] Update contribution guidelines for better onboarding (#3819) | 2024-04-03 07:31:43 +00:00
Robert Shaw | 76b889bf1d | [Doc] Update README.md (#3806) | 2024-04-02 23:11:10 -07:00
Nick Hill | c9b506dad4 | [BugFix] Use different mechanism to get vllm version in is_cpu() (#3804) | 2024-04-02 23:06:25 -07:00
Cade Daniel | 5757d90e26 | [Speculative decoding] Adding configuration object for speculative decoding (#3706) (Co-authored-by: Lily Liu <lilyliupku@gmail.com>) | 2024-04-03 00:40:57 +00:00
youkaichao | a3c226e7eb | [CI/Build] 0.4.0.post1, fix sm 7.0/7.5 binary (#3803) | 2024-04-02 12:57:04 -07:00
Michael Goin | b321d4881b | [Bugfix] Add __init__.py files for vllm/core/block/ and vllm/spec_decode/ (#3798) | 2024-04-02 12:35:31 -07:00
leiwen83 | ad6eca408b | Fix early CUDA init via get_architecture_class_name import (#3770) (Signed-off-by: Lei Wen <wenlei03@qiyi.com>; Co-authored-by: Lei Wen <wenlei03@qiyi.com>) | 2024-04-02 11:56:26 -07:00
youkaichao | 205b94942e | [CI/Build] fix TORCH_CUDA_ARCH_LIST in wheel build (#3801) | 2024-04-02 11:54:33 -07:00
Roger Wang | 3bec41f41a | [Doc] Fix vLLMEngine Doc Page (#3791) | 2024-04-02 09:49:37 -07:00
A-Mahla | 0739b1947f | [Frontend][Bugfix] allow using the default middleware with a root path (#3788) (Co-authored-by: A-Mahla <>) | 2024-04-02 01:20:28 -07:00
bigPYJ1151 | 77a6572aa5 | [HotFix] [CI/Build] Minor fix for CPU backend CI (#3787) | 2024-04-01 22:50:53 -07:00
bigPYJ1151 | 0e3f06fe9c | [Hardware][Intel] Add CPU inference backend (#3634) (Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>, Yuan Zhou <yuan.zhou@intel.com>) | 2024-04-01 22:07:30 -07:00