eigenLiu
48d5985a08
Sync huggingface modifications of qwen Moe model ( #4774 )
2024-05-17 09:43:19 -07:00
Jinzhen Lin
33e0823de5
[Bugfix] fix rope error when load models with different dtypes ( #4835 )
2024-05-17 18:43:34 +09:00
Alexei-V-Ivanov-AMD
26148120b3
[Build/CI] Extending the set of AMD tests with Regression, Basic Correctness, Distributed, Engine, Llava Tests ( #4797 )
2024-05-16 20:58:25 -07:00
bofeng huang
0150a10630
[Frontend] OpenAI API server: Do not add bos token by default when encoding ( #4688 )
2024-05-16 18:47:22 -07:00
Kante Yin
8e7fb5d43a
Support to serve vLLM on Kubernetes with LWS ( #4829 )
...
Signed-off-by: kerthcet <kerthcet@gmail.com>
2024-05-16 16:37:29 -07:00
Woosuk Kwon
9a31a817a8
[Bugfix] Fix FP8 KV cache support ( #4869 )
2024-05-16 22:42:29 +00:00
Tyler Michael Smith
2060e93659
[Kernel] Add w8a8 CUTLASS kernels ( #4749 )
2024-05-16 18:32:50 -04:00
Silencio
8435b207af
[Kernel] Add punica dimension for Qwen1.5-32B LoRA ( #4850 )
...
Co-authored-by: Silencio <silencio@adsl-99-6-187-6.dsl.irvnca.sbcglobal.net>
2024-05-16 11:16:09 -07:00
youkaichao
10fa9eea21
[Misc] remove old comments ( #4866 )
2024-05-16 11:07:41 -07:00
youkaichao
e08188081b
[Core][Distributed] remove graph mode function ( #4818 )
2024-05-16 10:59:52 -07:00
Hongxia Yang
b5853f9963
[ROCm][AMD][Bugfix] adding a missing triton autotune config ( #4845 )
2024-05-16 10:46:52 -07:00
Simon Mo
f09edd8a25
Add JSON output support for benchmark_latency and benchmark_throughput ( #4848 )
2024-05-16 10:02:56 -07:00
Alexander Matveev
6979ade384
Add GPTQ Marlin 2:4 sparse structured support ( #4790 )
...
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
2024-05-16 12:56:15 -04:00
Pierre Dulac
9216b9cc38
[Bugfix] Bypass authorization API token for preflight requests ( #4862 )
2024-05-16 09:42:21 -07:00
Alex Wu
5e0391c040
[Frontend] Separate OpenAI Batch Runner usage from API Server ( #4851 )
2024-05-17 00:42:41 +09:00
Alex Wu
dbc0754ddf
[docs] Fix typo in examples filename openi -> openai ( #4864 )
2024-05-17 00:42:17 +09:00
Jinzhen Lin
99caa49106
[Kernel] add bfloat16 support for gptq marlin kernel ( #4788 )
2024-05-16 09:55:29 -04:00
alexm-nm
5c342570d7
Add marlin unit tests and marlin benchmark script ( #4815 )
2024-05-16 09:36:49 -04:00
Cody Yu
973617ae02
[Speculative decoding][Re-take] Enable TP>1 speculative decoding ( #4840 )
...
Co-authored-by: Cade Daniel <edacih@gmail.com>
Co-authored-by: Cade Daniel <cade@anyscale.com>
2024-05-16 00:53:51 -07:00
Aurick Qiao
30e754390c
[Core] Implement sharded state loader ( #4690 )
...
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-05-15 22:11:54 -07:00
Alex Wu
52f8107cf2
[Frontend] Support OpenAI batch file format ( #4794 )
...
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
2024-05-15 19:13:36 -04:00
Cyrus Leung
fc0d9dfc3a
[Frontend] Re-enable custom roles in Chat Completions API ( #4758 )
2024-05-15 14:58:46 -07:00
Zhuohan Li
361c461a12
[Doc] Highlight the fourth meetup in the README ( #4842 )
2024-05-15 11:38:49 -07:00
zifeitong
a5675d348b
[Bugfix] Properly set distributed_executor_backend in ParallelConfig ( #4816 )
2024-05-15 07:22:09 -07:00
Cyrus Leung
e9cdd2b1e2
[CI/Build] Further decouple HuggingFace implementation from ours during tests ( #4166 )
2024-05-14 23:38:40 -07:00
SangBin Cho
65bf2ac165
[Core][2/N] Model runner refactoring part 2. Combine prepare prefill / decode to a single API ( #4681 )
...
This PR combines prepare_prompt and prepare_decode into a single API. This PR also coelsce the attn metadata for prefill/decode to a single class and allow to slice them when running attn backend.
It also refactors subquery_start_loc which was not refactored in the previous PR
2024-05-15 14:00:10 +09:00
SangBin Cho
8a7cc254a0
Revert "[Kernel] Use flash-attn for decoding ( #3648 )" ( #4820 )
...
Lora 3 & 4 test seems to have illegal memory access failure after this commit;
[2024-05-14 23:51:18,182 E 22 22] logging.cc:101: Unhandled exception: N3c105ErrorE. what(): CUDA error: an illegal memory access was encountered
<br class="Apple-interchange-newline">
Exmaple: https://buildkite.com/vllm/ci/builds/7382#018f793d-1527-4e1c-ab59-c3a34ec55241
This reverts commit 1356df5 .
FILL IN THE PR DESCRIPTION HERE
FIX #xxxx (link existing issues this PR will resolve)
2024-05-15 11:52:45 +09:00
Simon Mo
29bc01bf3b
Add 4th meetup announcement to readme ( #4817 )
2024-05-14 18:33:06 -04:00
Nick Hill
676a99982f
[Core] Add MultiprocessingGPUExecutor ( #4539 )
...
Co-authored-by: SAHIL SUNEJA <suneja@us.ibm.com>
2024-05-14 10:38:59 -07:00
Cyrus Leung
dc72402b57
[Bugfix][Doc] Fix CI failure in docs ( #4804 )
...
This PR fixes the CI failure introduced by #4798 .
The failure originates from having duplicate target names in reST, and is fixed by changing the ref targets to anonymous ones. For more information, see this discussion.
I have also changed the format of the links to be more distinct from each other.
2024-05-15 01:57:08 +09:00
Kuntai Du
ccb63a8245
[Core][Hash][Automatic Prefix caching] Accelerating the hashing function by avoiding deep copies ( #4696 )
2024-05-14 21:34:33 +09:00
Zhuohan Li
c579b750a0
[Doc] Add meetups to the doc ( #4798 )
2024-05-13 18:48:00 -07:00
Cyrus Leung
4bfa7e7f75
[Doc] Add API reference for offline inference ( #4710 )
2024-05-13 17:47:42 -07:00
Zhuohan Li
ac1fbf7fd2
[Doc] Shorten README by removing supported model list ( #4796 )
2024-05-13 16:23:54 -07:00
Philipp Moritz
33d3914b1e
[Bugfix] Fix dynamic FP8 quantization for Mixtral ( #4793 )
2024-05-13 19:00:27 -04:00
Stephen Krider
1356df53bd
[Kernel] Use flash-attn for decoding ( #3648 )
...
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: LiuXiaoxuanPKU <lilyliupku@gmail.com>
2024-05-13 15:50:33 -07:00
Cody Yu
ce532ff45c
[Speculative decoding] Improve n-gram efficiency ( #4724 )
2024-05-13 15:00:13 -07:00
Sanger Steel
8bc68e198c
[Frontend] [Core] perf: Automatically detect vLLM-tensorized model, update tensorizer to version 2.9.0 ( #4208 )
2024-05-13 14:57:07 -07:00
Woosuk Kwon
0fca3cdcf2
[Misc] Enhance attention selector ( #4751 )
2024-05-13 10:47:25 -07:00
SangBin Cho
e7c46b9527
[Scheduler] Warning upon preemption and Swapping ( #4647 )
...
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
2024-05-13 23:50:44 +09:00
Cyrus Leung
350f9e107f
[CI/Build] Move test_utils.py to tests/utils.py ( #4425 )
...
Since #4335 was merged, I've noticed that the definition of ServerRunner in the tests is the same as in the test for OpenAI API. I have moved the class to the test utilities to avoid code duplication. (Although it only has been repeated twice so far, I will add another similar test suite in #4200 which would duplicate the code a third time)
Also, I have moved the test utilities file (test_utils.py) to under the test directory (tests/utils.py), since none of its code is actually used in the main package. Note that I have added __init__.py to each test subpackage and updated the ray.init() call in the test utilities file in order to relative import tests/utils.py.
2024-05-13 23:50:09 +09:00
youkaichao
702bee461f
[Core][Distributed] refactor custom allreduce to support multiple tp groups ( #4754 )
2024-05-12 17:47:59 -07:00
Swapnil Parekh
a7be4d0072
[CORE] Improvement in ranks code ( #4718 )
2024-05-12 17:47:47 -07:00
Robert Shaw
a709e87a4f
[CI/Build] Tweak Marlin Nondeterminism Issues ( #4713 )
2024-05-12 17:46:31 -07:00
Yikang Shen
6eaccb7353
[Model] Add support for IBM Granite Code models ( #4636 )
2024-05-11 21:27:24 -07:00
Chang Su
e254497b66
[Model][Misc] Add e5-mistral-7b-instruct and Embedding API ( #3734 )
2024-05-11 11:30:37 -07:00
youkaichao
4e12131089
[Core][Test] fix function name typo in custom allreduce ( #4750 )
2024-05-10 15:14:40 -07:00
Robert Shaw
fcc2994be6
[CI] Nits for bad initialization of SeqGroup in testing ( #4748 )
2024-05-10 18:01:01 -04:00
heeju-kim2
2e7796f2cf
[Speculative decoding] CUDA graph support ( #4295 )
...
Co-authored-by: Cade Daniel <edacih@gmail.com>
2024-05-10 17:36:25 +00:00
Allen.Dou
706588a77d
[Bugfix] Fix CLI arguments in OpenAI server docs ( #4729 )
2024-05-11 00:00:56 +09:00