youkaichao
|
8a7a3e4436
|
[Core] add an option to log every function call to for debugging hang/crash in distributed inference (#4079)
Co-authored-by: Simon Mo <simon.mo@hey.com>
|
2024-04-18 16:15:12 -07:00 |
|
James Whedbee
|
e1bb2fd52d
|
[Bugfix] Support logprobs when using guided_json and other constrained decoding fields (#4149)
|
2024-04-18 21:12:55 +00:00 |
|
Michał Moskal
|
e8cc7967ff
|
[Bugfix][Kernel] allow non-power-of-two head sizes in prefix prefill (#4128)
|
2024-04-18 00:51:28 -07:00 |
|
Michael Goin
|
53b018edcb
|
[Bugfix] Get available quantization methods from quantization registry (#4098)
|
2024-04-18 00:21:55 -07:00 |
|
youkaichao
|
6dc1fc9cfe
|
[Core] nccl integrity check and test (#4155)
[Core] Add integrity check during initialization; add test for it (#4155)
|
2024-04-17 22:28:52 -07:00 |
|
Shoichi Uchinami
|
a53222544c
|
[Kernel] Add punica dimension for Swallow-MS-7B LoRA (#4134)
|
2024-04-17 10:02:45 -07:00 |
|
youkaichao
|
8438e0569e
|
[Core] RayWorkerVllm --> WorkerWrapper to reduce duplication (#4024)
[Core] replace narrow-usage RayWorkerVllm to general WorkerWrapper to reduce code duplication (#4024)
|
2024-04-17 08:34:33 +00:00 |
|
Cade Daniel
|
e95cd87959
|
[Speculative decoding 6/9] Integrate speculative decoding with LLMEngine (#3894)
|
2024-04-16 13:09:21 -07:00 |
|
Antoni Baum
|
69e1d2fb69
|
[Core] Refactor model loading code (#4097)
|
2024-04-16 11:34:39 -07:00 |
|
Noam Gat
|
05434764cd
|
LM Format Enforcer Guided Decoding Support (#3868)
Co-authored-by: Simon Mo <simon.mo@hey.com>
|
2024-04-16 05:54:57 +00:00 |
|
SangBin Cho
|
4e7ee664e2
|
[Core] Fix engine-use-ray broken (#4105)
|
2024-04-16 05:24:53 +00:00 |
|
Sanger Steel
|
711a000255
|
[Frontend] [Core] feat: Add model loading using tensorizer (#3476)
|
2024-04-13 17:13:01 -07:00 |
|
Jee Li
|
989ae2538d
|
[Kernel] Add punica dimension for Baichuan-13B (#4053)
|
2024-04-13 07:55:05 -07:00 |
|
SangBin Cho
|
36729bac13
|
[Test] Test multiple attn backend for chunked prefill. (#4023)
|
2024-04-12 09:56:57 -07:00 |
|
Jee Li
|
1096717ae9
|
[Core] Support LoRA on quantized models (#4012)
|
2024-04-11 21:02:44 -07:00 |
|
Nick Hill
|
e46a60aa4c
|
[BugFix] Fix handling of stop strings and stop token ids (#3672)
|
2024-04-11 15:34:12 -07:00 |
|
Antoni Baum
|
1e96c3341a
|
Add extra punica sizes to support bigger vocabs (#4015)
|
2024-04-11 22:18:57 +00:00 |
|
Dylan Hawk
|
95e7d4a97c
|
Fix echo/logprob OpenAI completion bug (#3441)
Co-authored-by: Dylan Hawk <dylanwawk@gmail.com>
|
2024-04-11 22:15:50 +00:00 |
|
Antoni Baum
|
a10d3056da
|
[Core] Set linear_weights directly on the layer (#3977)
|
2024-04-11 16:35:51 -04:00 |
|
Kunshang Ji
|
e9da5a40c6
|
[Misc] Add indirection layer for custom ops (#3913)
|
2024-04-10 20:26:07 -07:00 |
|
SangBin Cho
|
e42df7227d
|
[Test] Add xformer and flash attn tests (#3961)
Co-authored-by: Simon Mo <simon.mo@hey.com>
|
2024-04-11 03:09:50 +00:00 |
|
SangBin Cho
|
67b4221a61
|
[Core][5/N] Fully working chunked prefill e2e (#3884)
|
2024-04-10 17:56:48 -07:00 |
|
youkaichao
|
63e7176f26
|
[Core][Refactor] move parallel_utils into vllm/distributed (#3950)
[WIP][Core][Refactor] move vllm/model_executor/parallel_utils into vllm/distributed and vllm/device_communicators (#3950)
|
2024-04-10 15:33:30 -07:00 |
|
Travis Johnson
|
0258b7a94b
|
[Bugfix] handle prompt_logprobs in _apply_min_tokens_penalty (#3876)
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
|
2024-04-10 01:39:56 -07:00 |
|
胡译文
|
b3104b2a10
|
[Bugfix] Fix logits processor when prompt_logprobs is not None (#3899)
|
2024-04-10 00:09:36 -07:00 |
|
Jee Li
|
11dd6ebb89
|
[Misc] Avoid loading incorrect LoRA config (#3777)
|
2024-04-09 19:47:15 -07:00 |
|
Cade Daniel
|
e7c7067b45
|
[Misc] [Core] Implement RFC "Augment BaseExecutor interfaces to enable hardware-agnostic speculative decoding" (#3837)
|
2024-04-09 11:44:15 -07:00 |
|
youkaichao
|
95baec828f
|
[Core] enable out-of-tree model register (#3871)
|
2024-04-06 17:11:41 -07:00 |
|
SangBin Cho
|
18de883489
|
[Chunked Prefill][4/n] Chunked prefill scheduler. (#3853)
|
2024-04-05 10:17:58 -07:00 |
|
Cade Daniel
|
e5043a3e75
|
[Misc] Add pytest marker to opt-out of global test cleanup (#3863)
|
2024-04-04 21:54:16 -07:00 |
|
Matthias Gerstgrasser
|
aabe8f40f2
|
[Core] [Frontend] Make detokenization optional (#3749)
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
|
2024-04-03 21:52:18 -07:00 |
|
Michael Feil
|
537ee25f43
|
[Core] Enable hf_transfer by default if available (#3817)
|
2024-04-04 04:02:43 +00:00 |
|
Adrian Abeyta
|
2ff767b513
|
Enable scaled FP8 (e4m3fn) KV cache on ROCm (AMD GPU) (#3290)
Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Co-authored-by: HaiShaw <hixiao@gmail.com>
Co-authored-by: AdrianAbeyta <Adrian.Abeyta@amd.com>
Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com>
Co-authored-by: root <root@gt-pla-u18-08.pla.dcgpu>
Co-authored-by: mawong-amd <156021403+mawong-amd@users.noreply.github.com>
Co-authored-by: ttbachyinsda <ttbachyinsda@outlook.com>
Co-authored-by: guofangze <guofangze@kuaishou.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: jacobthebanana <50071502+jacobthebanana@users.noreply.github.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
|
2024-04-03 14:15:55 -07:00 |
|
SangBin Cho
|
3dcb3e8b98
|
[3/N] Refactor scheduler for chunked prefill scheduling (#3550)
|
2024-04-03 14:13:49 -07:00 |
|
Cade Daniel
|
5757d90e26
|
[Speculative decoding] Adding configuration object for speculative decoding (#3706)
Co-authored-by: Lily Liu <lilyliupku@gmail.com>
|
2024-04-03 00:40:57 +00:00 |
|
Cade Daniel
|
eb69d68804
|
[Misc] [CI/Build] Speed up block manager CPU-only unit tests ~10x by opting-out of GPU cleanup (#3783)
|
2024-04-02 00:49:51 +00:00 |
|
Qubitium
|
7d4e1b85e7
|
[Misc] Add support for new autogptq checkpoint_format (#3689)
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
|
2024-04-01 19:32:01 -04:00 |
|
Cade Daniel
|
93deb0b38f
|
[Speculative decoding 4/9] Lookahead scheduling for speculative decoding (#3250)
|
2024-04-01 22:55:24 +00:00 |
|
Nick Hill
|
49782fcb76
|
[Misc] Some minor simplifications to detokenization logic (#3670)
Some simplifications made for clarity.
Also moves detokenization-related functions from tokenizer.py to detokenizer.py.
|
2024-04-01 13:22:06 -07:00 |
|
Robert Shaw
|
563c1d7ec5
|
[CI/Build] Make Marlin Tests Green (#3753)
|
2024-03-30 19:18:34 -07:00 |
|
mawong-amd
|
b6d103542c
|
[Kernel] Layernorm performance optimization (#3662)
|
2024-03-30 14:26:38 -07:00 |
|
Roy
|
f510395bbf
|
[BugFix][Frontend] Fix completion logprobs=0 error (#3731)
|
2024-03-29 09:38:21 -07:00 |
|
Roy
|
6110c39dc8
|
[BugFix] Fix tokenizer out of vocab size (#3685)
|
2024-03-29 08:18:59 -07:00 |
|
youkaichao
|
756b30a5f3
|
[Core][Test] move local_rank to the last arg with default value(#3711)
[Core][Test] move local_rank to the last arg with default value to keep api compatible (#3711)
|
2024-03-28 21:19:45 -07:00 |
|
SangBin Cho
|
26422e477b
|
[Test] Make model tests run again and remove --forked from pytest (#3631)
Co-authored-by: Simon Mo <simon.mo@hey.com>
|
2024-03-28 21:06:40 -07:00 |
|
Roy
|
515386ef3c
|
[Core] Support multi-node inference(eager and cuda graph) (#3686)
|
2024-03-28 15:01:55 -07:00 |
|
SangBin Cho
|
b51c1cc9d2
|
[2/N] Chunked prefill data update (#3538)
|
2024-03-28 10:06:01 -07:00 |
|
Cade Daniel
|
14ccd94c89
|
[Core][Bugfix]Refactor block manager for better testability (#3492)
|
2024-03-27 23:59:28 -07:00 |
|
Roger Wang
|
45b6ef6513
|
feat(benchmarks): Add Prefix Caching Benchmark to Serving Benchmark (#3277)
|
2024-03-27 13:39:26 -07:00 |
|
youkaichao
|
8f44facddd
|
[Core] remove cupy dependency (#3625)
|
2024-03-27 00:33:26 -07:00 |
|