Lily Liu
43c413ec57
[Kernel] Use flashinfer for decoding ( #4353 )
...
Co-authored-by: LiuXiaoxuanPKU <llilyliupku@gmail.com>
2024-05-03 15:51:27 -07:00
Sebastian Schoennenbeck
f8e7adda21
Fix/async chat serving ( #2727 )
2024-05-03 11:04:14 -07:00
SangBin Cho
3521ba4f25
[Core][Model runner refactoring 1/N] Refactor attn metadata term ( #4518 )
2024-05-03 10:20:12 -07:00
youkaichao
344a5d0c33
[Core][Distributed] enable allreduce for multiple tp groups ( #4566 )
2024-05-02 17:32:33 -07:00
SangBin Cho
0f8a91401c
[Core] Ignore infeasible swap requests. ( #4557 )
2024-05-02 14:31:20 -07:00
Michał Moskal
32881f3f31
[kernel] fix sliding window in prefix prefill Triton kernel ( #4405 )
...
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
2024-05-02 11:23:37 -07:00
alexm-nm
7038e8b803
[Kernel] Support running GPTQ 8-bit models in Marlin ( #4533 )
2024-05-02 12:56:22 -04:00
youkaichao
2a85f93007
[Core][Distributed] enable multiple tp group ( #4512 )
...
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2024-05-02 04:28:21 +00:00
Ronen Schaffer
5e401bce17
[CI]Add regression tests to ensure the async engine generates metrics ( #4524 )
2024-05-01 19:57:12 -07:00
SangBin Cho
0d62fe58db
[Bug fix][Core] assert num_new_tokens == 1 fails when SamplingParams.n is not 1 and max_tokens is large & Add tests for preemption ( #4451 )
2024-05-01 19:24:13 -07:00
Danny Guinther
b8afa8b95a
[MISC] Rework logger to enable pythonic custom logging configuration to be provided ( #4273 )
2024-05-01 17:34:40 -07:00
sasha0552
c47ba4aaa9
[Bugfix] Add validation for seed ( #4529 )
2024-05-01 19:31:22 +00:00
Nick Hill
a657bfc48a
[Core] Add multiproc_worker_utils for multiprocessing-based workers ( #4357 )
2024-05-01 18:41:59 +00:00
leiwen83
24750f4cad
[Core] Enable prefix caching with block manager v2 enabled ( #4142 )
...
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
Co-authored-by: Sage Moore <sagemoore@utexas.edu>
2024-05-01 11:20:32 -07:00
leiwen83
b38e42fbca
[Speculative decoding] Add ngram prompt lookup decoding ( #4237 )
...
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
2024-05-01 11:13:03 -07:00
SangBin Cho
6f1df80436
[Test] Add ignore_eos test ( #4519 )
2024-05-01 08:45:42 -04:00
Jee Li
d6f4bd7cdd
[Misc]Add customized information for models ( #4132 )
2024-04-30 21:18:14 -07:00
Robert Caulk
c3845d82dc
Allow user to define whitespace pattern for outlines ( #4305 )
2024-04-30 20:48:39 -07:00
Florian Greinacher
a494140433
[Frontend] Support complex message content for chat completions endpoint ( #3467 )
...
Co-authored-by: Lily Liu <lilyliupku@gmail.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
2024-04-30 16:28:46 -07:00
Robert Shaw
111815d482
[Kernel] Support Fp8 Checkpoints (Dynamic + Static) ( #4332 )
...
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: mgoin <michael@neuralmagic.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2024-04-30 21:46:12 +00:00
leiwen83
4bb53e2dde
[BugFix] fix num_lookahead_slots missing in async executor ( #4165 )
...
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
2024-04-30 10:12:59 -07:00
youkaichao
f4f921b7f1
[Core][Distributed] use cpu group to broadcast metadata in cpu ( #4444 )
2024-04-29 13:52:22 -07:00
Robert Shaw
73c8d677e5
[Kernel] Marlin Expansion: Support AutoGPTQ Models with Marlin ( #3922 )
...
Co-authored-by: alexm <alexm@neuralmagic.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
2024-04-29 09:35:34 -07:00
Prashant Gupta
d6e520e170
[Core] Support offline use of local cache for models ( #4374 )
...
Signed-off-by: Prashant Gupta <prashantgupta@us.ibm.com>
Co-authored-by: Travis Johnson <tjohnson31415@gmail.com>
2024-04-27 09:59:55 -07:00
Nick Hill
81661da7b2
[BugFix] Fix min_tokens when eos_token_id is None ( #4389 )
...
Co-authored-by: DefTruth <31974251+deftruth@users.noreply.github.com>
2024-04-27 09:52:46 -07:00
Ruoyu Qin
dfea173148
[Bugfix] Abort requests when the connection to /v1/completions is interrupted ( #4363 )
2024-04-27 09:48:37 -07:00
Roy
7134303cbb
[Bugfix][Core] Fix get decoding config from ray ( #4335 )
2024-04-27 11:30:08 +00:00
Austin Veselka
eefeb16464
[Kernel] Full Tensor Parallelism for LoRA Layers ( #3524 )
...
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2024-04-27 00:03:48 -07:00
Cyrus Leung
8947bc3c15
[Frontend][Bugfix] Disallow extra fields in OpenAI API ( #4355 )
2024-04-27 05:08:24 +00:00
Cody Yu
a62aaf1df5
[Misc][Refactor] Generalize linear_method to be quant_method ( #4373 )
2024-04-26 16:41:14 -04:00
SangBin Cho
603ad84815
[Core] Refactoring sampler and support prompt logprob for chunked prefill ( #4309 )
2024-04-26 13:02:02 +00:00
Cyrus Leung
a74dee9b62
[Bugfix] Fix parameter name in get_tokenizer ( #4107 )
2024-04-25 19:10:48 -07:00
Woosuk Kwon
468d761b32
[Misc] Reduce supported Punica dtypes ( #4304 )
2024-04-23 18:54:33 -07:00
youkaichao
91f50a6fe2
[Core][Distributed] use cpu/gloo to initialize pynccl ( #4248 )
2024-04-23 18:32:19 -07:00
Cyrus Leung
1e8f4252aa
[Bugfix][Frontend] Raise exception when file-like chat template fails to be opened ( #4292 )
2024-04-23 18:19:03 +00:00
James Fleming
2b7949c1c2
AQLM CUDA support ( #3287 )
...
Co-authored-by: mgoin <michael@neuralmagic.com>
2024-04-23 13:59:33 -04:00
Cade Daniel
62b8aebc6f
[Speculative decoding 7/9] Speculative decoding end-to-end correctness tests. ( #3951 )
2024-04-23 08:02:36 +00:00
SangBin Cho
050f285ff6
[Core] Scheduling optimization 2 ( #4280 )
2024-04-23 08:02:11 +00:00
SangBin Cho
ad8d696a99
[Core] Scheduler perf fix ( #4270 )
2024-04-22 21:11:06 +00:00
GeauxEric
a37d815b83
Make initialization of tokenizer and detokenizer optional ( #3748 )
...
Co-authored-by: Yun Ding <yunding@nvidia.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-04-21 22:06:46 +00:00
nunjunj
91528575ec
[Frontend] multiple sampling params support ( #3570 )
2024-04-20 00:11:57 -07:00
Cody Yu
a22cdea371
[Kernel][FP8] Initial support with dynamic per-tensor scaling ( #4118 )
...
Provide an initial support to FP8 computation. This PR is inspired by HuggingFace TGI: huggingface/text-generation-inference#1726
This feature can be enabled with --quantization fp8 or -q fp8 when launching an engine.
Algorithm:
We still load a model checkpoint in FP16/BF16. After the weights are loaded, Fp8LinearMethod calculates the per-tensor scaling factor of weights and quantizes the weights accordingly. The scaling factor will then be stored for future use. Meanwhile, the per-tensor scaling factor for activations is calculated in every forward pass.
Initial Results:
Currently tested Mistral-7B on 1xH100. With prompt length ~5 and decoding length 128:
BF16: 1.47s
FP8: 1.66s
I'll try to use larger models and try to find more performance bottleneck. Meanwhile, you're welcome to try this code.
2024-04-20 04:28:57 +00:00
Ayush Rautwar
138485a82d
[Bugfix] Add fix for JSON whitespace ( #4189 )
...
Co-authored-by: Ubuntu <ubuntu@ip-172-31-13-147.ec2.internal>
2024-04-19 20:49:22 -07:00
Jee Li
d17c8477f1
[Bugfix] Fix LoRA loading check ( #4138 )
...
Co-authored-by: simon-mo <simon.mo@hey.com>
2024-04-19 00:59:54 -07:00
youkaichao
8a7a3e4436
[Core] add an option to log every function call to for debugging hang/crash in distributed inference ( #4079 )
...
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-04-18 16:15:12 -07:00
James Whedbee
e1bb2fd52d
[Bugfix] Support logprobs when using guided_json and other constrained decoding fields ( #4149 )
2024-04-18 21:12:55 +00:00
Michał Moskal
e8cc7967ff
[Bugfix][Kernel] allow non-power-of-two head sizes in prefix prefill ( #4128 )
2024-04-18 00:51:28 -07:00
Michael Goin
53b018edcb
[Bugfix] Get available quantization methods from quantization registry ( #4098 )
2024-04-18 00:21:55 -07:00
youkaichao
6dc1fc9cfe
[Core] nccl integrity check and test ( #4155 )
...
[Core] Add integrity check during initialization; add test for it (#4155 )
2024-04-17 22:28:52 -07:00
Shoichi Uchinami
a53222544c
[Kernel] Add punica dimension for Swallow-MS-7B LoRA ( #4134 )
2024-04-17 10:02:45 -07:00
youkaichao
8438e0569e
[Core] RayWorkerVllm --> WorkerWrapper to reduce duplication ( #4024 )
...
[Core] replace narrow-usage RayWorkerVllm to general WorkerWrapper to reduce code duplication (#4024 )
2024-04-17 08:34:33 +00:00
Cade Daniel
e95cd87959
[Speculative decoding 6/9] Integrate speculative decoding with LLMEngine ( #3894 )
2024-04-16 13:09:21 -07:00
Antoni Baum
69e1d2fb69
[Core] Refactor model loading code ( #4097 )
2024-04-16 11:34:39 -07:00
Noam Gat
05434764cd
LM Format Enforcer Guided Decoding Support ( #3868 )
...
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-04-16 05:54:57 +00:00
SangBin Cho
4e7ee664e2
[Core] Fix engine-use-ray broken ( #4105 )
2024-04-16 05:24:53 +00:00
Sanger Steel
711a000255
[Frontend] [Core] feat: Add model loading using tensorizer ( #3476 )
2024-04-13 17:13:01 -07:00
Jee Li
989ae2538d
[Kernel] Add punica dimension for Baichuan-13B ( #4053 )
2024-04-13 07:55:05 -07:00
SangBin Cho
36729bac13
[Test] Test multiple attn backend for chunked prefill. ( #4023 )
2024-04-12 09:56:57 -07:00
Jee Li
1096717ae9
[Core] Support LoRA on quantized models ( #4012 )
2024-04-11 21:02:44 -07:00
Nick Hill
e46a60aa4c
[BugFix] Fix handling of stop strings and stop token ids ( #3672 )
2024-04-11 15:34:12 -07:00
Antoni Baum
1e96c3341a
Add extra punica sizes to support bigger vocabs ( #4015 )
2024-04-11 22:18:57 +00:00
Dylan Hawk
95e7d4a97c
Fix echo/logprob OpenAI completion bug ( #3441 )
...
Co-authored-by: Dylan Hawk <dylanwawk@gmail.com>
2024-04-11 22:15:50 +00:00
Antoni Baum
a10d3056da
[Core] Set linear_weights directly on the layer ( #3977 )
2024-04-11 16:35:51 -04:00
Kunshang Ji
e9da5a40c6
[Misc] Add indirection layer for custom ops ( #3913 )
2024-04-10 20:26:07 -07:00
SangBin Cho
e42df7227d
[Test] Add xformer and flash attn tests ( #3961 )
...
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-04-11 03:09:50 +00:00
SangBin Cho
67b4221a61
[Core][5/N] Fully working chunked prefill e2e ( #3884 )
2024-04-10 17:56:48 -07:00
youkaichao
63e7176f26
[Core][Refactor] move parallel_utils into vllm/distributed ( #3950 )
...
[WIP][Core][Refactor] move vllm/model_executor/parallel_utils into vllm/distributed and vllm/device_communicators (#3950 )
2024-04-10 15:33:30 -07:00
Travis Johnson
0258b7a94b
[Bugfix] handle prompt_logprobs in _apply_min_tokens_penalty ( #3876 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
2024-04-10 01:39:56 -07:00
胡译文
b3104b2a10
[Bugfix] Fix logits processor when prompt_logprobs is not None ( #3899 )
2024-04-10 00:09:36 -07:00
Jee Li
11dd6ebb89
[Misc] Avoid loading incorrect LoRA config ( #3777 )
2024-04-09 19:47:15 -07:00
Cade Daniel
e7c7067b45
[Misc] [Core] Implement RFC "Augment BaseExecutor interfaces to enable hardware-agnostic speculative decoding" ( #3837 )
2024-04-09 11:44:15 -07:00
youkaichao
95baec828f
[Core] enable out-of-tree model register ( #3871 )
2024-04-06 17:11:41 -07:00
SangBin Cho
18de883489
[Chunked Prefill][4/n] Chunked prefill scheduler. ( #3853 )
2024-04-05 10:17:58 -07:00
Cade Daniel
e5043a3e75
[Misc] Add pytest marker to opt-out of global test cleanup ( #3863 )
2024-04-04 21:54:16 -07:00
Matthias Gerstgrasser
aabe8f40f2
[Core] [Frontend] Make detokenization optional ( #3749 )
...
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2024-04-03 21:52:18 -07:00
Michael Feil
537ee25f43
[Core] Enable hf_transfer by default if available ( #3817 )
2024-04-04 04:02:43 +00:00
Adrian Abeyta
2ff767b513
Enable scaled FP8 (e4m3fn) KV cache on ROCm (AMD GPU) ( #3290 )
...
Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Co-authored-by: HaiShaw <hixiao@gmail.com>
Co-authored-by: AdrianAbeyta <Adrian.Abeyta@amd.com>
Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com>
Co-authored-by: root <root@gt-pla-u18-08.pla.dcgpu>
Co-authored-by: mawong-amd <156021403+mawong-amd@users.noreply.github.com>
Co-authored-by: ttbachyinsda <ttbachyinsda@outlook.com>
Co-authored-by: guofangze <guofangze@kuaishou.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: jacobthebanana <50071502+jacobthebanana@users.noreply.github.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-04-03 14:15:55 -07:00
SangBin Cho
3dcb3e8b98
[3/N] Refactor scheduler for chunked prefill scheduling ( #3550 )
2024-04-03 14:13:49 -07:00
Cade Daniel
5757d90e26
[Speculative decoding] Adding configuration object for speculative decoding ( #3706 )
...
Co-authored-by: Lily Liu <lilyliupku@gmail.com>
2024-04-03 00:40:57 +00:00
Cade Daniel
eb69d68804
[Misc] [CI/Build] Speed up block manager CPU-only unit tests ~10x by opting-out of GPU cleanup ( #3783 )
2024-04-02 00:49:51 +00:00
Qubitium
7d4e1b85e7
[Misc] Add support for new autogptq checkpoint_format ( #3689 )
...
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
2024-04-01 19:32:01 -04:00
Cade Daniel
93deb0b38f
[Speculative decoding 4/9] Lookahead scheduling for speculative decoding ( #3250 )
2024-04-01 22:55:24 +00:00
Nick Hill
49782fcb76
[Misc] Some minor simplifications to detokenization logic ( #3670 )
...
Some simplifications made for clarity.
Also moves detokenization-related functions from tokenizer.py to detokenizer.py.
2024-04-01 13:22:06 -07:00
Robert Shaw
563c1d7ec5
[CI/Build] Make Marlin Tests Green ( #3753 )
2024-03-30 19:18:34 -07:00
mawong-amd
b6d103542c
[Kernel] Layernorm performance optimization ( #3662 )
2024-03-30 14:26:38 -07:00
Roy
f510395bbf
[BugFix][Frontend] Fix completion logprobs=0 error ( #3731 )
2024-03-29 09:38:21 -07:00
Roy
6110c39dc8
[BugFix] Fix tokenizer out of vocab size ( #3685 )
2024-03-29 08:18:59 -07:00
youkaichao
756b30a5f3
[Core][Test] move local_rank to the last arg with default value( #3711 )
...
[Core][Test] move local_rank to the last arg with default value to keep api compatible (#3711 )
2024-03-28 21:19:45 -07:00
SangBin Cho
26422e477b
[Test] Make model tests run again and remove --forked from pytest ( #3631 )
...
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-03-28 21:06:40 -07:00
Roy
515386ef3c
[Core] Support multi-node inference(eager and cuda graph) ( #3686 )
2024-03-28 15:01:55 -07:00
SangBin Cho
b51c1cc9d2
[2/N] Chunked prefill data update ( #3538 )
2024-03-28 10:06:01 -07:00
Cade Daniel
14ccd94c89
[Core][Bugfix]Refactor block manager for better testability ( #3492 )
2024-03-27 23:59:28 -07:00
Roger Wang
45b6ef6513
feat(benchmarks): Add Prefix Caching Benchmark to Serving Benchmark ( #3277 )
2024-03-27 13:39:26 -07:00
youkaichao
8f44facddd
[Core] remove cupy dependency ( #3625 )
2024-03-27 00:33:26 -07:00
Jee Li
566b57c5c4
[Kernel] support non-zero cuda devices in punica kernels ( #3636 )
2024-03-27 00:37:42 +00:00
Jee Li
8af890a865
Enable more models to inference based on LoRA ( #3382 )
...
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2024-03-25 18:09:31 -07:00
Nick Hill
dfeb2ecc3a
[Misc] Include matched stop string/token in responses ( #2976 )
...
Co-authored-by: Sahil Suneja <sahilsuneja@gmail.com>
2024-03-25 17:31:32 -07:00
xwjiang2010
64172a976c
[Feature] Add vision language model support. ( #3042 )
2024-03-25 14:16:30 -07:00
Simon Mo
f408d05c52
hotfix isort on logprobs ranks pr ( #3622 )
2024-03-25 11:55:46 -07:00
Dylan Hawk
0b4997e05c
[Bugfix] API stream returning two stops ( #3450 )
...
Co-authored-by: Dylan Hawk <dylanwawk@gmail.com>
2024-03-25 10:14:34 -07:00
Travis Johnson
c13ad1b7bd
feat: implement the min_tokens sampling parameter ( #3124 )
...
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2024-03-25 10:14:26 -07:00
Swapnil Parekh
819924e749
[Core] Adding token ranks along with logprobs ( #3516 )
...
Co-authored-by: Swapnil Parekh <swapnilp@ibm.com>
2024-03-25 10:13:10 -07:00
SangBin Cho
01bfb22b41
[CI] Try introducing isort. ( #3495 )
2024-03-25 07:59:47 -07:00
Woosuk Kwon
925f3332ca
[Core] Refactor Attention Take 2 ( #3462 )
2024-03-25 04:39:33 +00:00
youkaichao
837e185142
[CI/Build] fix flaky test ( #3602 )
2024-03-24 17:43:05 -07:00
youkaichao
8b268a46a7
[CI] typo fix: is_hip --> is_hip() ( #3595 )
2024-03-24 16:03:06 -07:00
Nick Hill
41deac4a3d
[BugFix] 1D query fix for MoE models ( #3597 )
2024-03-24 16:00:16 -07:00
Antoni Baum
bfdb1ba5c3
[Core] Improve detokenization performance for prefill ( #3469 )
...
Co-authored-by: MeloYang <meloyang05@gmail.com>
2024-03-22 13:44:12 -07:00
Thomas Parnell
cf2f084d56
Dynamic scheduler delay to improve ITL performance ( #3279 )
...
Co-authored-by: Jan van Lunteren <jvl@zurich.ibm.com>
2024-03-22 12:28:14 -07:00
Zhuohan Li
e90fc21f2e
[Hardware][Neuron] Refactor neuron support ( #3471 )
2024-03-22 01:22:17 +00:00
Roy
ea5f14e6ff
[Bugfix][Model] Fix Qwen2 ( #3554 )
2024-03-22 00:18:58 +00:00
Roy
f1c0fc3919
Migrate logits computation and gather to model_runner ( #3233 )
2024-03-20 23:25:01 +00:00
SangBin Cho
6e435de766
[1/n][Chunked Prefill] Refactor input query shapes ( #3236 )
2024-03-20 14:46:05 -07:00
Antoni Baum
426ec4ec67
[1/n] Triton sampling kernel ( #3186 )
...
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
2024-03-20 14:45:08 -07:00
Woosuk Kwon
5ee14494e4
[Misc] Remove cache stream and cache events ( #3461 )
2024-03-20 00:38:53 -07:00
ElizaWszola
9474e89ba4
[PREFIX CACHING FOLLOW UP] A bunch of fixes to block allocator performance when automatic prefix caching is disabled ( #3357 )
...
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2024-03-20 00:11:11 -07:00
Robert Shaw
097aa0ea22
[CI/Build] Fix Bad Import In Test ( #3473 )
2024-03-18 20:28:00 +00:00
Simon Mo
120157fd2a
Support arbitrary json_object in OpenAI and Context Free Grammar ( #3211 )
2024-03-16 13:35:27 -07:00
simon-mo
ad50bf4b25
fix lint
2024-03-15 22:23:38 -07:00
Tao He
3123f15138
Fixes the incorrect argument in the prefix-prefill test cases ( #3246 )
2024-03-15 20:58:10 -07:00
Antoni Baum
fb96c1e98c
Asynchronous tokenization ( #2879 )
2024-03-15 23:37:01 +00:00
陈序
54be8a0be2
Fix assertion failure in Qwen 1.5 with prefix caching enabled ( #3373 )
...
Co-authored-by: Cade Daniel <edacih@gmail.com>
2024-03-14 13:56:57 -07:00
Terry
7e9bd08f60
Add batched RoPE kernel ( #3095 )
2024-03-13 13:45:26 -07:00
Or Sharir
ae0ccb4017
Add missing kernel for CodeLlama-34B on A/H100 (no tensor parallelism) when using Multi-LoRA. ( #3350 )
2024-03-13 12:18:25 -07:00
Woosuk Kwon
602358f8a8
Add kernel for GeGLU with approximate GELU ( #3337 )
2024-03-12 22:06:17 -07:00
Breno Faria
49a3c8662b
Fixes #1556 double free ( #3347 )
2024-03-13 00:30:08 +00:00
Zhuohan Li
4c922709b6
Add distributed model executor abstraction ( #3191 )
2024-03-11 11:03:45 -07:00
Zhuohan Li
2f8844ba08
Re-enable the 80 char line width limit ( #3305 )
2024-03-10 19:49:14 -07:00
Roy
9e8744a545
[BugFix] Fix get tokenizer when using ray ( #3301 )
2024-03-10 19:17:16 -07:00
Terry
0bba88df03
Enhance lora tests with more layer and rank variations ( #3243 )
2024-03-09 17:14:16 -08:00
Cade Daniel
8437bae6ef
[Speculative decoding 3/9] Worker which speculates, scores, and applies rejection sampling ( #3103 )
2024-03-08 23:32:46 -08:00
ElizaWszola
b35cc93420
Fix auto prefix bug ( #3239 )
2024-03-07 16:37:28 -08:00
jacobthebanana
8cbba4622c
Possible fix for conflict between Automated Prefix Caching ( #2762 ) and multi-LoRA support ( #1804 ) ( #3263 )
2024-03-07 23:03:22 +00:00
Woosuk Kwon
2daf23ab0c
Separate attention backends ( #3005 )
2024-03-07 01:45:50 -08:00
Cade Daniel
a33ce60c66
[Testing] Fix core tests ( #3224 )
2024-03-06 01:04:23 -08:00
SangBin Cho
24aecf421a
[Tests] Add block manager and scheduler tests ( #3108 )
2024-03-05 18:23:34 -08:00
Nick Hill
8999ec3c16
Store eos_token_id in Sequence for easy access ( #3166 )
2024-03-05 15:35:43 -08:00
Antoni Baum
ff578cae54
Add health check, make async Engine more robust ( #3015 )
...
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2024-03-04 22:01:40 +00:00
Antoni Baum
22de45235c
Push logprob generation to LLMEngine ( #3065 )
...
Co-authored-by: Avnish Narayan <avnish@anyscale.com>
2024-03-04 19:54:06 +00:00
Sage Moore
ce4f5a29fb
Add Automatic Prefix Caching ( #2762 )
...
Co-authored-by: ElizaWszola <eliza@neuralmagic.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-03-02 00:50:01 -08:00
Robert Shaw
c0c2335ce0
Integrate Marlin Kernels for Int4 GPTQ inference ( #2497 )
...
Co-authored-by: Robert Shaw <114415538+rib-2@users.noreply.github.com>
Co-authored-by: alexm <alexm@neuralmagic.com>
2024-03-01 12:47:51 -08:00
felixzhu555
703e42ee4b
Add guided decoding for OpenAI API server ( #2819 )
...
Co-authored-by: br3no <breno@veltefaria.de>
Co-authored-by: simon-mo <simon.mo@hey.com>
2024-02-29 22:13:08 +00:00
Seonghyeon
bfdcfa6a05
Support starcoder2 architecture ( #3089 )
2024-02-29 00:51:48 -08:00
Woosuk Kwon
929b4f2973
Add LoRA support for Gemma ( #3050 )
2024-02-28 13:03:28 -08:00
Liangfu Chen
3b7178cfa4
[Neuron] Support inference with transformers-neuronx ( #2569 )
2024-02-28 09:34:34 -08:00
Tao He
71bcaf99e2
Enable GQA support in the prefix prefill kernels ( #3007 )
...
Signed-off-by: Tao He <sighingnow@gmail.com>
2024-02-27 01:14:31 -08:00
Dylan Hawk
e0ade06d63
Support logit bias for OpenAI API ( #3027 )
2024-02-27 11:51:53 +08:00
Jared Moore
70f3e8e3a1
Add LogProbs for Chat Completions in OpenAI ( #2918 )
2024-02-26 10:39:34 +08:00
Harry Mellor
ef978fe411
Port metrics from aioprometheus to prometheus_client ( #2730 )
2024-02-25 11:54:00 -08:00
Ronen Schaffer
4caf7044e0
Include tokens from prompt phase in counter_generation_tokens ( #2802 )
2024-02-22 14:00:12 -08:00