youkaichao
|
95baec828f
|
[Core] enable out-of-tree model register (#3871)
|
2024-04-06 17:11:41 -07:00 |
|
SangBin Cho
|
18de883489
|
[Chunked Prefill][4/n] Chunked prefill scheduler. (#3853)
|
2024-04-05 10:17:58 -07:00 |
|
Cade Daniel
|
e5043a3e75
|
[Misc] Add pytest marker to opt-out of global test cleanup (#3863)
|
2024-04-04 21:54:16 -07:00 |
|
Matthias Gerstgrasser
|
aabe8f40f2
|
[Core] [Frontend] Make detokenization optional (#3749)
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
|
2024-04-03 21:52:18 -07:00 |
|
Michael Feil
|
537ee25f43
|
[Core] Enable hf_transfer by default if available (#3817)
|
2024-04-04 04:02:43 +00:00 |
|
Adrian Abeyta
|
2ff767b513
|
Enable scaled FP8 (e4m3fn) KV cache on ROCm (AMD GPU) (#3290)
Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Co-authored-by: HaiShaw <hixiao@gmail.com>
Co-authored-by: AdrianAbeyta <Adrian.Abeyta@amd.com>
Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com>
Co-authored-by: root <root@gt-pla-u18-08.pla.dcgpu>
Co-authored-by: mawong-amd <156021403+mawong-amd@users.noreply.github.com>
Co-authored-by: ttbachyinsda <ttbachyinsda@outlook.com>
Co-authored-by: guofangze <guofangze@kuaishou.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: jacobthebanana <50071502+jacobthebanana@users.noreply.github.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
|
2024-04-03 14:15:55 -07:00 |
|
SangBin Cho
|
3dcb3e8b98
|
[3/N] Refactor scheduler for chunked prefill scheduling (#3550)
|
2024-04-03 14:13:49 -07:00 |
|
Cade Daniel
|
5757d90e26
|
[Speculative decoding] Adding configuration object for speculative decoding (#3706)
Co-authored-by: Lily Liu <lilyliupku@gmail.com>
|
2024-04-03 00:40:57 +00:00 |
|
Cade Daniel
|
eb69d68804
|
[Misc] [CI/Build] Speed up block manager CPU-only unit tests ~10x by opting-out of GPU cleanup (#3783)
|
2024-04-02 00:49:51 +00:00 |
|
Qubitium
|
7d4e1b85e7
|
[Misc] Add support for new autogptq checkpoint_format (#3689)
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
|
2024-04-01 19:32:01 -04:00 |
|
Cade Daniel
|
93deb0b38f
|
[Speculative decoding 4/9] Lookahead scheduling for speculative decoding (#3250)
|
2024-04-01 22:55:24 +00:00 |
|
Nick Hill
|
49782fcb76
|
[Misc] Some minor simplifications to detokenization logic (#3670)
Some simplifications made for clarity.
Also moves detokenization-related functions from tokenizer.py to detokenizer.py.
|
2024-04-01 13:22:06 -07:00 |
|
Robert Shaw
|
563c1d7ec5
|
[CI/Build] Make Marlin Tests Green (#3753)
|
2024-03-30 19:18:34 -07:00 |
|
mawong-amd
|
b6d103542c
|
[Kernel] Layernorm performance optimization (#3662)
|
2024-03-30 14:26:38 -07:00 |
|
Roy
|
f510395bbf
|
[BugFix][Frontend] Fix completion logprobs=0 error (#3731)
|
2024-03-29 09:38:21 -07:00 |
|
Roy
|
6110c39dc8
|
[BugFix] Fix tokenizer out of vocab size (#3685)
|
2024-03-29 08:18:59 -07:00 |
|
youkaichao
|
756b30a5f3
|
[Core][Test] move local_rank to the last arg with default value (#3711)
[Core][Test] move local_rank to the last arg with default value to keep api compatible (#3711)
|
2024-03-28 21:19:45 -07:00 |
|
SangBin Cho
|
26422e477b
|
[Test] Make model tests run again and remove --forked from pytest (#3631)
Co-authored-by: Simon Mo <simon.mo@hey.com>
|
2024-03-28 21:06:40 -07:00 |
|
Roy
|
515386ef3c
|
[Core] Support multi-node inference (eager and cuda graph) (#3686)
|
2024-03-28 15:01:55 -07:00 |
|
SangBin Cho
|
b51c1cc9d2
|
[2/N] Chunked prefill data update (#3538)
|
2024-03-28 10:06:01 -07:00 |
|
Cade Daniel
|
14ccd94c89
|
[Core][Bugfix] Refactor block manager for better testability (#3492)
|
2024-03-27 23:59:28 -07:00 |
|
Roger Wang
|
45b6ef6513
|
feat(benchmarks): Add Prefix Caching Benchmark to Serving Benchmark (#3277)
|
2024-03-27 13:39:26 -07:00 |
|
youkaichao
|
8f44facddd
|
[Core] remove cupy dependency (#3625)
|
2024-03-27 00:33:26 -07:00 |
|
Jee Li
|
566b57c5c4
|
[Kernel] support non-zero cuda devices in punica kernels (#3636)
|
2024-03-27 00:37:42 +00:00 |
|
Jee Li
|
8af890a865
|
Enable more models to inference based on LoRA (#3382)
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
|
2024-03-25 18:09:31 -07:00 |
|
Nick Hill
|
dfeb2ecc3a
|
[Misc] Include matched stop string/token in responses (#2976)
Co-authored-by: Sahil Suneja <sahilsuneja@gmail.com>
|
2024-03-25 17:31:32 -07:00 |
|
xwjiang2010
|
64172a976c
|
[Feature] Add vision language model support. (#3042)
|
2024-03-25 14:16:30 -07:00 |
|
Simon Mo
|
f408d05c52
|
hotfix isort on logprobs ranks pr (#3622)
|
2024-03-25 11:55:46 -07:00 |
|
Dylan Hawk
|
0b4997e05c
|
[Bugfix] API stream returning two stops (#3450)
Co-authored-by: Dylan Hawk <dylanwawk@gmail.com>
|
2024-03-25 10:14:34 -07:00 |
|
Travis Johnson
|
c13ad1b7bd
|
feat: implement the min_tokens sampling parameter (#3124)
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
|
2024-03-25 10:14:26 -07:00 |
|
Swapnil Parekh
|
819924e749
|
[Core] Adding token ranks along with logprobs (#3516)
Co-authored-by: Swapnil Parekh <swapnilp@ibm.com>
|
2024-03-25 10:13:10 -07:00 |
|
SangBin Cho
|
01bfb22b41
|
[CI] Try introducing isort. (#3495)
|
2024-03-25 07:59:47 -07:00 |
|
Woosuk Kwon
|
925f3332ca
|
[Core] Refactor Attention Take 2 (#3462)
|
2024-03-25 04:39:33 +00:00 |
|
youkaichao
|
837e185142
|
[CI/Build] fix flaky test (#3602)
|
2024-03-24 17:43:05 -07:00 |
|
youkaichao
|
8b268a46a7
|
[CI] typo fix: is_hip --> is_hip() (#3595)
|
2024-03-24 16:03:06 -07:00 |
|
Nick Hill
|
41deac4a3d
|
[BugFix] 1D query fix for MoE models (#3597)
|
2024-03-24 16:00:16 -07:00 |
|
Antoni Baum
|
bfdb1ba5c3
|
[Core] Improve detokenization performance for prefill (#3469)
Co-authored-by: MeloYang <meloyang05@gmail.com>
|
2024-03-22 13:44:12 -07:00 |
|
Thomas Parnell
|
cf2f084d56
|
Dynamic scheduler delay to improve ITL performance (#3279)
Co-authored-by: Jan van Lunteren <jvl@zurich.ibm.com>
|
2024-03-22 12:28:14 -07:00 |
|
Zhuohan Li
|
e90fc21f2e
|
[Hardware][Neuron] Refactor neuron support (#3471)
|
2024-03-22 01:22:17 +00:00 |
|
Roy
|
ea5f14e6ff
|
[Bugfix][Model] Fix Qwen2 (#3554)
|
2024-03-22 00:18:58 +00:00 |
|
Roy
|
f1c0fc3919
|
Migrate logits computation and gather to model_runner (#3233)
|
2024-03-20 23:25:01 +00:00 |
|
SangBin Cho
|
6e435de766
|
[1/n][Chunked Prefill] Refactor input query shapes (#3236)
|
2024-03-20 14:46:05 -07:00 |
|
Antoni Baum
|
426ec4ec67
|
[1/n] Triton sampling kernel (#3186)
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
|
2024-03-20 14:45:08 -07:00 |
|
Woosuk Kwon
|
5ee14494e4
|
[Misc] Remove cache stream and cache events (#3461)
|
2024-03-20 00:38:53 -07:00 |
|
ElizaWszola
|
9474e89ba4
|
[PREFIX CACHING FOLLOW UP] A bunch of fixes to block allocator performance when automatic prefix caching is disabled (#3357)
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
|
2024-03-20 00:11:11 -07:00 |
|
Robert Shaw
|
097aa0ea22
|
[CI/Build] Fix Bad Import In Test (#3473)
|
2024-03-18 20:28:00 +00:00 |
|
Simon Mo
|
120157fd2a
|
Support arbitrary json_object in OpenAI and Context Free Grammar (#3211)
|
2024-03-16 13:35:27 -07:00 |
|
simon-mo
|
ad50bf4b25
|
fix lint
|
2024-03-15 22:23:38 -07:00 |
|
Tao He
|
3123f15138
|
Fixes the incorrect argument in the prefix-prefill test cases (#3246)
|
2024-03-15 20:58:10 -07:00 |
|
Antoni Baum
|
fb96c1e98c
|
Asynchronous tokenization (#2879)
|
2024-03-15 23:37:01 +00:00 |
|