26 Commits

Author SHA1 Message Date
Cody Yu
d11bf435a0
[MISC] Consolidate cleanup() and refactor offline_inference_with_prefix.py (#9510) 2024-10-18 14:30:55 -07:00
Kuntai Du
81ede99ca4
[Core] Deprecating block manager v1 and make block manager v2 default (#8704)
Removes block manager v1. This is the first piece of the prefix-caching-centric design. To achieve that design, we need to simplify the code path so that only the v2 block manager is used (it has much higher performance on prefix caching).
2024-10-17 11:38:15 -05:00
sroy745
f3a507f1d3
[Core] Add an environment variable which needs to be set explicitly to allow BlockSpaceManagerV1 (#9149) 2024-10-10 14:17:17 +08:00
youkaichao
18b296fdb2
[core] remove beam search from the core (#9105) 2024-10-07 05:47:04 +00:00
sroy745
5bf8789b2a
[Bugfix] Block manager v2 with preemption and lookahead slots (#8824) 2024-09-29 09:17:45 +08:00
Cody Yu
2deb029d11
[Performance][BlockManagerV2] Mark prefix cache block as computed after schedule (#7822) 2024-08-26 11:24:53 -07:00
Cody Yu
3ac50b47d0
[MISC] Add prefix cache hit rate to metrics (#7606) 2024-08-19 11:52:07 -07:00
Cade Daniel
baa240252e
[Core] Fix edge case in chunked prefill + block manager v2 (#7380) 2024-08-09 23:48:49 +00:00
Zach Zheng
782e53ab59
[Bugfix][fast] Fix the get_num_blocks_touched logic (#6849) 2024-08-08 10:43:30 -07:00
youkaichao
c8a7e93273
[core][scheduler] simplify and improve scheduler (#6867) 2024-07-31 23:51:09 -07:00
Antoni Baum
9ed82e7074
[Misc] Small perf improvements (#6520) 2024-07-19 12:10:56 -07:00
Alexander Matveev
3476ed0809
[Core] Optimize block_manager_v2 vs block_manager_v1 (to make V2 default) (#5602) 2024-07-01 20:10:37 -07:00
Cyrus Leung
0e9164b40a
[mypy] Enable type checking for test directory (#5017) 2024-06-15 04:45:31 +00:00
leiwen83
1b8a0d71cf
[Core][Bugfix]: fix prefix caching for blockv2 (#5364)
Signed-off-by: Lei Wen <wenlei03@qiyi.com>
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
2024-06-14 17:23:56 -07:00
Kaiyang Chen
10c38e3e46
[Misc]: Implement CPU/GPU swapping in BlockManagerV2 (#3834) 2024-06-03 13:37:11 -07:00
afeldman-nm
4238bc82f2
[Core] Cross-attention KV caching and memory-management (towards eventual encoder/decoder model support) (#4837) 2024-05-29 16:09:13 +00:00
Michał Moskal
d4f3985907
[Core] Sliding window for block manager v2 (#4545)
Co-authored-by: Ruth Evans <ruthevans@Ruths-MacBook-Pro.local>
2024-05-28 11:07:07 +09:00
leiwen83
e64fde4b01
[Core][Bugfix]: fix prefix caching for blockv2 (#4764)
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
2024-05-24 10:07:09 -07:00
Cyrus Leung
350f9e107f
[CI/Build] Move test_utils.py to tests/utils.py (#4425)
Since #4335 was merged, I've noticed that the definition of ServerRunner in the tests is the same as in the test for OpenAI API. I have moved the class to the test utilities to avoid code duplication. (Although it only has been repeated twice so far, I will add another similar test suite in #4200 which would duplicate the code a third time)

Also, I have moved the test utilities file (test_utils.py) under the test directory (tests/utils.py), since none of its code is actually used in the main package. Note that I have added __init__.py to each test subpackage and updated the ray.init() call in the test utilities file so that tests/utils.py can be imported with a relative import.
2024-05-13 23:50:09 +09:00
youkaichao
469f85c782
[Core][Optimization] change copy-on-write from dict[int, list] to list (#4648) 2024-05-07 11:06:32 -07:00
leiwen83
24750f4cad
[Core] Enable prefix caching with block manager v2 enabled (#4142)
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
Co-authored-by: Sage Moore <sagemoore@utexas.edu>
2024-05-01 11:20:32 -07:00
Cade Daniel
e95cd87959
[Speculative decoding 6/9] Integrate speculative decoding with LLMEngine (#3894) 2024-04-16 13:09:21 -07:00
Cade Daniel
e7c7067b45
[Misc] [Core] Implement RFC "Augment BaseExecutor interfaces to enable hardware-agnostic speculative decoding" (#3837) 2024-04-09 11:44:15 -07:00
Cade Daniel
eb69d68804
[Misc] [CI/Build] Speed up block manager CPU-only unit tests ~10x by opting-out of GPU cleanup (#3783) 2024-04-02 00:49:51 +00:00
Cade Daniel
93deb0b38f
[Speculative decoding 4/9] Lookahead scheduling for speculative decoding (#3250) 2024-04-01 22:55:24 +00:00
Cade Daniel
14ccd94c89
[Core][Bugfix]Refactor block manager for better testability (#3492) 2024-03-27 23:59:28 -07:00