Commit Graph

50 Commits

Author SHA1 Message Date
wangshuai09
4e2d95e372
[Hardware][ROCM] using current_platform.is_rocm (#9642)
Signed-off-by: wangshuai09 <391746016@qq.com>
2024-10-28 04:07:00 +00:00
Cody Yu
d11bf435a0
[MISC] Consolidate cleanup() and refactor offline_inference_with_prefix.py (#9510) 2024-10-18 14:30:55 -07:00
Kuntai Du
81ede99ca4
[Core] Deprecating block manager v1 and make block manager v2 default (#8704)
Removing the block manager v1. This is the initial piece of prefix-caching-centric design. In order to achieve prefix-caching-centric design, we need to simplify the code path so that we only use v2 block manager (which has much higher performance on prefix caching).
2024-10-17 11:38:15 -05:00
sroy745
f3a507f1d3
[Core] Add an environment variable which needs to be set explicitly to allow BlockSpaceManagerV1 (#9149) 2024-10-10 14:17:17 +08:00
sroy745
91add85ec4
Fix failing spec decode test (#9054) 2024-10-03 23:07:29 +00:00
Lily Liu
1570203864
[Spec Decode] (1/2) Remove batch expansion (#8839) 2024-10-01 16:04:42 -07:00
Travis Johnson
01b6f9e1f0
[Core][Bugfix] Support prompt_logprobs returned with speculative decoding (#8047)
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
2024-09-24 17:29:56 -07:00
Lily Liu
c6bd70d772
[SpecDec][Misc] Cleanup, remove bonus token logic. (#8701) 2024-09-22 12:34:14 -07:00
Lily Liu
775f00f81e
[Speculative Decoding] Test refactor (#8317)
Co-authored-by: youkaichao <youkaichao@126.com>
2024-09-11 14:07:34 -07:00
Travis Johnson
cc0eaf12b1
[Bugfix] spec decode handle None entries in topk args in create_sequence_group_output (#7232)
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
2024-08-22 09:33:48 -04:00
Abhinav Goyal
a3fce56b88
[Speculative Decoding] EAGLE Implementation with Top-1 proposer (#6830) 2024-08-22 02:42:24 -07:00
Abhinav Goyal
312f761232
[Speculative Decoding] Fixing hidden states handling in batch expansion (#7508) 2024-08-19 17:58:14 -07:00
shangmingc
b67ae00cdb
[Misc] Add quantization config support for speculative model. (#7343) 2024-08-15 19:34:28 -07:00
Wallas Henrique
70b746efcf
[Misc] Deprecation Warning when setting --engine-use-ray (#7424)
Signed-off-by: Wallas Santos <wallashss@ibm.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
Co-authored-by: youkaichao <youkaichao@126.com>
2024-08-14 09:44:27 -07:00
Travis Johnson
99b4cf5f23
[Bugfix] Fix speculative decoding with MLPSpeculator with padded vocabulary (#7218)
Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
2024-08-08 22:08:46 -07:00
Nick Hill
5cf9254a9c
[BugFix] Fix use of per-request seed with pipeline parallel (#6698) 2024-07-30 10:40:08 -07:00
Nick Hill
2cf0df3381
[Bugfix] Fix speculative decode seeded test (#6743) 2024-07-24 08:58:31 -07:00
Nick Hill
c882a7f5b3
[SpecDecoding] Update MLPSpeculator CI tests to use smaller model (#6714) 2024-07-24 07:34:22 +00:00
sroy745
14f91fe67c
[Spec Decode] Disable Log Prob serialization to CPU for spec decoding for both draft and target models. (#6485) 2024-07-20 23:58:58 -07:00
Thomas Parnell
a5314e8698
[Model] RowParallelLinear: pass bias to quant_method.apply (#6327)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2024-07-19 07:15:22 -06:00
Woo-Yeon Lee
a921e86392
[BUGFIX] Raise an error for no draft token case when draft_tp>1 (#6369) 2024-07-19 06:01:09 -07:00
Thomas Parnell
d4201e06d5
[Bugfix] Make spec. decode respect per-request seed. (#6034)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
2024-07-18 19:22:08 -07:00
Alexander Matveev
e76466dde2
[Core] draft_model_runner: Implement prepare_inputs on GPU for advance_step (#6338) 2024-07-17 14:30:28 -07:00
Cody Yu
160e1d8c99
[Misc] Log spec decode metrics (#6454) 2024-07-16 20:37:10 +00:00
Abhinav Goyal
2416b26e11
[Speculative Decoding] Medusa Implementation with Top-1 proposer (#4978) 2024-07-09 18:34:02 -07:00
Swapnil Parekh
4d6ada947c
[CORE] Adding support for insertion of soft-tuned prompts (#4645)
Co-authored-by: Swapnil Parekh <swapnilp@ibm.com>
Co-authored-by: Joe G <joseph.granados@h2o.ai>
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2024-07-09 13:26:36 -07:00
Qubitium-ModelCloud
ee93f4f92a
[CORE] Quantized lm-head Framework (#4442)
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
Co-authored-by: ZX <zx@lbx.dev>
2024-07-02 22:25:17 +00:00
Sirej Dua
15aba081f3
[Speculative Decoding] MLPSpeculator Tensor Parallel support (1/2) (#6050)
Co-authored-by: Sirej Dua <sirej.dua@databricks.com>
Co-authored-by: Sirej Dua <Sirej Dua>
2024-07-02 07:20:29 -07:00
xwjiang2010
98d6682cd1
[VLM] Remove image_input_type from VLM config (#5852)
Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
2024-07-02 07:57:09 +00:00
sroy745
80ca1e6a3a
[Speculative Decoding 2/2 ] Integrate typical acceptance sampler into Spec Decode Worker (#5348) 2024-07-01 00:33:05 -07:00
Thomas Parnell
c2a8ac75e0
[CI/Build] Add E2E tests for MLPSpeculator (#5791)
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
2024-06-26 00:04:08 +00:00
Woo-Yeon Lee
2ce5d6688b
[Speculative Decoding] Support draft model on different tensor-parallel size than target model (#5414) 2024-06-25 09:56:06 +00:00
zifeitong
78687504f7
[Bugfix] AsyncLLMEngine hangs with asyncio.run (#5654) 2024-06-19 13:57:12 -07:00
Cyrus Leung
0e9164b40a
[mypy] Enable type checking for test directory (#5017) 2024-06-15 04:45:31 +00:00
Nick Hill
99dac099ab
[Core][Doc] Default to multiprocessing for single-node distributed case (#5230)
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2024-06-11 11:10:41 -07:00
Cyrus Leung
7a64d24aad
[Core] Support image processor (#4197) 2024-06-02 22:56:41 -07:00
Lily Liu
d5a1697772
[Dynamic Spec Decoding] Minor fix for disabling speculative decoding (#5000) 2024-05-25 10:00:14 -07:00
Alexei-V-Ivanov-AMD
943e72ca56
[Build/CI] Enabling AMD Entrypoints Test (#4834)
Co-authored-by: Alexey Kondratiev <alexey.kondratiev@amd.com>
2024-05-20 11:29:28 -07:00
Cody Yu
973617ae02
[Speculative decoding][Re-take] Enable TP>1 speculative decoding (#4840)
Co-authored-by: Cade Daniel <edacih@gmail.com>
Co-authored-by: Cade Daniel <cade@anyscale.com>
2024-05-16 00:53:51 -07:00
Cyrus Leung
350f9e107f
[CI/Build] Move test_utils.py to tests/utils.py (#4425)
Since #4335 was merged, I've noticed that the definition of ServerRunner in the tests is the same as in the test for OpenAI API. I have moved the class to the test utilities to avoid code duplication. (Although it only has been repeated twice so far, I will add another similar test suite in #4200 which would duplicate the code a third time)

Also, I have moved the test utilities file (test_utils.py) to under the test directory (tests/utils.py), since none of its code is actually used in the main package. Note that I have added __init__.py to each test subpackage and updated the ray.init() call in the test utilities file in order to relative import tests/utils.py.
2024-05-13 23:50:09 +09:00
heeju-kim2
2e7796f2cf
[Speculative decoding] CUDA graph support (#4295)
Co-authored-by: Cade Daniel <edacih@gmail.com>
2024-05-10 17:36:25 +00:00
Cody Yu
f942efb5a3
[Dynamic Spec Decoding] Auto-disable by the running queue size (#4592)
Co-authored-by: Cade Daniel <edacih@gmail.com>
2024-05-08 21:44:00 +00:00
leiwen83
8344f7742b
[Bug fix][Core] fixup ngram not setup correctly (#4551)
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
Co-authored-by: Cade Daniel <edacih@gmail.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>
2024-05-07 11:40:18 -07:00
Cade Daniel
ab50275111
[Speculative decoding] Support target-model logprobs (#4378) 2024-05-03 15:52:01 -07:00
SangBin Cho
3521ba4f25
[Core][Model runner refactoring 1/N] Refactor attn metadata term (#4518) 2024-05-03 10:20:12 -07:00
leiwen83
b38e42fbca
[Speculative decoding] Add ngram prompt lookup decoding (#4237)
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
2024-05-01 11:13:03 -07:00
leiwen83
4bb53e2dde
[BugFix] fix num_lookahead_slots missing in async executor (#4165)
Co-authored-by: Lei Wen <wenlei03@qiyi.com>
2024-04-30 10:12:59 -07:00
Cade Daniel
62b8aebc6f
[Speculative decoding 7/9] Speculative decoding end-to-end correctness tests. (#3951) 2024-04-23 08:02:36 +00:00
Cade Daniel
e95cd87959
[Speculative decoding 6/9] Integrate speculative decoding with LLMEngine (#3894) 2024-04-16 13:09:21 -07:00
Cade Daniel
5757d90e26
[Speculative decoding] Adding configuration object for speculative decoding (#3706)
Co-authored-by: Lily Liu <lilyliupku@gmail.com>
2024-04-03 00:40:57 +00:00