Cade Daniel
e7c7067b45
[Misc] [Core] Implement RFC "Augment BaseExecutor interfaces to enable hardware-agnostic speculative decoding" ( #3837 )
2024-04-09 11:44:15 -07:00
SangBin Cho
18de883489
[Chunked Prefill][4/n] Chunked prefill scheduler. ( #3853 )
2024-04-05 10:17:58 -07:00
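A minimal sketch of what the chunked-prefill series surfaces through vLLM's offline `LLM` API, assuming the `enable_chunked_prefill` engine flag this line of work is exposed by; the model and prompt are illustrative:

```python
from vllm import LLM, SamplingParams

# Split long prompt prefills into chunks that the scheduler can batch
# together with decode steps, smoothing inter-token latency under load.
llm = LLM(model="facebook/opt-125m", enable_chunked_prefill=True)

out = llm.generate(["Hello " * 200], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```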
Adrian Abeyta
2ff767b513
Enable scaled FP8 (e4m3fn) KV cache on ROCm (AMD GPU) ( #3290 )
...
Co-authored-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Co-authored-by: HaiShaw <hixiao@gmail.com>
Co-authored-by: AdrianAbeyta <Adrian.Abeyta@amd.com>
Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com>
Co-authored-by: root <root@gt-pla-u18-08.pla.dcgpu>
Co-authored-by: mawong-amd <156021403+mawong-amd@users.noreply.github.com>
Co-authored-by: ttbachyinsda <ttbachyinsda@outlook.com>
Co-authored-by: guofangze <guofangze@kuaishou.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: jacobthebanana <50071502+jacobthebanana@users.noreply.github.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-04-03 14:15:55 -07:00
Cade Daniel
5757d90e26
[Speculative decoding] Adding configuration object for speculative decoding ( #3706 )
...
Co-authored-by: Lily Liu <lilyliupku@gmail.com>
2024-04-03 00:40:57 +00:00
bigPYJ1151
0e3f06fe9c
[Hardware][Intel] Add CPU inference backend ( #3634 )
...
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
Co-authored-by: Yuan Zhou <yuan.zhou@intel.com>
2024-04-01 22:07:30 -07:00
Qubitium
7d4e1b85e7
[Misc] Add support for new autogptq checkpoint_format ( #3689 )
...
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
2024-04-01 19:32:01 -04:00
Cade Daniel
93deb0b38f
[Speculative decoding 4/9] Lookahead scheduling for speculative decoding ( #3250 )
2024-04-01 22:55:24 +00:00
Roger Wang
97356f3c7e
[Bugfix] Command-R Max Model Length ( #3727 )
2024-03-29 12:27:51 -07:00
SangBin Cho
b51c1cc9d2
[2/N] Chunked prefill data update ( #3538 )
2024-03-28 10:06:01 -07:00
Cade Daniel
14ccd94c89
[Core][Bugfix] Refactor block manager for better testability ( #3492 )
2024-03-27 23:59:28 -07:00
Megha Agarwal
e24336b5a7
[Model] Add support for DBRX ( #3660 )
2024-03-27 13:01:46 -07:00
xwjiang2010
64172a976c
[Feature] Add vision language model support. ( #3042 )
2024-03-25 14:16:30 -07:00
SangBin Cho
01bfb22b41
[CI] Try introducing isort. ( #3495 )
2024-03-25 07:59:47 -07:00
Thomas Parnell
cf2f084d56
Dynamic scheduler delay to improve ITL performance ( #3279 )
...
Co-authored-by: Jan van Lunteren <jvl@zurich.ibm.com>
2024-03-22 12:28:14 -07:00
Hanzhi Zhou
f721096d48
[BugFix] Some fixes for custom allreduce kernels ( #2760 )
2024-03-21 23:02:58 -07:00
Zhuohan Li
e90fc21f2e
[Hardware][Neuron] Refactor neuron support ( #3471 )
2024-03-22 01:22:17 +00:00
SangBin Cho
6e435de766
[1/n][Chunked Prefill] Refactor input query shapes ( #3236 )
2024-03-20 14:46:05 -07:00
Nick Hill
7341c77d69
[BugFix] Avoid initializing CUDA too early ( #3487 )
2024-03-18 23:05:20 -07:00
Antoni Baum
fb96c1e98c
Asynchronous tokenization ( #2879 )
2024-03-15 23:37:01 +00:00
陈序
54be8a0be2
Fix assertion failure in Qwen 1.5 with prefix caching enabled ( #3373 )
...
Co-authored-by: Cade Daniel <edacih@gmail.com>
2024-03-14 13:56:57 -07:00
Bo-Wen Wang
b167109ba1
[Fix] Fix quantization="gptq" when using Marlin ( #3319 )
...
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2024-03-12 22:51:42 -07:00
Zhuohan Li
4c922709b6
Add distributed model executor abstraction ( #3191 )
2024-03-11 11:03:45 -07:00
Zhuohan Li
2f8844ba08
Re-enable the 80 char line width limit ( #3305 )
2024-03-10 19:49:14 -07:00
Antoni Baum
22de45235c
Push logprob generation to LLMEngine ( #3065 )
...
Co-authored-by: Avnish Narayan <avnish@anyscale.com>
2024-03-04 19:54:06 +00:00
Philipp Moritz
17c3103c56
Make it easy to profile workers with nsight ( #3162 )
...
Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com>
2024-03-03 16:19:13 -08:00
Sage Moore
ce4f5a29fb
Add Automatic Prefix Caching ( #2762 )
...
Co-authored-by: ElizaWszola <eliza@neuralmagic.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
2024-03-02 00:50:01 -08:00
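A hedged sketch of enabling the prefix-caching feature above via the `enable_prefix_caching` engine flag; model name and prompts are illustrative:

```python
from vllm import LLM, SamplingParams

# Enable automatic prefix caching: KV-cache blocks for a shared prompt
# prefix are computed once and reused by later requests.
llm = LLM(model="facebook/opt-125m", enable_prefix_caching=True)

system = "You are a helpful assistant. Answer briefly.\n"
prompts = [system + "Q: What is paged attention?\nA:",
           system + "Q: What is continuous batching?\nA:"]

# The second request reuses the cached blocks holding `system`.
for out in llm.generate(prompts, SamplingParams(max_tokens=32)):
    print(out.outputs[0].text)
```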
cloudhan
baee28c46c
Reorder kv dtype check to avoid nvcc not found error on AMD platform ( #3104 )
2024-03-02 14:34:48 +08:00
Allen.Dou
29e70e3e88
allow user to choose log level by --log-level instead of fixed 'info'. ( #3109 )
...
Co-authored-by: zixiao <shunli.dsl@alibaba-inc.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
2024-03-01 23:28:41 +00:00
Robert Shaw
c0c2335ce0
Integrate Marlin Kernels for Int4 GPTQ inference ( #2497 )
...
Co-authored-by: Robert Shaw <114415538+rib-2@users.noreply.github.com>
Co-authored-by: alexm <alexm@neuralmagic.com>
2024-03-01 12:47:51 -08:00
Allen.Dou
9289e577ec
add cache_config info to Prometheus metrics. ( #3100 )
2024-02-29 06:15:18 +00:00
Liangfu Chen
3b7178cfa4
[Neuron] Support inference with transformers-neuronx ( #2569 )
2024-02-28 09:34:34 -08:00
zhaoyang-star
57f044945f
Fix nvcc not found in vllm-openai image ( #2781 )
2024-02-22 14:25:07 -08:00
Mark Mozolewski
786b7f18a5
Add code-revision config argument for Hugging Face Hub ( #2892 )
2024-02-17 22:36:53 -08:00
Woosuk Kwon
3711811b1d
Disable custom all reduce by default ( #2808 )
2024-02-08 09:58:03 -08:00
liuyhwangyh
ed70c70ea3
modelscope: fix issue when model parameter is not a model id but a path to the model. ( #2489 )
2024-02-06 09:57:15 -08:00
Kunshang Ji
96b6f475dd
Remove hardcoded device="cuda" to support more devices ( #2503 )
...
Co-authored-by: Jiang Li <jiang1.li@intel.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
2024-02-01 15:46:39 -08:00
zspo
c664b0e683
fix some bugs ( #2689 )
2024-01-31 10:09:23 -08:00
zhaoyang-star
9090bf02e7
Support FP8-E5M2 KV Cache ( #2279 )
...
Co-authored-by: zhaoyang <zhao.yang16@zte.com.cn>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2024-01-28 16:43:54 -08:00
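A sketch of the FP8 KV-cache option introduced above, assuming the `kv_cache_dtype="fp8_e5m2"` engine argument and an illustrative model; requires a CUDA GPU:

```python
from vllm import LLM, SamplingParams

# Store the KV cache in FP8-E5M2 rather than FP16/BF16, roughly halving
# KV-cache memory at a small accuracy cost, so more sequences fit per batch.
llm = LLM(model="facebook/opt-125m", kv_cache_dtype="fp8_e5m2")

out = llm.generate(["The capital of France is"], SamplingParams(max_tokens=8))
print(out[0].outputs[0].text)
```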
Hanzhi Zhou
380170038e
Implement custom all reduce kernels ( #2192 )
2024-01-27 12:46:35 -08:00
Xiang Xu
220a47627b
Use head_dim from config if it exists ( #2622 )
2024-01-27 10:30:49 -08:00
Antoni Baum
9b945daaf1
[Experimental] Add multi-LoRA support ( #1804 )
...
Co-authored-by: Chen Shen <scv119@gmail.com>
Co-authored-by: Shreyas Krishnaswamy <shrekris@anyscale.com>
Co-authored-by: Avnish Narayan <avnish@anyscale.com>
2024-01-23 15:26:37 -08:00
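A minimal sketch of the multi-LoRA API this PR introduced, where one base model serves per-request adapters; the adapter name, id, and path are illustrative placeholders:

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# One base model, many adapters: LoRA weights are applied per request.
llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)

# Adapter name, integer id, and local path are hypothetical.
sql_adapter = LoRARequest("sql-adapter", 1, "/path/to/sql_lora")

outputs = llm.generate(
    ["Translate to SQL: list all active users"],
    SamplingParams(max_tokens=64),
    lora_request=sql_adapter,
)
print(outputs[0].outputs[0].text)
```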
Woosuk Kwon
6ef00b03a2
Enable CUDA graph for GPTQ & SqueezeLLM ( #2318 )
2024-01-03 09:52:29 -08:00
kliuae
1b7c791d60
[ROCm] Fixes for GPTQ on ROCm ( #2180 )
2023-12-18 10:41:04 -08:00
Woosuk Kwon
6f41f0e377
Disable CUDA graph for SqueezeLLM ( #2161 )
2023-12-17 10:24:25 -08:00
Woosuk Kwon
3a765bd5e1
Temporarily enforce eager mode for GPTQ models ( #2154 )
2023-12-17 01:51:12 -08:00
Woosuk Kwon
37ca558103
Optimize model execution with CUDA graph ( #1926 )
...
Co-authored-by: Chen Shen <scv119@gmail.com>
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2023-12-16 21:12:08 -08:00
Roy
eed74a558f
Simplify weight loading logic ( #2133 )
2023-12-16 12:41:23 -08:00
Woosuk Kwon
2acd76f346
[ROCm] Temporarily remove GPTQ ROCm support ( #2138 )
2023-12-15 17:13:58 -08:00
CHU Tianxiang
0fbfc4b81b
Add GPTQ support ( #916 )
2023-12-15 03:04:22 -08:00
Antoni Baum
21d93c140d
Optimize Mixtral with expert parallelism ( #2090 )
2023-12-13 23:55:07 -08:00
Woosuk Kwon
b9bcdc7158
Change the load format to pt for Mixtral ( #2028 )
2023-12-11 10:32:17 -08:00
TJian
6ccc0bfffb
Merge EmbeddedLLM/vllm-rocm into vLLM main ( #1836 )
...
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
Co-authored-by: Amir Balwel <amoooori04@gmail.com>
Co-authored-by: root <kuanfu.liu@akirakan.com>
Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com>
Co-authored-by: kuanfu <kuanfu.liu@embeddedllm.com>
Co-authored-by: miloice <17350011+kliuae@users.noreply.github.com>
2023-12-07 23:16:52 -08:00
Woosuk Kwon
27feead2f8
Refactor Worker & InputMetadata ( #1843 )
2023-11-29 22:16:37 -08:00
boydfd
4bb6b67188
fix RAM OOM when loading large models in tensor parallel mode. ( #1395 )
...
Co-authored-by: ran_lin <rlin@thoughtworks.com>
2023-11-20 19:02:42 -08:00
Woosuk Kwon
be66d9b125
Fix warning msg on quantization ( #1715 )
2023-11-18 21:49:55 -08:00
liuyhwangyh
edb305584b
Support downloading models from www.modelscope.cn ( #1588 )
2023-11-17 20:38:31 -08:00
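A sketch of the ModelScope opt-in above, assuming the `VLLM_USE_MODELSCOPE` environment variable this PR added; the model id is illustrative:

```python
import os

# Opt in to pulling weights from www.modelscope.cn instead of the
# Hugging Face Hub; must be set before vLLM is imported.
os.environ["VLLM_USE_MODELSCOPE"] = "True"

from vllm import LLM

# A ModelScope model id (illustrative).
llm = LLM(model="qwen/Qwen-7B-Chat", trust_remote_code=True)
```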
Woosuk Kwon
bb00f66e19
Use quantization_config in hf config ( #1695 )
2023-11-17 16:23:49 -08:00
Aaron Pham
65ea2ddf17
feat(config): support parsing torch.dtype ( #1641 )
...
Signed-off-by: Aaron <29749331+aarnphm@users.noreply.github.com>
2023-11-16 01:31:06 -08:00
Zhuohan Li
7076fa1c9f
TP/quantization/weight loading refactor part 2 - Refactor quantized linear logic and extend quantization support to all models ( #1622 )
...
Refactor the tensor parallelism, quantization, and weight-loading code.
Summary of the new features enabled by this PR:
- **All models** can be quantized with AWQ and SqueezeLLM, and [soon GPTQ](https://github.com/vllm-project/vllm/pull/1580).
- Model loading code became much simpler.
- Support model parallelism for all MQA/GQA models when the number of key/value heads is smaller than the tensor parallel size.
2023-11-15 22:50:41 -08:00
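A hedged sketch of what the entry above enables: quantization as a model-agnostic flag combined with tensor parallelism; the AWQ checkpoint name is illustrative:

```python
from vllm import LLM, SamplingParams

# Quantization is a generic engine flag rather than per-model code, and
# tensor parallelism tolerates num_kv_heads < tensor_parallel_size.
llm = LLM(
    model="TheBloke/Llama-2-7B-AWQ",  # illustrative AWQ checkpoint
    quantization="awq",
    tensor_parallel_size=2,
)
print(llm.generate(["Hello!"], SamplingParams(max_tokens=16))[0].outputs[0].text)
```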
Sin
0d578228ca
config parser: add ChatGLM2 seq_length to _get_and_verify_max_len ( #1617 )
2023-11-09 19:29:51 -08:00
GoHomeToMacDonal
1a2bbc9301
ChatGLM Support ( #1261 )
2023-11-06 16:09:33 -08:00
Antoni Baum
9f669a9a7c
Support YaRN models ( #1264 )
...
Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
Co-authored-by: Viktor Ferenczi <viktor@ferenczi.eu>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2023-11-03 14:12:48 -07:00
chooper1
1f24755bf8
Support SqueezeLLM ( #1326 )
...
Co-authored-by: squeeze-ai-lab <squeezeailab.bair@gmail.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2023-10-21 23:14:59 -07:00
Woosuk Kwon
c1376e0f82
Change scheduler & input tensor shape ( #1381 )
2023-10-16 17:48:42 -07:00
Zhuohan Li
9d9072a069
Implement prompt logprobs & Batched topk for computing logprobs ( #1328 )
...
Co-authored-by: Yunmo Chen <16273544+wanmok@users.noreply.github.com>
2023-10-16 10:56:50 -07:00
Antoni Baum
ee92b58b3a
Move bfloat16 check to worker ( #1259 )
2023-10-07 22:10:44 -07:00
Federico Cassano
66d18a7fb0
add support for tokenizer revision ( #1163 )
...
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2023-10-02 19:19:46 -07:00
Woosuk Kwon
f936657eb6
Provide default max model length ( #1224 )
2023-09-28 14:44:02 -07:00
Chris Bamford
bb1ba58f06
[Mistral] Mistral-7B-v0.1 support ( #1196 )
...
Co-authored-by: timlacroix <t@mistral.ai>
2023-09-28 10:41:03 -07:00
Woosuk Kwon
a19bc5c628
Automatically configure max_num_batched_tokens ( #1198 )
2023-09-27 16:34:00 -07:00
Lily Liu
21877b0d75
Support Longchat and RoPE scaling ( #555 )
...
Co-authored-by: Wing Lian <wing.lian@gmail.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2023-09-27 03:36:02 -07:00
Woosuk Kwon
9f6be8692e
Fix config for Falcon ( #1164 )
2023-09-23 17:38:43 -07:00
Antoni Baum
3302f0aef3
rope_theta and max_position_embeddings from config ( #1096 )
...
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: wnma3mz <wnma3mz@gmail.com>
2023-09-20 13:35:11 -07:00
Woosuk Kwon
e3e79e9e8a
Implement AWQ quantization support for LLaMA ( #1032 )
...
Co-authored-by: Robert Irvine <robert@seamlessml.com>
Co-authored-by: root <rirv938@gmail.com>
Co-authored-by: Casper <casperbh.96@gmail.com>
Co-authored-by: julian-q <julianhquevedo@gmail.com>
2023-09-16 00:03:37 -07:00
Jasmond L
ab019eea75
Add Model Revision Support ( #1014 )
...
Co-authored-by: Jasmond Loh <Jasmond.Loh@hotmail.com>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2023-09-13 15:20:02 -07:00
Antoni Baum
0bb1e885a0
Make max_model_len configurable ( #972 )
2023-09-12 16:29:19 -07:00
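A minimal sketch of the `max_model_len` argument this PR made configurable; the model and length are illustrative:

```python
from vllm import LLM

# Cap the context window below the model's native maximum; a smaller
# max_model_len shrinks the per-sequence KV-cache reservation.
llm = LLM(model="facebook/opt-125m", max_model_len=512)
```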
Kyujin Cho
898285c9bf
fix: CUDA error when running inference with Falcon-40B base model ( #992 )
2023-09-10 01:39:02 -07:00
Zhuohan Li
c957c741d9
Enable safetensors loading for all models ( #974 )
2023-09-07 15:49:52 -07:00
Wen Sun
621980bdc0
fix: incorrect BigCode attention head count ( #676 )
2023-08-04 10:35:22 -07:00
Zhuohan Li
1b0bd0fe8a
Add Falcon support (new) ( #592 )
2023-08-02 14:04:39 -07:00
Chaofan Lin
aa39e42c5a
fix doc ( #622 )
2023-07-31 13:11:57 -07:00
Zhuohan Li
58a072be15
[Fix] Add model sequence length into model config ( #575 )
2023-07-25 23:46:30 -07:00
Zhuohan Li
6fc2a38b11
Add support for LLaMA-2 ( #505 )
2023-07-20 11:38:27 -07:00
Lily Liu
b4b195b360
fix max seq len ( #489 )
2023-07-17 23:20:20 -07:00
Zhuohan Li
96853af5a8
Optimize MQA Kernel ( #452 )
2023-07-14 20:06:40 -04:00
Woosuk Kwon
ddfdf470ae
Add trust_remote_code arg to get_config ( #405 )
2023-07-08 15:24:17 -07:00
codethazine
a945fcc2ae
Add trust-remote-code flag to handle remote tokenizers ( #364 )
2023-07-07 11:04:58 -07:00
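A sketch of the trust-remote-code opt-in added above, mirroring Hugging Face's `trust_remote_code` argument; the model name is illustrative:

```python
from vllm import LLM

# Models whose config/tokenizer classes live on the Hub need an
# explicit opt-in before their remote code is executed.
llm = LLM(model="mosaicml/mpt-7b", trust_remote_code=True)
```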
Woosuk Kwon
404422f42e
[Model] Add support for MPT ( #334 )
2023-07-03 16:47:53 -07:00
Zhuohan Li
d6fa1be3a8
[Quality] Add code formatter and linter ( #326 )
2023-07-03 11:31:55 -07:00
Lily Liu
dafd924c1f
Raise error for long prompt ( #273 )
2023-06-30 18:48:49 -07:00
Woosuk Kwon
998d9d1509
[Tokenizer] Add tokenizer mode ( #298 )
2023-06-28 14:19:22 -07:00
Woosuk Kwon
4338cc4750
[Tokenizer] Add an option to specify tokenizer ( #284 )
2023-06-28 09:46:58 -07:00
Woosuk Kwon
0b98ba15c7
Change the name to vLLM ( #150 )
2023-06-17 03:07:40 -07:00