Commit Graph

103 Commits

Author · SHA1 · Message · Date
ljss
e1054247ba
[Optimization] Implement fused add + RMSNorm (#1667) 2023-11-18 18:18:02 -08:00
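
The fused add + RMSNorm kernel above merges the residual addition and the normalization into a single memory pass. Below is a minimal unfused reference of the same semantics, assuming the standard RMSNorm definition; the function name and return convention are illustrative, not vLLM's kernel API.

```python
import torch

def fused_add_rms_norm_ref(x: torch.Tensor, residual: torch.Tensor,
                           weight: torch.Tensor, eps: float = 1e-6):
    # Unfused reference: add the residual first, then RMS-normalize the sum.
    hidden = x + residual
    variance = hidden.pow(2).mean(dim=-1, keepdim=True)
    normed = hidden * torch.rsqrt(variance + eps) * weight
    # A fused kernel produces both outputs in one pass over memory.
    return normed, hidden
```
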
Woosuk Kwon
8d17774f92
Add AWQ support for all models (#1714) 2023-11-18 17:56:47 -08:00
twaka
e946260cf3
Use get_tensor in safe_open (#1696) 2023-11-18 16:45:18 -08:00
Woosuk Kwon
bb00f66e19
Use quantization_config in HF config (#1695) 2023-11-17 16:23:49 -08:00
Roy
e87557b069
Support Min P Sampler (#1642) 2023-11-17 16:20:49 -08:00
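
Min-p sampling keeps only tokens whose probability is at least `min_p` times the probability of the most likely token. A minimal sketch of that filtering rule, assuming the conventional min-p definition rather than the PR's exact code:

```python
import torch

def apply_min_p(logits: torch.Tensor, min_p: float) -> torch.Tensor:
    # Mask out tokens with prob < min_p * max_prob along the last dim.
    probs = torch.softmax(logits, dim=-1)
    threshold = min_p * probs.max(dim=-1, keepdim=True).values
    return logits.masked_fill(probs < threshold, float("-inf"))
```
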
maximzubkov
521b35f799
Support Microsoft Phi 1.5 (#1664) 2023-11-16 14:28:39 -08:00
twaka
2a2c135b41
Fix loading error when a safetensors file contains an empty tensor (#1687) 2023-11-16 10:38:10 -08:00
Megha Agarwal
b514d3c496
Revert MptConfig to MPTConfig (#1668) 2023-11-16 01:19:39 -08:00
Zhuohan Li
7076fa1c9f
TP/quantization/weight loading refactor part 2 - Refactor quantized linear logic and extend quantization support to all models (#1622)
Refactor the tensor parallelism, quantization, and weight-loading code.

Summary of the new features enabled by this PR:
- **All models** can be quantized with AWQ and SqueezeLLM, and [soon GPTQ](https://github.com/vllm-project/vllm/pull/1580).
- The model-loading code is now much simpler.
- Model parallelism is supported for all MQA/GQA models when the number of key/value heads is smaller than the tensor parallel size (see the sketch after this entry).
2023-11-15 22:50:41 -08:00
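
A minimal sketch of the KV-head case from the last bullet above: when there are fewer key/value heads than tensor-parallel ranks, heads are replicated so that every rank still serves one. The names and weight layout here are assumptions for illustration, not vLLM's actual loader code.

```python
import torch

def kv_shard(kv_weight: torch.Tensor, num_kv_heads: int, head_dim: int,
             tp_rank: int, tp_size: int) -> torch.Tensor:
    # Returns the rows of the KV projection owned by `tp_rank`, assuming
    # kv_weight is laid out as [num_kv_heads * head_dim, hidden_size].
    if num_kv_heads >= tp_size:
        # Enough heads: split them evenly across ranks.
        per_rank = num_kv_heads // tp_size
        start = tp_rank * per_rank * head_dim
        return kv_weight[start:start + per_rank * head_dim]
    # Fewer heads than ranks: a group of tp_size // num_kv_heads
    # consecutive ranks shares a replicated copy of the same head.
    group = tp_size // num_kv_heads
    head = tp_rank // group
    return kv_weight[head * head_dim:(head + 1) * head_dim]
```
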
Woosuk Kwon
054072bee5
[Minor] Move RoPE selection logic to get_rope (#1633) 2023-11-12 16:04:50 -08:00
lirui
eb825c1e74
Fix #1474 - AssertionError: assert param_slice.shape == loaded_weight.shape (#1631) 2023-11-12 15:53:12 -08:00
forpanyang
ab9e8488d5
Add Yi model to quantization support (#1600) 2023-11-09 11:47:14 -08:00
GoHomeToMacDonal
1a2bbc9301
ChatGLM Support (#1261) 2023-11-06 16:09:33 -08:00
Roy
e7f579eb97
Support Yi model (#1567) 2023-11-06 15:26:03 -08:00
Antoni Baum
9f669a9a7c
Support YaRN models (#1264)
Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
Co-authored-by: Viktor Ferenczi <viktor@ferenczi.eu>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2023-11-03 14:12:48 -07:00
Noam Gat
555bdcc5a3
Add logits processor API to sampling params (#1469) 2023-11-03 14:12:15 -07:00
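
The logits processor API lets callers rewrite the raw logits before sampling. A hedged example, assuming the processor is a callable from (previously generated token ids, logits) to logits; verify the exact SamplingParams contract against the docs.

```python
from typing import List
import torch

def no_immediate_repeat(token_ids: List[int],
                        logits: torch.Tensor) -> torch.Tensor:
    # Toy processor: forbid sampling the previous token again.
    if token_ids:
        logits[token_ids[-1]] = float("-inf")
    return logits

# Assumed usage: SamplingParams(logits_processors=[no_immediate_repeat])
```
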
Antoni Baum
9738b84a08
Force PagedAttention V2 for long contexts (#1510) 2023-11-01 16:24:32 -07:00
Woosuk Kwon
1fe0990023
Remove MPTConfig (#1529) 2023-11-01 15:29:05 -07:00
Wenfei Yan
cf8849f2d6
Add MptForCausalLM key to model_loader (#1526) 2023-10-31 15:46:53 -07:00
Antoni Baum
15f5632365
Delay GPU->CPU sync in sampling (#1337) 2023-10-30 09:01:34 -07:00
Woosuk Kwon
aa9af07cac
Fix bias in InternLM (#1501) 2023-10-29 16:24:18 -07:00
ljss
69be658bba
Support repetition_penalty (#1424) 2023-10-29 10:02:41 -07:00
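
repetition_penalty conventionally divides positive logits of already-seen tokens by the penalty and multiplies negative ones (the CTRL/Transformers convention). A sketch under that assumption; it may differ in detail from the PR's implementation.

```python
import torch

def repetition_penalize(logits: torch.Tensor, seen_ids: torch.Tensor,
                        penalty: float) -> torch.Tensor:
    # Push logits of already-generated token ids away from re-selection.
    scores = logits[seen_ids]
    logits[seen_ids] = torch.where(scores > 0, scores / penalty,
                                   scores * penalty)
    return logits
```
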
Qing
28b47d1e49
Add rope_scaling to Aquila model (#1457) 2023-10-29 04:25:21 -07:00
chooper1
1f24755bf8
Support SqueezeLLM (#1326)
Co-authored-by: squeeze-ai-lab <squeezeailab.bair@gmail.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2023-10-21 23:14:59 -07:00
Wang Ran (汪然)
d189170b6c
Remove useless statements (#1408) 2023-10-20 08:52:07 -07:00
Wang Ran (汪然)
a132435204
Fix typo (#1383) 2023-10-16 21:53:37 -07:00
Woosuk Kwon
c1376e0f82
Change scheduler & input tensor shape (#1381) 2023-10-16 17:48:42 -07:00
Zhuohan Li
9d9072a069
Implement prompt logprobs & batched top-k for computing logprobs (#1328)
Co-authored-by: Yunmo Chen <16273544+wanmok@users.noreply.github.com>
2023-10-16 10:56:50 -07:00
Woosuk Kwon
928de46888
Implement PagedAttention V2 (#1348) 2023-10-16 00:59:57 -07:00
Lu Wang
de89472897
Fix the issue for AquilaChat2-* models (#1339) 2023-10-13 11:51:29 -07:00
Woosuk Kwon
e7c8555d06
Bump up transformers version & Remove MistralConfig (#1254) 2023-10-13 10:05:26 -07:00
Woosuk Kwon
875afe38ab
Add blacklist in model checkpoint (#1325) 2023-10-12 01:05:37 -07:00
amaleshvemula
ee8217e5be
Add Mistral to quantization model list (#1278) 2023-10-11 00:26:24 -07:00
twaka
8285736840
Add workaround for AWQ on Turing GPUs (#1252) 2023-10-10 19:48:16 -07:00
yhlskt23
91fce82c6f
Change the timing of sorting logits (#1309) 2023-10-10 19:37:42 -07:00
Zhuohan Li
b95ee898fe
[Minor] Fix comment in mistral.py (#1303) 2023-10-09 19:44:37 -07:00
Zhuohan Li
ba0bfd40e2
TP/quantization/weight loading refactor part 1 - Simplify parallel linear logic (#1181) 2023-10-02 15:36:09 -07:00
Woosuk Kwon
84e4e37d14
[Minor] Fix type annotations (#1238) 2023-10-02 15:28:31 -07:00
Zhuohan Li
a60b353005
Support sharding Llama-2-70B on more than 8 GPUs (#1209)
Co-authored-by: JiCheng <247153481@qq.com>
2023-10-02 15:26:33 -07:00
Woosuk Kwon
a8e98aee0c
Fix Mistral model (#1220) 2023-09-28 10:44:05 -07:00
Chris Bamford
bb1ba58f06
[Mistral] Mistral-7B-v0.1 support (#1196)
Co-authored-by: timlacroix <t@mistral.ai>
2023-09-28 10:41:03 -07:00
Qing
7bedab5748
Add rope_scaling to Qwen (#1210) 2023-09-28 00:49:23 -07:00
Qing
28e616c4e3
Fix Qwen-14B model (#1173) 2023-09-27 16:33:16 -07:00
Lily Liu
21877b0d75
Support Longchat and RoPE scaling (#555)
Co-authored-by: Wing Lian <wing.lian@gmail.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2023-09-27 03:36:02 -07:00
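
Linear RoPE scaling, the technique behind LongChat-style context extension, compresses positions by a constant factor before computing rotary angles. A generic sketch of that idea; parameter names are illustrative, and this is not the PR's exact code.

```python
import torch

def linear_scaled_rope_angles(head_dim: int, max_pos: int,
                              base: float = 10000.0,
                              scaling_factor: float = 2.0) -> torch.Tensor:
    # Rotary angle table with positions divided by `scaling_factor`,
    # so a 2x-scaled model covers twice the original context window.
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float()
                               / head_dim))
    positions = torch.arange(max_pos).float() / scaling_factor
    return torch.outer(positions, inv_freq)  # [max_pos, head_dim // 2]
```
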
Woosuk Kwon
03ffd0a022
Add comments on RoPE initialization (#1176) 2023-09-26 10:48:33 -07:00
Zhuohan Li
f187877945
[FIX] Simplify sampler logic (#1156) 2023-09-23 17:21:56 -07:00
Zhuohan Li
947b794146
[Sampler] Vectorized sampling (simplified) (#1048)
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2023-09-22 17:48:04 -07:00
Antoni Baum
3302f0aef3
Read rope_theta and max_position_embeddings from config (#1096)
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: wnma3mz <wnma3mz@gmail.com>
2023-09-20 13:35:11 -07:00
Woosuk Kwon
2b1c116b5a
Add minimum capability requirement for AWQ (#1064) 2023-09-18 12:02:01 -07:00
Woosuk Kwon
cc796b1358
Convert before transpose (#1073) 2023-09-18 11:51:48 -07:00