ljss
e1054247ba
[Optimization] Implement fused add RMSNorm ( #1667 )
2023-11-18 18:18:02 -08:00
Woosuk Kwon
8d17774f92
Add AWQ support for all models ( #1714 )
2023-11-18 17:56:47 -08:00
twaka
e946260cf3
Use get_tensor in safe_open ( #1696 )
2023-11-18 16:45:18 -08:00
Woosuk Kwon
bb00f66e19
Use quantization_config in HF config ( #1695 )
2023-11-17 16:23:49 -08:00
Roy
e87557b069
Support Min P Sampler ( #1642 )
2023-11-17 16:20:49 -08:00
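A minimal sketch of the min-p rule this PR adds, written as standalone PyTorch rather than vLLM's actual sampler code: a token survives only if its probability is at least `min_p` times that of the most likely token.

```python
import torch

def apply_min_p(logits: torch.Tensor, min_p: float) -> torch.Tensor:
    # Keep tokens whose probability is at least min_p times the
    # probability of the most likely token; mask out the rest.
    probs = torch.softmax(logits, dim=-1)
    threshold = min_p * probs.max(dim=-1, keepdim=True).values
    return logits.masked_fill(probs < threshold, float("-inf"))
```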
maximzubkov
521b35f799
Support Microsoft Phi 1.5 ( #1664 )
2023-11-16 14:28:39 -08:00
twaka
2a2c135b41
Fix loading error when safetensors contains empty tensor ( #1687 )
2023-11-16 10:38:10 -08:00
Megha Agarwal
b514d3c496
Revert MptConfig to MPTConfig ( #1668 )
2023-11-16 01:19:39 -08:00
Zhuohan Li
7076fa1c9f
TP/quantization/weight loading refactor part 2 - Refactor quantized linear logic and extend quantization support to all models ( #1622 )
Refactor the tensor parallelism, quantization, and weight-loading codes.
Summary of the new features enabled by this PR:
- **All models** can be quantized with AWQ and SqueezeLLM, and [soon GPTQ](https://github.com/vllm-project/vllm/pull/1580).
- Model loading code became much simpler.
- Support model parallelism for all MQA/GQA models when the number of key/value heads is smaller than the tensor parallel size.
2023-11-15 22:50:41 -08:00
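With the refactor above, quantized checkpoints load through the same code path for every architecture. A hedged usage sketch (the checkpoint name is illustrative):

```python
from vllm import LLM

# Any supported architecture can be loaded from an AWQ (or SqueezeLLM)
# checkpoint by naming the quantization method explicitly.
llm = LLM(model="TheBloke/Llama-2-7B-AWQ", quantization="awq")
outputs = llm.generate("Hello, my name is")
print(outputs[0].outputs[0].text)
```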
Woosuk Kwon
054072bee5
[Minor] Move RoPE selection logic to get_rope ( #1633 )
2023-11-12 16:04:50 -08:00
lirui
eb825c1e74
Fix #1474 - AssertionError: assert param_slice.shape == loaded_weight.shape ( #1631 )
2023-11-12 15:53:12 -08:00
forpanyang
ab9e8488d5
Add Yi model to quantization support ( #1600 )
2023-11-09 11:47:14 -08:00
GoHomeToMacDonal
1a2bbc9301
ChatGLM Support ( #1261 )
2023-11-06 16:09:33 -08:00
Roy
e7f579eb97
Support Yi model ( #1567 )
2023-11-06 15:26:03 -08:00
Antoni Baum
9f669a9a7c
Support YaRN models ( #1264 )
Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
Co-authored-by: Viktor Ferenczi <viktor@ferenczi.eu>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2023-11-03 14:12:48 -07:00
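YaRN support is driven by the `rope_scaling` block of the model's Hugging Face config rather than a vLLM flag; a sketch of what such a block looks like (the field values are illustrative):

```python
# Illustrative rope_scaling entry from a YaRN model's config.json:
rope_scaling = {
    "type": "yarn",
    "factor": 16.0,                            # context-extension ratio
    "original_max_position_embeddings": 4096,  # pre-extension window
}
```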
Noam Gat
555bdcc5a3
Added logits processor API to sampling params ( #1469 )
2023-11-03 14:12:15 -07:00
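A logits processor is a callable that receives the token ids generated so far plus the raw logits and returns (possibly modified) logits. A minimal sketch (the banned token id is arbitrary):

```python
import torch
from vllm import SamplingParams

def ban_token(token_ids: list[int], logits: torch.Tensor) -> torch.Tensor:
    # Forbid a hypothetical token id from ever being sampled.
    logits[123] = float("-inf")
    return logits

params = SamplingParams(logits_processors=[ban_token])
```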
Antoni Baum
9738b84a08
Force paged attention v2 for long contexts ( #1510 )
2023-11-01 16:24:32 -07:00
Woosuk Kwon
1fe0990023
Remove MPTConfig ( #1529 )
2023-11-01 15:29:05 -07:00
Wenfei Yan
cf8849f2d6
Add MptForCausalLM key in model_loader ( #1526 )
2023-10-31 15:46:53 -07:00
Antoni Baum
15f5632365
Delay GPU->CPU sync in sampling ( #1337 )
2023-10-30 09:01:34 -07:00
Woosuk Kwon
aa9af07cac
Fix bias in InternLM ( #1501 )
2023-10-29 16:24:18 -07:00
ljss
69be658bba
Support repetition_penalty ( #1424 )
2023-10-29 10:02:41 -07:00
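The standard (CTRL-style) repetition penalty, sketched outside vLLM for clarity: logits of tokens that already appeared are divided by the penalty when positive and multiplied by it when negative, so `penalty > 1.0` discourages repetition.

```python
import torch

def apply_repetition_penalty(logits: torch.Tensor,
                             generated_ids: list[int],
                             penalty: float) -> torch.Tensor:
    # Penalize every token id that has already been generated.
    scores = logits[generated_ids]
    logits[generated_ids] = torch.where(
        scores > 0, scores / penalty, scores * penalty)
    return logits
```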
Qing
28b47d1e49
Add rope_scaling to Aquila model ( #1457 )
2023-10-29 04:25:21 -07:00
chooper1
1f24755bf8
Support SqueezeLLM ( #1326 )
Co-authored-by: squeeze-ai-lab <squeezeailab.bair@gmail.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2023-10-21 23:14:59 -07:00
Wang Ran (汪然)
d189170b6c
Remove useless statements ( #1408 )
2023-10-20 08:52:07 -07:00
Wang Ran (汪然)
a132435204
Fix typo ( #1383 )
2023-10-16 21:53:37 -07:00
Woosuk Kwon
c1376e0f82
Change scheduler & input tensor shape ( #1381 )
2023-10-16 17:48:42 -07:00
Zhuohan Li
9d9072a069
Implement prompt logprobs & batched top-k for computing logprobs ( #1328 )
Co-authored-by: Yunmo Chen <16273544+wanmok@users.noreply.github.com>
2023-10-16 10:56:50 -07:00
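A usage sketch of the two knobs this PR wires into `SamplingParams` (the values are illustrative):

```python
from vllm import SamplingParams

# logprobs: return the top-k candidate log-probabilities for each
# generated token; prompt_logprobs: also score the prompt tokens.
params = SamplingParams(logprobs=5, prompt_logprobs=1)
```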
Woosuk Kwon
928de46888
Implement PagedAttention V2 ( #1348 )
2023-10-16 00:59:57 -07:00
Lu Wang
de89472897
Fix the issue for AquilaChat2-* models ( #1339 )
2023-10-13 11:51:29 -07:00
Woosuk Kwon
e7c8555d06
Bump up transformers version & Remove MistralConfig ( #1254 )
2023-10-13 10:05:26 -07:00
Woosuk Kwon
875afe38ab
Add blacklist in model checkpoint ( #1325 )
2023-10-12 01:05:37 -07:00
amaleshvemula
ee8217e5be
Add Mistral to quantization model list ( #1278 )
2023-10-11 00:26:24 -07:00
twaka
8285736840
Add workaround for AWQ on Turing GPUs ( #1252 )
2023-10-10 19:48:16 -07:00
yhlskt23
91fce82c6f
Change the timing of sorting logits ( #1309 )
2023-10-10 19:37:42 -07:00
Zhuohan Li
b95ee898fe
[Minor] Fix comment in mistral.py ( #1303 )
2023-10-09 19:44:37 -07:00
Zhuohan Li
ba0bfd40e2
TP/quantization/weight loading refactor part 1 - Simplify parallel linear logic ( #1181 )
2023-10-02 15:36:09 -07:00
Woosuk Kwon
84e4e37d14
[Minor] Fix type annotations ( #1238 )
2023-10-02 15:28:31 -07:00
Zhuohan Li
a60b353005
Support sharding llama2-70b on more than 8 GPUs ( #1209 )
Co-authored-by: JiCheng <247153481@qq.com>
2023-10-02 15:26:33 -07:00
Woosuk Kwon
a8e98aee0c
Fix Mistral model ( #1220 )
2023-09-28 10:44:05 -07:00
Chris Bamford
bb1ba58f06
[Mistral] Mistral-7B-v0.1 support ( #1196 )
Co-authored-by: timlacroix <t@mistral.ai>
2023-09-28 10:41:03 -07:00
Qing
7bedab5748
Add rope_scaling to Qwen ( #1210 )
2023-09-28 00:49:23 -07:00
Qing
28e616c4e3
Fix Qwen-14B model ( #1173 )
2023-09-27 16:33:16 -07:00
Lily Liu
21877b0d75
Support Longchat and RoPE scaling ( #555 )
Co-authored-by: Wing Lian <wing.lian@gmail.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2023-09-27 03:36:02 -07:00
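Linear RoPE scaling ("position interpolation"), which Longchat-style models rely on, divides positions by a fixed factor before computing the rotary angles; a self-contained sketch:

```python
import torch

def linear_scaled_rope_angles(head_dim: int, max_len: int, factor: float,
                              base: float = 10000.0) -> torch.Tensor:
    # Standard RoPE inverse frequencies for each even dimension ...
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    # ... applied to positions compressed by `factor`, so a model trained
    # on max_len / factor tokens can address max_len positions.
    positions = torch.arange(max_len).float() / factor
    return torch.outer(positions, inv_freq)  # shape: (max_len, head_dim // 2)
```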
Woosuk Kwon
03ffd0a022
Add comments on RoPE initialization ( #1176 )
2023-09-26 10:48:33 -07:00
Zhuohan Li
f187877945
[FIX] Simplify sampler logic ( #1156 )
2023-09-23 17:21:56 -07:00
Zhuohan Li
947b794146
[Sampler] Vectorized sampling (simplified) ( #1048 )
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2023-09-22 17:48:04 -07:00
Antoni Baum
3302f0aef3
Read rope_theta and max_position_embeddings from config ( #1096 )
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: wnma3mz <wnma3mz@gmail.com>
2023-09-20 13:35:11 -07:00
Woosuk Kwon
2b1c116b5a
Add minimum capability requirement for AWQ ( #1064 )
2023-09-18 12:02:01 -07:00
Woosuk Kwon
cc796b1358
Convert before transpose ( #1073 )
2023-09-18 11:51:48 -07:00