Commit Graph

103 Commits

Author · SHA1 · Message · Date
ljss
e1054247ba
[Optimization] Implement fused add + RMSNorm (#1667) 2023-11-18 18:18:02 -08:00
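
The fused add + RMSNorm kernel above merges the residual addition and the normalization into a single memory pass. Below is a minimal unfused reference of the same semantics, assuming the standard RMSNorm definition; the function name and return convention are illustrative, not vLLM's kernel API.

```python
import torch

def fused_add_rms_norm_ref(x: torch.Tensor, residual: torch.Tensor,
                           weight: torch.Tensor, eps: float = 1e-6):
    # Unfused reference: add the residual first, then RMS-normalize the sum.
    hidden = x + residual
    variance = hidden.pow(2).mean(dim=-1, keepdim=True)
    normed = hidden * torch.rsqrt(variance + eps) * weight
    # A fused kernel produces both outputs in one pass over memory.
    return normed, hidden
```
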
Woosuk Kwon
8d17774f92
Add AWQ support for all models (#1714) 2023-11-18 17:56:47 -08:00
twaka
e946260cf3
Use get_tensor in safe_open (#1696) 2023-11-18 16:45:18 -08:00
Woosuk Kwon
bb00f66e19
Use quantization_config in HF config (#1695) 2023-11-17 16:23:49 -08:00
Roy
e87557b069
Support Min P Sampler (#1642) 2023-11-17 16:20:49 -08:00
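
Min-p sampling keeps only tokens whose probability is at least `min_p` times the probability of the most likely token. A minimal sketch of that filtering rule, assuming the conventional min-p definition rather than the PR's exact code:

```python
import torch

def apply_min_p(logits: torch.Tensor, min_p: float) -> torch.Tensor:
    # Mask out tokens with prob < min_p * max_prob along the last dim.
    probs = torch.softmax(logits, dim=-1)
    threshold = min_p * probs.max(dim=-1, keepdim=True).values
    return logits.masked_fill(probs < threshold, float("-inf"))
```
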
maximzubkov
521b35f799
Support Microsoft Phi 1.5 (#1664) 2023-11-16 14:28:39 -08:00
twaka
2a2c135b41
Fix loading error when a safetensors file contains an empty tensor (#1687) 2023-11-16 10:38:10 -08:00
Megha Agarwal
b514d3c496
Revert MptConfig to MPTConfig (#1668) 2023-11-16 01:19:39 -08:00
Zhuohan Li
7076fa1c9f
TP/quantization/weight loading refactor part 2 - Refactor quantized linear logic and extend quantization support to all models (#1622)
Refactor the tensor parallelism, quantization, and weight-loading code.

Summary of the new features enabled by this PR:
- **All models** can be quantized with AWQ and SqueezeLLM, and [soon GPTQ](https://github.com/vllm-project/vllm/pull/1580).
- The model-loading code is now much simpler.
- Model parallelism is supported for all MQA/GQA models when the number of key/value heads is smaller than the tensor parallel size (see the sketch after this entry).
2023-11-15 22:50:41 -08:00
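
A minimal sketch of the KV-head case from the last bullet above: when there are fewer key/value heads than tensor-parallel ranks, heads are replicated so that every rank still serves one. The names and weight layout here are assumptions for illustration, not vLLM's actual loader code.

```python
import torch

def kv_shard(kv_weight: torch.Tensor, num_kv_heads: int, head_dim: int,
             tp_rank: int, tp_size: int) -> torch.Tensor:
    # Returns the rows of the KV projection owned by `tp_rank`, assuming
    # kv_weight is laid out as [num_kv_heads * head_dim, hidden_size].
    if num_kv_heads >= tp_size:
        # Enough heads: split them evenly across ranks.
        per_rank = num_kv_heads // tp_size
        start = tp_rank * per_rank * head_dim
        return kv_weight[start:start + per_rank * head_dim]
    # Fewer heads than ranks: a group of tp_size // num_kv_heads
    # consecutive ranks shares a replicated copy of the same head.
    group = tp_size // num_kv_heads
    head = tp_rank // group
    return kv_weight[head * head_dim:(head + 1) * head_dim]
```
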
Woosuk Kwon
054072bee5
[Minor] Move RoPE selection logic to get_rope (#1633) 2023-11-12 16:04:50 -08:00
lirui
eb825c1e74
Fix #1474 - AssertionError: assert param_slice.shape == loaded_weight.shape (#1631) 2023-11-12 15:53:12 -08:00
forpanyang
ab9e8488d5
Add Yi model to quantization support (#1600) 2023-11-09 11:47:14 -08:00
GoHomeToMacDonal
1a2bbc9301
ChatGLM Support (#1261) 2023-11-06 16:09:33 -08:00
Roy
e7f579eb97
Support Yi model (#1567) 2023-11-06 15:26:03 -08:00
Antoni Baum
9f669a9a7c
Support YaRN models (#1264)
Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
Co-authored-by: Viktor Ferenczi <viktor@ferenczi.eu>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2023-11-03 14:12:48 -07:00
Noam Gat
555bdcc5a3
Add logits processor API to sampling params (#1469) 2023-11-03 14:12:15 -07:00
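
The logits processor API lets callers rewrite the raw logits before sampling. A hedged example, assuming the processor is a callable from (previously generated token ids, logits) to logits; verify the exact SamplingParams contract against the docs.

```python
from typing import List
import torch

def no_immediate_repeat(token_ids: List[int],
                        logits: torch.Tensor) -> torch.Tensor:
    # Toy processor: forbid sampling the previous token again.
    if token_ids:
        logits[token_ids[-1]] = float("-inf")
    return logits

# Assumed usage: SamplingParams(logits_processors=[no_immediate_repeat])
```
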
Antoni Baum
9738b84a08
Force PagedAttention V2 for long contexts (#1510) 2023-11-01 16:24:32 -07:00
Woosuk Kwon
1fe0990023
Remove MPTConfig (#1529) 2023-11-01 15:29:05 -07:00
Wenfei Yan
cf8849f2d6
Add MptForCausalLM key to model_loader (#1526) 2023-10-31 15:46:53 -07:00
Antoni Baum
15f5632365
Delay GPU->CPU sync in sampling (#1337) 2023-10-30 09:01:34 -07:00
Woosuk Kwon
aa9af07cac
Fix bias in InternLM (#1501) 2023-10-29 16:24:18 -07:00
ljss
69be658bba
Support repetition_penalty (#1424) 2023-10-29 10:02:41 -07:00
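
repetition_penalty conventionally divides positive logits of already-seen tokens by the penalty and multiplies negative ones (the CTRL/Transformers convention). A sketch under that assumption; it may differ in detail from the PR's implementation.

```python
import torch

def repetition_penalize(logits: torch.Tensor, seen_ids: torch.Tensor,
                        penalty: float) -> torch.Tensor:
    # Push logits of already-generated token ids away from re-selection.
    scores = logits[seen_ids]
    logits[seen_ids] = torch.where(scores > 0, scores / penalty,
                                   scores * penalty)
    return logits
```
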
Qing
28b47d1e49
Add rope_scaling to Aquila model (#1457) 2023-10-29 04:25:21 -07:00
chooper1
1f24755bf8
Support SqueezeLLM (#1326)
Co-authored-by: squeeze-ai-lab <squeezeailab.bair@gmail.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2023-10-21 23:14:59 -07:00
Wang Ran (汪然)
d189170b6c
Remove useless statements (#1408) 2023-10-20 08:52:07 -07:00
Wang Ran (汪然)
a132435204
Fix typo (#1383) 2023-10-16 21:53:37 -07:00
Woosuk Kwon
c1376e0f82
Change scheduler & input tensor shape (#1381) 2023-10-16 17:48:42 -07:00
Zhuohan Li
9d9072a069
Implement prompt logprobs & batched top-k for computing logprobs (#1328)
Co-authored-by: Yunmo Chen <16273544+wanmok@users.noreply.github.com>
2023-10-16 10:56:50 -07:00
Woosuk Kwon
928de46888
Implement PagedAttention V2 (#1348) 2023-10-16 00:59:57 -07:00
Lu Wang
de89472897
Fix the issue for AquilaChat2-* models (#1339) 2023-10-13 11:51:29 -07:00
Woosuk Kwon
e7c8555d06
Bump up transformers version & Remove MistralConfig (#1254) 2023-10-13 10:05:26 -07:00
Woosuk Kwon
875afe38ab
Add blacklist in model checkpoint (#1325) 2023-10-12 01:05:37 -07:00
amaleshvemula
ee8217e5be
Add Mistral to quantization model list (#1278) 2023-10-11 00:26:24 -07:00
twaka
8285736840
Add workaround for AWQ on Turing GPUs (#1252) 2023-10-10 19:48:16 -07:00
yhlskt23
91fce82c6f
Change the timing of sorting logits (#1309) 2023-10-10 19:37:42 -07:00
Zhuohan Li
b95ee898fe
[Minor] Fix comment in mistral.py (#1303) 2023-10-09 19:44:37 -07:00
Zhuohan Li
ba0bfd40e2
TP/quantization/weight loading refactor part 1 - Simplify parallel linear logic (#1181) 2023-10-02 15:36:09 -07:00
Woosuk Kwon
84e4e37d14
[Minor] Fix type annotations (#1238) 2023-10-02 15:28:31 -07:00
Zhuohan Li
a60b353005
Support sharding Llama-2-70B on more than 8 GPUs (#1209)
Co-authored-by: JiCheng <247153481@qq.com>
2023-10-02 15:26:33 -07:00
Woosuk Kwon
a8e98aee0c
Fix Mistral model (#1220) 2023-09-28 10:44:05 -07:00
Chris Bamford
bb1ba58f06
[Mistral] Mistral-7B-v0.1 support (#1196)
Co-authored-by: timlacroix <t@mistral.ai>
2023-09-28 10:41:03 -07:00
Qing
7bedab5748
Add rope_scaling to Qwen (#1210) 2023-09-28 00:49:23 -07:00
Qing
28e616c4e3
Fix Qwen-14B model (#1173) 2023-09-27 16:33:16 -07:00
Lily Liu
21877b0d75
Support Longchat and RoPE scaling (#555)
Co-authored-by: Wing Lian <wing.lian@gmail.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2023-09-27 03:36:02 -07:00
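
Linear RoPE scaling, the technique behind LongChat-style context extension, compresses positions by a constant factor before computing rotary angles. A generic sketch of that idea; parameter names are illustrative, and this is not the PR's exact code.

```python
import torch

def linear_scaled_rope_angles(head_dim: int, max_pos: int,
                              base: float = 10000.0,
                              scaling_factor: float = 2.0) -> torch.Tensor:
    # Rotary angle table with positions divided by `scaling_factor`,
    # so a 2x-scaled model covers twice the original context window.
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float()
                               / head_dim))
    positions = torch.arange(max_pos).float() / scaling_factor
    return torch.outer(positions, inv_freq)  # [max_pos, head_dim // 2]
```
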
Woosuk Kwon
03ffd0a022
Add comments on RoPE initialization (#1176) 2023-09-26 10:48:33 -07:00
Zhuohan Li
f187877945
[FIX] Simplify sampler logic (#1156) 2023-09-23 17:21:56 -07:00
Zhuohan Li
947b794146
[Sampler] Vectorized sampling (simplified) (#1048)
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2023-09-22 17:48:04 -07:00
Antoni Baum
3302f0aef3
Read rope_theta and max_position_embeddings from config (#1096)
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: wnma3mz <wnma3mz@gmail.com>
2023-09-20 13:35:11 -07:00
Woosuk Kwon
2b1c116b5a
Add minimum capability requirement for AWQ (#1064) 2023-09-18 12:02:01 -07:00
Woosuk Kwon
cc796b1358
Convert before transpose (#1073) 2023-09-18 11:51:48 -07:00