maximzubkov
521b35f799
Support Microsoft Phi 1.5 ( #1664 )
2023-11-16 14:28:39 -08:00
Simon Mo
cb08cd0d75
[Minor] Fix duplication of ignored seq group in engine step ( #1666 )
2023-11-16 13:11:41 -08:00
twaka
2a2c135b41
Fix loading error when a safetensors file contains an empty tensor ( #1687 )
2023-11-16 10:38:10 -08:00
Aaron Pham
65ea2ddf17
feat(config): support parsing torch.dtype ( #1641 )
Signed-off-by: Aaron <29749331+aarnphm@users.noreply.github.com>
2023-11-16 01:31:06 -08:00
Megha Agarwal
b514d3c496
Revert MptConfig to MPTConfig ( #1668 )
2023-11-16 01:19:39 -08:00
Zhuohan Li
7076fa1c9f
TP/quantization/weight loading refactor part 2 - Refactor quantized linear logic and extend quantization support to all models ( #1622 )
Refactor the tensor parallelism, quantization, and weight-loading code.
Summary of the new features enabled by this PR:
- **All models** can now be quantized with AWQ and SqueezeLLM, with [GPTQ coming soon](https://github.com/vllm-project/vllm/pull/1580).
- The model-loading code became much simpler.
- Tensor parallelism is supported for all MQA/GQA models, even when the number of key/value heads is smaller than the tensor-parallel size.
2023-11-15 22:50:41 -08:00
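As context for the entry above, a minimal sketch of loading a quantized checkpoint through the refactored path; the model name is an illustrative assumption, while the `quantization` argument is the knob this refactor extends to all architectures:

```python
from vllm import LLM, SamplingParams

# Any supported architecture can now be loaded from a quantized checkpoint;
# "awq" and "squeezellm" are the methods enabled at this point.
llm = LLM(
    model="TheBloke/Llama-2-7B-AWQ",  # illustrative AWQ checkpoint
    quantization="awq",
    tensor_parallel_size=2,  # may now exceed the number of KV heads
)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```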
Woosuk Kwon
660a7fcfa4
Add DeepSpeed MII backend to benchmark script ( #1649 )
2023-11-14 12:35:30 -08:00
Woosuk Kwon
054072bee5
[Minor] Move RoPE selection logic to get_rope ( #1633 )
2023-11-12 16:04:50 -08:00
lirui
eb825c1e74
Fix #1474 - AssertionError: assert param_slice.shape == loaded_weight.shape ( #1631 )
2023-11-12 15:53:12 -08:00
Dominik Schwabe
1b290ace4f
Run default _AsyncLLMEngine._run_workers_async in threadpool ( #1628 )
2023-11-11 14:50:44 -08:00
Sin
0d578228ca
config parser: add ChatGLM2 seq_length to _get_and_verify_max_len ( #1617 )
2023-11-09 19:29:51 -08:00
GhaziSyed
aebfcb262a
Dockerfile: Upgrade CUDA to 12.1 ( #1609 )
2023-11-09 11:49:02 -08:00
forpanyang
ab9e8488d5
Add Yi model to quantization support ( #1600 )
2023-11-09 11:47:14 -08:00
Woosuk Kwon
fd58b73a40
Build CUDA 11.8 wheels for release ( #1596 )
2023-11-09 03:52:29 -08:00
Yanming W
8efe23f150
Fix input_metadata.selected_token_indices in worker prepare_inputs ( #1546 )
2023-11-08 14:19:12 -08:00
Zhuohan Li
06458a0b42
Upgrade to CUDA 12 ( #1527 )
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2023-11-08 14:17:49 -08:00
GoHomeToMacDonal
1a2bbc9301
ChatGLM Support ( #1261 )
2023-11-06 16:09:33 -08:00
Roy
e7f579eb97
Support Yi model ( #1567 )
2023-11-06 15:26:03 -08:00
Casper
8516999495
Add Quantization and AutoAWQ to docs ( #1235 )
2023-11-04 22:43:39 -07:00
Antoni Baum
9f669a9a7c
Support YaRN models ( #1264 )
Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
Co-authored-by: Viktor Ferenczi <viktor@ferenczi.eu>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2023-11-03 14:12:48 -07:00
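A short sketch of serving a YaRN-scaled model; the checkpoint name below is an illustrative assumption, and the extended context window comes from the rope_scaling entry in the model's Hugging Face config rather than anything set here:

```python
from vllm import LLM

# A YaRN-scaled checkpoint advertises its extended window via the
# rope_scaling entry in its config; vLLM applies the scaled RoPE
# automatically, so loading looks like any other model.
llm = LLM(model="NousResearch/Yarn-Llama-2-7b-64k")  # illustrative name
```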
Noam Gat
555bdcc5a3
Added logits processor API to sampling params ( #1469 )
2023-11-03 14:12:15 -07:00
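A minimal sketch of the new API, assuming the callable signature of (token ids generated so far, next-step logits) -> logits:

```python
import torch
from vllm import LLM, SamplingParams

def ban_token_42(token_ids: list[int], logits: torch.Tensor) -> torch.Tensor:
    # A logits processor sees the tokens generated so far and the raw logits
    # for the next step, and returns (possibly modified) logits.
    logits[42] = -float("inf")  # never sample token id 42
    return logits

params = SamplingParams(max_tokens=32, logits_processors=[ban_token_42])
# llm = LLM(model="facebook/opt-125m")  # any supported model
# llm.generate(["Hello"], params)
```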
lots-o
54ca1ba71d
docs: add description ( #1553 )
2023-11-03 09:14:52 -07:00
Antoni Baum
9738b84a08
Force paged attention v2 for long contexts ( #1510 )
2023-11-01 16:24:32 -07:00
Woosuk Kwon
1fe0990023
Remove MPTConfig ( #1529 )
2023-11-01 15:29:05 -07:00
Fluder-Paradyne
7e90a2d117
Add /health endpoint for both servers ( #1540 )
2023-11-01 10:29:44 -07:00
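For reference, a liveness probe against a running server; the host and port below are assumptions (the OpenAI-compatible server defaults to 8000):

```python
from urllib.request import urlopen

# /health returns HTTP 200 with an empty body when the engine is up.
with urlopen("http://localhost:8000/health") as resp:
    assert resp.status == 200
```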
ljss
5687d584fe
[BugFix] Set engine_use_ray=True when TP>1 ( #1531 )
2023-11-01 02:14:18 -07:00
Wenfei Yan
cf8849f2d6
Add MptForCausalLM key in model_loader ( #1526 )
2023-10-31 15:46:53 -07:00
Cade Daniel
e575df33b1
[Small] Formatter only checks lints in changed files ( #1528 )
2023-10-31 15:39:38 -07:00
Woosuk Kwon
0ce8647dc5
Fix integer overflows in attention & cache ops ( #1514 )
2023-10-31 15:19:30 -07:00
Stephen Krider
9cabcb7645
Add Dockerfile ( #1350 )
2023-10-31 12:36:47 -07:00
Zhuohan Li
7b895c5976
[Fix] Fix duplicated logging messages ( #1524 )
2023-10-31 09:04:47 -07:00
Dan Lord
7013a80170
Add support for spaces_between_special_tokens
2023-10-30 16:52:56 -07:00
Jared Roesch
79a30912b8
Add py.typed so consumers of vLLM can get type checking ( #1509 )
Co-authored-by: aarnphm <29749331+aarnphm@users.noreply.github.com>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2023-10-30 14:50:47 -07:00
Adam Brusselback
2f3d36a8a1
Fix logging so that INFO-level entries actually appear in the log ( #1494 )
2023-10-30 10:02:21 -07:00
iongpt
ac8d36f3e5
Refactor LLMEngine demo script for clarity and modularity ( #1413 )
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
2023-10-30 09:14:37 -07:00
Antoni Baum
15f5632365
Delay GPU->CPU sync in sampling ( #1337 )
2023-10-30 09:01:34 -07:00
Woosuk Kwon
aa9af07cac
Fix bias in InternLM ( #1501 )
2023-10-29 16:24:18 -07:00
ljss
69be658bba
Support repetition_penalty ( #1424 )
2023-10-29 10:02:41 -07:00
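A short sketch of the new sampling parameter; values above 1.0 penalize tokens that have already appeared, and 1.0 disables the penalty:

```python
from vllm import LLM, SamplingParams

# repetition_penalty > 1.0 down-weights tokens already seen in the prompt
# and in the generation so far.
params = SamplingParams(temperature=0.8, repetition_penalty=1.2, max_tokens=64)
# llm = LLM(model="facebook/opt-125m")
# llm.generate(["Write a short poem about the sea."], params)
```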
Ricardo Lu
beac8dd461
fix: don't skip first special token. ( #1497 )
2023-10-29 04:26:36 -07:00
Qing
28b47d1e49
Add rope_scaling to Aquila model ( #1457 )
2023-10-29 04:25:21 -07:00
chooper1
1f24755bf8
Support SqueezeLLM ( #1326 )
Co-authored-by: squeeze-ai-lab <squeezeailab.bair@gmail.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
2023-10-21 23:14:59 -07:00
Thiago Salvatore
bf31d3606a
Pin pydantic dependency versions ( #1429 )
2023-10-21 11:18:58 -07:00
Wang Ran (汪然)
d189170b6c
remove useless statements ( #1408 )
2023-10-20 08:52:07 -07:00
Light Lin
f61dc8072f
Fix type hints ( #1427 )
2023-10-20 08:50:47 -07:00
Woosuk Kwon
f8a1e39fae
[BugFix] Define __eq__ in SequenceGroupOutputs ( #1389 )
2023-10-17 01:09:44 -07:00
Wang Ran (汪然)
a132435204
Fix typo ( #1383 )
2023-10-16 21:53:37 -07:00
Woosuk Kwon
9524867701
Add Mistral 7B to test_models ( #1366 )
2023-10-16 17:49:54 -07:00
Woosuk Kwon
c1376e0f82
Change scheduler & input tensor shape ( #1381 )
2023-10-16 17:48:42 -07:00
Zhuohan Li
651c614aa4
Bump up the version to v0.2.1 ( #1355 )
2023-10-16 12:58:57 -07:00
Woosuk Kwon
d3a5bd9fb7
Fix sampler test ( #1379 )
2023-10-16 12:57:26 -07:00