Simon Mo
0f621c2c7d
[Docs] Add information about using shared memory in docker ( #1845 )
2023-11-29 18:33:56 -08:00
Woosuk Kwon
a9e4574261
Refactor Attention ( #1840 )
2023-11-29 15:37:31 -08:00
FlorianJoncour
0229c386c5
Better integration with Ray Serve ( #1821 )
...
Co-authored-by: FlorianJoncour <florian@zetta-sys.com>
2023-11-29 13:25:43 -08:00
Woosuk Kwon
a7b3e33078
[Fix] Fix RoPE in ChatGLM-32K ( #1841 )
2023-11-29 13:01:19 -08:00
Zhuohan Li
e19a64c7ef
[FIX] Fix formatting error in main branch ( #1822 )
2023-11-28 16:56:43 -08:00
Zhuohan Li
1cb4ad8de9
[FIX] Fix formatting error
2023-11-29 00:40:19 +00:00
explainerauthors
6ed068a71a
Use the type BlockTable ( #1791 )
2023-11-28 16:34:05 -08:00
Zhuohan Li
708e6c18b0
[FIX] Fix class naming ( #1803 )
2023-11-28 14:08:01 -08:00
Woosuk Kwon
b943890484
Fix OPT param names ( #1819 )
2023-11-28 11:22:44 -08:00
explainerauthors
a1125ad4df
Correct comments in parallel_state.py ( #1818 )
2023-11-28 10:19:35 -08:00
ljss
a8b150c595
Init model on GPU to reduce CPU memory footprint ( #1796 )
2023-11-27 11:18:26 -08:00
Yunmo Chen
665cbcec4b
Added echo function to OpenAI API server ( #1504 )
2023-11-26 21:29:17 -08:00
Woosuk Kwon
7c600440f7
Fix model docstrings ( #1764 )
2023-11-23 23:04:44 -08:00
Yanming W
e0c6f556e8
[Build] Avoid building too many extensions ( #1624 )
2023-11-23 16:31:19 -08:00
ljss
de23687d16
Fix repetition penalty to align with the Hugging Face implementation ( #1577 )
2023-11-22 14:41:44 -08:00
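The fix above aligns repetition penalty with the Hugging Face convention, under which a previously generated token's logit is divided by the penalty when positive and multiplied by it when negative, so the penalty always pushes the logit down. A minimal pure-Python sketch of that convention (illustrative only, not vLLM's actual kernel):

```python
def apply_repetition_penalty(logits, seen_token_ids, penalty=1.2):
    """Penalize logits of tokens that already appeared (HF-style convention)."""
    out = list(logits)
    for tok in set(seen_token_ids):
        # Dividing a positive logit or multiplying a negative one both
        # lower the token's score for penalty > 1.
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out

print(apply_repetition_penalty([2.0, -1.0, 0.5], seen_token_ids=[0, 1]))
```

Note that a naive symmetric penalty (always dividing) would make negative logits *more* likely, which is the kind of discrepancy such an alignment fix addresses.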
ljss
4cea74c73b
Set top_p=0 and top_k=-1 in greedy sampling ( #1748 )
2023-11-22 12:51:09 -08:00
Casper
a921d8be9d
[DOCS] Add engine args documentation ( #1741 )
2023-11-22 12:31:27 -08:00
陈序
094f716bf2
Add stop_token_ids in SamplingParams.__repr__ ( #1745 )
2023-11-21 20:13:53 -08:00
Zhuohan Li
7d761fe3c1
[FIX] Fix the case when input_is_parallel=False for ScaledActivation ( #1737 )
2023-11-20 23:56:48 -08:00
Woosuk Kwon
cf35d8f3d7
[BugFix] Fix TP support for AWQ ( #1731 )
2023-11-20 21:42:45 -08:00
boydfd
4bb6b67188
Fix RAM OOM when loading large models in tensor parallel mode ( #1395 )
...
Co-authored-by: ran_lin <rlin@thoughtworks.com>
2023-11-20 19:02:42 -08:00
ljss
819b18e7ba
Rewrite torch.repeat_interleave to remove cpu synchronization ( #1599 )
2023-11-20 17:46:32 -08:00
Zhuofan
19849db573
[Fix] Fix bugs in scheduler ( #1727 )
2023-11-20 16:10:50 -08:00
陈序
3d4ceb292c
Fix hanging in the scheduler caused by long prompts ( #1534 )
2023-11-20 16:06:49 -08:00
Woosuk Kwon
f5a37c6c6c
[BugFix] Fix a bug in loading safetensors ( #1732 )
2023-11-20 15:51:18 -08:00
Zhuohan Li
32c927b53f
[FIX] Update the doc link in README.md ( #1730 )
2023-11-20 12:46:24 -08:00
Simon Mo
5ffc0d13a2
Migrate linter from pylint to ruff ( #1665 )
2023-11-20 11:58:01 -08:00
Wen Sun
112627e8b2
[Docs] Fix the code block's format in deploying_with_docker page ( #1722 )
2023-11-20 01:22:39 -08:00
Simon Mo
37c1e3c218
Documentation about official docker image ( #1709 )
2023-11-19 20:56:26 -08:00
Woosuk Kwon
06e9ebebd5
Add instructions to install vLLM+cu118 ( #1717 )
2023-11-18 23:48:58 -08:00
Woosuk Kwon
c5f7740d89
Bump up to v0.2.2 ( #1689 )
2023-11-18 21:57:07 -08:00
Woosuk Kwon
be66d9b125
Fix warning msg on quantization ( #1715 )
2023-11-18 21:49:55 -08:00
ljss
e1054247ba
[Optimization] Implement fused add rmsnorm ( #1667 )
2023-11-18 18:18:02 -08:00
Woosuk Kwon
8d17774f92
Add AWQ support for all models ( #1714 )
2023-11-18 17:56:47 -08:00
twaka
e946260cf3
Use get_tensor in safe_open ( #1696 )
2023-11-18 16:45:18 -08:00
liuyhwangyh
edb305584b
Support downloading models from www.modelscope.cn ( #1588 )
2023-11-17 20:38:31 -08:00
Woosuk Kwon
bb00f66e19
Use quantization_config in hf config ( #1695 )
2023-11-17 16:23:49 -08:00
Roy
e87557b069
Support Min P Sampler ( #1642 )
2023-11-17 16:20:49 -08:00
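The min-p sampler added above keeps only tokens whose probability is at least `min_p` times the probability of the most likely token, so the cutoff scales with the model's confidence. A small illustrative sketch in pure Python (not vLLM's implementation):

```python
import math

def min_p_filter(logits, min_p=0.1):
    """Return indices of tokens that survive a min-p cutoff."""
    mx = max(logits)
    probs = [math.exp(l - mx) for l in logits]   # shifted for numerical stability
    total = sum(probs)
    probs = [p / total for p in probs]           # softmax probabilities
    threshold = min_p * max(probs)               # cutoff scales with the top token
    return [i for i, p in enumerate(probs) if p >= threshold]

# A confident distribution prunes more aggressively at the same min_p.
print(min_p_filter([3.0, 1.0, 0.0], min_p=0.2))
print(min_p_filter([3.0, 1.0, 0.0], min_p=0.1))
```

Unlike a fixed top-p nucleus, the surviving set shrinks automatically when one token dominates and widens when the distribution is flat.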
Zhuofan
dcc543a298
[Minor] Fix comment ( #1704 )
2023-11-17 09:42:49 -08:00
Zhuohan Li
0fc280b06c
Update the adding-model doc according to the new refactor ( #1692 )
2023-11-16 18:46:26 -08:00
Zhuohan Li
20d0699d49
[Fix] Fix comm test ( #1691 )
2023-11-16 16:28:39 -08:00
Iskren Ivov Chernev
686f5e3210
Return usage for openai streaming requests ( #1663 )
2023-11-16 15:28:36 -08:00
Zhuohan Li
415d109527
[Fix] Update Supported Models List ( #1690 )
2023-11-16 14:47:26 -08:00
maximzubkov
521b35f799
Support Microsoft Phi 1.5 ( #1664 )
2023-11-16 14:28:39 -08:00
Simon Mo
cb08cd0d75
[Minor] Fix duplication of ignored seq group in engine step ( #1666 )
2023-11-16 13:11:41 -08:00
twaka
2a2c135b41
Fix loading error when safetensors contains empty tensor ( #1687 )
2023-11-16 10:38:10 -08:00
Aaron Pham
65ea2ddf17
feat(config): support parsing torch.dtype ( #1641 )
...
Signed-off-by: Aaron <29749331+aarnphm@users.noreply.github.com>
2023-11-16 01:31:06 -08:00
Megha Agarwal
b514d3c496
Revert MptConfig to MPTConfig ( #1668 )
2023-11-16 01:19:39 -08:00
Zhuohan Li
7076fa1c9f
TP/quantization/weight loading refactor part 2 - Refactor quantized linear logic and extend quantization support to all models ( #1622 )
...
Refactor the tensor parallelism, quantization, and weight-loading codes.
Summary of the new features enabled by this PR:
- **All models** are able to be quantized with AWQ and SqueezeLLM, and [soon GPTQ](https://github.com/vllm-project/vllm/pull/1580).
- The model loading code becomes much simpler.
- Support model parallelism for all MQA/GQA models when the number of key/value heads is smaller than the tensor parallel size.
2023-11-15 22:50:41 -08:00
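The refactor above mentions tensor parallelism for MQA/GQA models whose key/value head count is below the tensor-parallel size; the usual approach is to replicate each KV head across several ranks so every rank still holds a copy. A hypothetical sketch of that rank-to-head arithmetic (names are illustrative, not vLLM's actual code):

```python
def kv_head_for_rank(num_kv_heads, tp_size, rank):
    """Map a tensor-parallel rank to the KV head it serves.

    When num_kv_heads < tp_size, each KV head is replicated across
    tp_size // num_kv_heads consecutive ranks.
    """
    assert tp_size % num_kv_heads == 0, "tp_size must be a multiple of num_kv_heads"
    replicas = tp_size // num_kv_heads
    return rank // replicas  # the single KV head this rank holds a copy of

# e.g. 2 KV heads sharded over 8 ranks: ranks 0-3 serve head 0, ranks 4-7 head 1
print([kv_head_for_rank(2, 8, r) for r in range(8)])  # [0, 0, 0, 0, 1, 1, 1, 1]
```

Replication trades a little duplicated KV-cache memory for the ability to keep the query heads evenly sharded at any tensor-parallel degree.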
Woosuk Kwon
660a7fcfa4
Add DeepSpeed MII backend to benchmark script ( #1649 )
2023-11-14 12:35:30 -08:00