vLLM

Easy, fast, and cheap LLM serving for everyone

| Documentation | Blog |



vLLM is a fast and easy-to-use library for LLM inference and serving.

vLLM is fast with:

  • State-of-the-art serving throughput
  • Efficient management of attention key and value memory with PagedAttention
  • Dynamic batching of incoming requests
  • Optimized CUDA kernels
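
To give a feel for what paged key/value memory means, here is a toy sketch in Python. It is not vLLM's implementation: the class name `ToyBlockAllocator`, its methods, and the block size are all hypothetical and chosen for illustration. It shows only the bookkeeping idea, assuming the cache is split into fixed-size blocks that are handed to a sequence on demand instead of being reserved up front for the maximum sequence length.

```python
from typing import Dict, List


class ToyBlockAllocator:
    """Toy bookkeeping for a paged KV cache (illustrative only, not vLLM code)."""

    def __init__(self, num_blocks: int, block_size: int) -> None:
        self.block_size = block_size
        # Ids of physical blocks that are currently unused.
        self.free_blocks: List[int] = list(range(num_blocks))
        # Per-sequence block table: sequence id -> physical block ids, in order.
        self.block_tables: Dict[int, List[int]] = {}
        # Number of tokens stored so far for each sequence.
        self.num_tokens: Dict[int, int] = {}

    def append_token(self, seq_id: int) -> int:
        """Reserve room for one more token and return the physical block used."""
        table = self.block_tables.setdefault(seq_id, [])
        count = self.num_tokens.get(seq_id, 0)
        # Allocate a new block only when every block in the table is full.
        if count == len(table) * self.block_size:
            if not self.free_blocks:
                raise MemoryError("no free KV cache blocks")
            table.append(self.free_blocks.pop(0))
        self.num_tokens[seq_id] = count + 1
        return table[-1]

    def free(self, seq_id: int) -> None:
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.num_tokens.pop(seq_id, None)


allocator = ToyBlockAllocator(num_blocks=4, block_size=2)
used = [allocator.append_token(seq_id=0) for _ in range(5)]
print(used)  # five tokens occupy three blocks of size two
```

In this sketch a sequence never holds more than one partially filled block, which is the property that keeps wasted cache memory small.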

vLLM is flexible and easy to use with:

  • Seamless integration with popular HuggingFace models
  • High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
  • Tensor parallelism support for distributed inference
  • Streaming outputs
  • OpenAI-compatible API server
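
As a rough sketch of what talking to the OpenAI-compatible server could look like, the snippet below builds a completion request using only the Python standard library. It assumes a server is already running locally (for example, started with `python -m vllm.entrypoints.openai.api_server --model facebook/opt-125m`) and listening at `http://localhost:8000`; the base URL, the model name, and the helper names `build_completion_request` and `complete` are assumptions made for this example. Check the documentation for the exact launch command and supported request fields.

```python
import json
import urllib.request


def build_completion_request(prompt, model, base_url="http://localhost:8000",
                             max_tokens=16, temperature=0.0):
    """Build a POST request for an OpenAI-style /v1/completions endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, model, **kwargs):
    """Send the request and return the text of the first completion.

    Requires a running server at the assumed base URL.
    """
    request = build_completion_request(prompt, model, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

With a server running, `complete("San Francisco is a", model="facebook/opt-125m")` would return the generated continuation as a string.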

vLLM seamlessly supports many HuggingFace models, including the following architectures:

  • GPT-2 (e.g., gpt2, gpt2-xl)
  • GPTNeoX (e.g., EleutherAI/gpt-neox-20b, databricks/dolly-v2-12b, stabilityai/stablelm-tuned-alpha-7b)
  • LLaMA (e.g., lmsys/vicuna-13b-v1.3, young-geng/koala, openlm-research/open_llama_13b)
  • OPT (e.g., facebook/opt-66b, facebook/opt-iml-max-30b)

Install vLLM with pip or from source:

pip install vllm

Getting Started

Visit our documentation to get started.
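
For orientation, a minimal offline-inference sketch might look like the following. It assumes vLLM is installed and a supported GPU is available; the model `facebook/opt-125m`, the prompts, the sampling values, and the wrapper name `generate_with_vllm` are choices made for this example, not recommendations. The import sits inside the function so the snippet can be loaded on a machine without vLLM.

```python
PROMPTS = ["Hello, my name is", "The capital of France is"]


def generate_with_vllm(prompts, model="facebook/opt-125m"):
    """Generate one completion per prompt with vLLM's offline API."""
    # Imported lazily: vLLM must be installed for this call to work.
    from vllm import LLM, SamplingParams

    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
    llm = LLM(model=model)  # loads the model weights
    outputs = llm.generate(prompts, sampling_params)
    # Each result holds one or more completions; take the first one's text.
    return [output.outputs[0].text for output in outputs]
```

Calling `generate_with_vllm(PROMPTS)` returns a list of generated strings, one per prompt.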

Performance

vLLM achieves up to 24x higher throughput than HuggingFace Transformers (HF) and up to 3.5x higher throughput than Text Generation Inference (TGI). For details, check out our blog post.


Serving throughput when each request asks for 1 output completion.


Serving throughput when each request asks for 3 output completions.

Contributing

We welcome and value any contributions and collaborations. Please check out CONTRIBUTING.md for how to get involved.