
vLLM: Easy, Fast, and Cheap LLM Serving for Everyone

| Documentation | Blog |

vLLM is a fast and easy-to-use library for LLM inference and serving.

Latest News 🔥

Getting Started

Visit our documentation to get started.
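As a quick illustration, offline batched inference might look like the sketch below (the model name `facebook/opt-125m` is just an example; any supported HuggingFace model works, and the exact API surface may differ between versions — see the documentation for the authoritative quickstart):

```python
from vllm import LLM, SamplingParams

# A batch of prompts to complete in one call.
prompts = [
    "Hello, my name is",
    "The capital of France is",
]

# Sampling settings for generation.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Load a HuggingFace model; vLLM manages KV-cache memory via PagedAttention.
llm = LLM(model="facebook/opt-125m")

# Generate completions for all prompts in a single batched call.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```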

Key Features

vLLM comes with many powerful features, including:

  • State-of-the-art performance in serving throughput
  • Efficient management of attention key and value memory with PagedAttention
  • Seamless integration with popular HuggingFace models
  • Dynamic batching of incoming requests
  • Optimized CUDA kernels
  • High-throughput serving with various decoding algorithms, including parallel sampling and beam search
  • Tensor parallelism support for distributed inference
  • Streaming outputs
  • OpenAI-compatible API server
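For the OpenAI-compatible server listed above, a typical workflow is to install the package, launch the server, and query it with standard OpenAI-style requests. The commands below are a sketch: the entrypoint module path and default port (8000) reflect this version of vLLM and may change, and the model name is illustrative:

```shell
# Install vLLM (requires a CUDA-capable GPU).
pip install vllm

# Launch the OpenAI-compatible API server on the default port.
python -m vllm.entrypoints.openai.api_server --model facebook/opt-125m

# In another terminal: query the completions endpoint.
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "facebook/opt-125m", "prompt": "Hello, my name is", "max_tokens": 16}'
```

Because the server speaks the OpenAI API, existing OpenAI client code can usually be pointed at it by changing only the base URL.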

Performance

vLLM achieves up to 24x higher serving throughput than HuggingFace Transformers (HF) and up to 3.5x higher than Text Generation Inference (TGI). For details, check out our blog post.


Serving throughput when each request asks for 1 output completion.


Serving throughput when each request asks for 3 output completions.

Contributing

We welcome and value any contributions and collaborations. Please check out CONTRIBUTING.md for how to get involved.