Welcome to vLLM!
================

.. figure:: ./assets/logos/vllm-logo-text-light.png
  :width: 60%
  :align: center
  :alt: vLLM
  :class: no-scaled-link

.. raw:: html

   <p style="text-align:center">
   <strong>Easy, fast, and cheap LLM serving for everyone
   </strong>
   </p>

   <p style="text-align:center">
   <script async defer src="https://buttons.github.io/buttons.js"></script>
   <a class="github-button" href="https://github.com/vllm-project/vllm" data-show-count="true" data-size="large" aria-label="Star">Star</a>
   <a class="github-button" href="https://github.com/vllm-project/vllm/subscription" data-icon="octicon-eye" data-size="large" aria-label="Watch">Watch</a>
   <a class="github-button" href="https://github.com/vllm-project/vllm/fork" data-icon="octicon-repo-forked" data-size="large" aria-label="Fork">Fork</a>
   </p>


vLLM is a fast and easy-to-use library for LLM inference and serving.

vLLM is fast with:

* State-of-the-art serving throughput
* Efficient management of attention key and value memory with **PagedAttention**
* Continuous batching of incoming requests
* Fast model execution with CUDA/HIP graphs
* Quantization: `GPTQ <https://arxiv.org/abs/2210.17323>`_, `AWQ <https://arxiv.org/abs/2306.00978>`_, `SqueezeLLM <https://arxiv.org/abs/2306.07629>`_, FP8 KV Cache
* Optimized CUDA kernels

vLLM is flexible and easy to use with:

* Seamless integration with popular Hugging Face models
* High-throughput serving with various decoding algorithms, including *parallel sampling*, *beam search*, and more
* Tensor parallelism support for distributed inference
* Streaming outputs
* OpenAI-compatible API server
* Support for NVIDIA and AMD GPUs
* (Experimental) Prefix caching support
* (Experimental) Multi-LoRA support

For more information, check out the following:

* `vLLM announcement blog post <https://vllm.ai>`_ (intro to PagedAttention)
* `vLLM paper <https://arxiv.org/abs/2309.06180>`_ (SOSP 2023)
* `How continuous batching enables 23x throughput in LLM inference while reducing p50 latency <https://www.anyscale.com/blog/continuous-batching-llm-inference>`_ by Cade Daniel et al.
* :ref:`vLLM Meetups <meetups>`


Documentation
-------------

.. toctree::
   :maxdepth: 1
   :caption: Getting Started

   getting_started/installation
   getting_started/amd-installation
   getting_started/cpu-installation
   getting_started/neuron-installation
   getting_started/tpu-installation
   getting_started/quickstart
   getting_started/debugging
   getting_started/examples/examples_index

.. toctree::
   :maxdepth: 1
   :caption: Serving

   serving/openai_compatible_server
   serving/deploying_with_docker
   serving/distributed_serving
   serving/metrics
   serving/env_vars
   serving/usage_stats
   serving/integrations
   serving/tensorizer

.. toctree::
   :maxdepth: 1
   :caption: Models

   models/supported_models
   models/adding_model
   models/engine_args
   models/lora
   models/vlm
   models/spec_decode
   models/performance

.. toctree::
   :maxdepth: 1
   :caption: Quantization

   quantization/auto_awq
   quantization/fp8
   quantization/fp8_e5m2_kvcache
   quantization/fp8_e4m3_kvcache

.. toctree::
   :maxdepth: 1
   :caption: Automatic Prefix Caching

   automatic_prefix_caching/apc
   automatic_prefix_caching/details

.. toctree::
   :caption: Developer Documentation

   dev/sampling_params
   dev/offline_inference/offline_index
   dev/engine/engine_index
   dev/kernel/paged_attention
   dev/multimodal/multimodal_index
   dev/dockerfile/dockerfile

.. toctree::
   :maxdepth: 1
   :caption: Community

   community/meetups
   community/sponsors

Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`