diff --git a/docs/source/serving/integrations.rst b/docs/source/serving/integrations.rst
index 7882e14f..f39997e0 100644
--- a/docs/source/serving/integrations.rst
+++ b/docs/source/serving/integrations.rst
@@ -13,3 +13,4 @@ Integrations
   deploying_with_dstack
   serving_with_langchain
   serving_with_llamaindex
+  serving_with_llamastack
diff --git a/docs/source/serving/serving_with_llamastack.rst b/docs/source/serving/serving_with_llamastack.rst
new file mode 100644
index 00000000..8ef96c4e
--- /dev/null
+++ b/docs/source/serving/serving_with_llamastack.rst
@@ -0,0 +1,42 @@
+.. _run_on_llamastack:
+
+Serving with Llama Stack
+============================
+
+vLLM is also available via `Llama Stack <https://github.com/meta-llama/llama-stack>`_.
+
+To install Llama Stack, run:
+
+.. code-block:: console
+
+    $ pip install llama-stack -q
+
+Inference using OpenAI Compatible API
+-------------------------------------
+
+Then start the Llama Stack server, pointing it to your vLLM server with the following configuration:
+
+.. code-block:: yaml
+
+    inference:
+      - provider_id: vllm0
+        provider_type: remote::vllm
+        config:
+          url: http://127.0.0.1:8000
+
+Please refer to `this guide `_ for more details on this remote vLLM provider.
+
+Inference via Embedded vLLM
+---------------------------
+
+An `inline vLLM provider
+`_
+is also available. Here is a sample configuration using that method:
+
+.. code-block:: yaml
+
+    inference:
+      - provider_type: vllm
+        config:
+          model: Llama3.1-8B-Instruct
+          tensor_parallel_size: 4
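+
+The remote provider above expects an OpenAI-compatible vLLM server to already be listening at the configured ``url``. As a minimal sketch (the model name and port below are illustrative assumptions, not part of the Llama Stack configuration), such a server can be started with:
+
+.. code-block:: console
+
+    $ # Start vLLM's OpenAI-compatible server on the port referenced in the config above
+    $ vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000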
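+
+Because the remote provider talks to vLLM over its OpenAI-compatible API, the same endpoint can also be exercised directly. The sketch below uses the ``openai`` Python client; the base URL, placeholder API key, and model name are assumptions for illustration:
+
+.. code-block:: python
+
+    from openai import OpenAI
+
+    # Point the OpenAI client at the vLLM server configured above (assumed URL).
+    client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="EMPTY")
+
+    # Request a single chat completion; the model name is an assumption.
+    response = client.chat.completions.create(
+        model="meta-llama/Llama-3.1-8B-Instruct",
+        messages=[{"role": "user", "content": "Hello!"}],
+    )
+    print(response.choices[0].message.content)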