Deploying and scaling up with SkyPilot
================================================

.. raw:: html

    <p align="center">
        <img src="https://imgur.com/yxtzPEu.png" alt="vLLM"/>
    </p>

vLLM can be **run and scaled to multiple service replicas on clouds and Kubernetes** with `SkyPilot <https://github.com/skypilot-org/skypilot>`__, an open-source framework for running LLMs on any cloud. More examples for various open models, such as Llama-3, Mixtral, etc., can be found in the `SkyPilot AI gallery <https://skypilot.readthedocs.io/en/latest/gallery/index.html>`__.
Prerequisites
-------------

Install SkyPilot and check that your cloud credentials are set up:

.. code-block:: console

    pip install skypilot-nightly
    sky check


Run on a single instance
------------------------
See the vLLM SkyPilot YAML for serving, ``serving.yaml``:

.. code-block:: yaml

    resources:
      accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB}  # We can use cheaper accelerators for 8B model.
      use_spot: True
      disk_size: 512  # Ensure model checkpoints can fit.
      disk_tier: best
      ports: 8081  # Expose to internet traffic.

    envs:
      MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
      HF_TOKEN: <your-huggingface-token>  # Change to your own huggingface token, or use --env to pass.

    setup: |
      conda create -n vllm python=3.10 -y
      conda activate vllm

      pip install vllm==0.4.0.post1
      # Install Gradio for web UI.
      pip install gradio openai
      pip install flash-attn==2.5.7

    run: |
      conda activate vllm
      echo 'Starting vllm api server...'
      python -u -m vllm.entrypoints.openai.api_server \
        --port 8081 \
        --model $MODEL_NAME \
        --trust-remote-code \
        --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
        2>&1 | tee api_server.log &

      echo 'Waiting for vllm api server to start...'
      while ! grep -q 'Uvicorn running on' api_server.log; do sleep 1; done

      echo 'Starting gradio server...'
      git clone https://github.com/vllm-project/vllm.git || true
      python vllm/examples/gradio_openai_chatbot_webserver.py \
        -m $MODEL_NAME \
        --port 8811 \
        --model-url http://localhost:8081/v1 \
        --stop-token-ids 128009,128001
Start serving the Llama-3 8B model on any of the candidate GPUs listed (L4, A10g, ...):

.. code-block:: console

    HF_TOKEN="your-huggingface-token" sky launch serving.yaml --env HF_TOKEN

Check the output of the command. There will be a shareable gradio link (like the last line of the following). Open it in your browser to use the Llama model for text completion.

.. code-block:: console

    (task, pid=7431) Running on public URL: https://<gradio-hash>.gradio.live

**Optional**: Serve the 70B model instead of the default 8B and use more GPUs:

.. code-block:: console

    HF_TOKEN="your-huggingface-token" sky launch serving.yaml --gpus A100:8 --env HF_TOKEN --env MODEL_NAME=meta-llama/Meta-Llama-3-70B-Instruct

Scale up to multiple replicas
-----------------------------
SkyPilot can scale up the service to multiple service replicas with built-in autoscaling, load balancing, and fault tolerance. You can do so by adding a ``service`` section to the YAML file:

.. code-block:: yaml

    service:
      replicas: 2
      # An actual request for readiness probe.
      readiness_probe:
        path: /v1/chat/completions
        post_data:
          model: $MODEL_NAME
          messages:
            - role: user
              content: Hello! What is your name?
          max_tokens: 1
.. raw:: html

    <details>
    <summary>Click to see the full recipe YAML</summary>


.. code-block:: yaml

    service:
      replicas: 2
      # An actual request for readiness probe.
      readiness_probe:
        path: /v1/chat/completions
        post_data:
          model: $MODEL_NAME
          messages:
            - role: user
              content: Hello! What is your name?
          max_tokens: 1

    resources:
      accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB}  # We can use cheaper accelerators for 8B model.
      use_spot: True
      disk_size: 512  # Ensure model checkpoints can fit.
      disk_tier: best
      ports: 8081  # Expose to internet traffic.

    envs:
      MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
      HF_TOKEN: <your-huggingface-token>  # Change to your own huggingface token, or use --env to pass.

    setup: |
      conda create -n vllm python=3.10 -y
      conda activate vllm

      pip install vllm==0.4.0.post1
      # Install Gradio for web UI.
      pip install gradio openai
      pip install flash-attn==2.5.7

    run: |
      conda activate vllm
      echo 'Starting vllm api server...'
      python -u -m vllm.entrypoints.openai.api_server \
        --port 8081 \
        --model $MODEL_NAME \
        --trust-remote-code \
        --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
        2>&1 | tee api_server.log

.. raw:: html

    </details>
Start serving the Llama-3 8B model on multiple replicas:

.. code-block:: console

    HF_TOKEN="your-huggingface-token" sky serve up -n vllm serving.yaml --env HF_TOKEN

Wait until the service is ready:

.. code-block:: console

    watch -n10 sky serve status vllm

.. raw:: html

    <details>
    <summary>Example outputs:</summary>

.. code-block:: console

    Services
    NAME  VERSION  UPTIME  STATUS  REPLICAS  ENDPOINT
    vllm  1        35s     READY   2/2       xx.yy.zz.100:30001

    Service Replicas
    SERVICE_NAME  ID  VERSION  IP            LAUNCHED     RESOURCES                STATUS  REGION
    vllm          1   1        xx.yy.zz.121  18 mins ago  1x GCP([Spot]{'L4': 1})  READY   us-east4
    vllm          2   1        xx.yy.zz.245  18 mins ago  1x GCP([Spot]{'L4': 1})  READY   us-east4

.. raw:: html

    </details>
After the service is READY, you can find a single endpoint for the service and access the service with the endpoint:

.. code-block:: console

    ENDPOINT=$(sky serve status --endpoint 8081 vllm)
    curl -L http://$ENDPOINT/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{
          "model": "meta-llama/Meta-Llama-3-8B-Instruct",
          "messages": [
            {
              "role": "system",
              "content": "You are a helpful assistant."
            },
            {
              "role": "user",
              "content": "Who are you?"
            }
          ],
          "stop_token_ids": [128009, 128001]
        }'
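The same request can be sent from Python using only the standard library. This is a minimal sketch: the ``post_chat`` helper and the commented-out endpoint value are illustrative, standing in for whatever ``sky serve status --endpoint 8081 vllm`` returns in your deployment.

```python
import json
import urllib.request

# The same chat-completion body as the curl example above.
payload = {
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who are you?"},
    ],
    # Llama-3 end-of-turn (128009) and end-of-text (128001) token ids,
    # passed as stop tokens to avoid run-on generations.
    "stop_token_ids": [128009, 128001],
}

def post_chat(endpoint: str, body: dict) -> dict:
    """POST a request to the vLLM OpenAI-compatible chat endpoint."""
    req = urllib.request.Request(
        f"http://{endpoint}/v1/chat/completions",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Uncomment with a real endpoint from `sky serve status`:
# reply = post_chat("xx.yy.zz.100:30001", payload)
# print(reply["choices"][0]["message"]["content"])
```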
To enable autoscaling, you could replace the ``replicas`` field with the following configs in the ``service`` section:

.. code-block:: yaml

    service:
      replica_policy:
        min_replicas: 2
        max_replicas: 4
        target_qps_per_replica: 2

This will scale the service up when the QPS exceeds 2 for each replica.
.. raw:: html

    <details>
    <summary>Click to see the full recipe YAML</summary>


.. code-block:: yaml

    service:
      replica_policy:
        min_replicas: 2
        max_replicas: 4
        target_qps_per_replica: 2
      # An actual request for readiness probe.
      readiness_probe:
        path: /v1/chat/completions
        post_data:
          model: $MODEL_NAME
          messages:
            - role: user
              content: Hello! What is your name?
          max_tokens: 1

    resources:
      accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB}  # We can use cheaper accelerators for 8B model.
      use_spot: True
      disk_size: 512  # Ensure model checkpoints can fit.
      disk_tier: best
      ports: 8081  # Expose to internet traffic.

    envs:
      MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
      HF_TOKEN: <your-huggingface-token>  # Change to your own huggingface token, or use --env to pass.

    setup: |
      conda create -n vllm python=3.10 -y
      conda activate vllm

      pip install vllm==0.4.0.post1
      # Install Gradio for web UI.
      pip install gradio openai
      pip install flash-attn==2.5.7

    run: |
      conda activate vllm
      echo 'Starting vllm api server...'
      python -u -m vllm.entrypoints.openai.api_server \
        --port 8081 \
        --model $MODEL_NAME \
        --trust-remote-code \
        --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
        2>&1 | tee api_server.log

.. raw:: html

    </details>
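As a rough illustration of how the ``replica_policy`` fields interact (a sketch of the scaling rule only, not SkyServe's actual autoscaler implementation), the target replica count follows the observed QPS, clamped to the configured bounds:

```python
import math

def target_replicas(qps: float, target_qps_per_replica: float,
                    min_replicas: int, max_replicas: int) -> int:
    """Replica count needed so each replica sees at most the target QPS,
    clamped to the [min_replicas, max_replicas] range."""
    needed = math.ceil(qps / target_qps_per_replica)
    return max(min_replicas, min(max_replicas, needed))

# With min_replicas=2, max_replicas=4, target_qps_per_replica=2:
print(target_replicas(1, 2, 2, 4))   # low traffic stays at the floor: 2
print(target_replicas(7, 2, 2, 4))   # ceil(7/2) = 4 replicas
print(target_replicas(20, 2, 2, 4))  # capped at max_replicas: 4
```

Keeping ``min_replicas`` at 2 means the service never scales to zero, so it can absorb a traffic spike without a cold start.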
To update the service with the new config:

.. code-block:: console

    HF_TOKEN="your-huggingface-token" sky serve update vllm serving.yaml --env HF_TOKEN


To stop the service:

.. code-block:: console

    sky serve down vllm

**Optional**: Connect a GUI to the endpoint
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
It is also possible to access the Llama-3 service with a separate GUI frontend, so the user requests sent to the GUI will be load-balanced across replicas.
.. raw:: html

    <details>
    <summary>Click to see the full GUI YAML</summary>

.. code-block:: yaml

    envs:
      MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
      ENDPOINT: x.x.x.x:3031  # Address of the API server running vllm.

    resources:
      cpus: 2

    setup: |
      conda create -n vllm python=3.10 -y
      conda activate vllm

      # Install Gradio for web UI.
      pip install gradio openai

    run: |
      conda activate vllm
      export PATH=$PATH:/sbin

      echo 'Starting gradio server...'
      git clone https://github.com/vllm-project/vllm.git || true
      python vllm/examples/gradio_openai_chatbot_webserver.py \
        -m $MODEL_NAME \
        --port 8811 \
        --model-url http://$ENDPOINT/v1 \
        --stop-token-ids 128009,128001 | tee ~/gradio.log

.. raw:: html

    </details>
1. Start the chat web UI:

   .. code-block:: console

       sky launch -c gui ./gui.yaml --env ENDPOINT=$(sky serve status --endpoint vllm)

2. Then, we can access the GUI at the returned gradio link:

   .. code-block:: console

       | INFO | stdout | Running on public URL: https://6141e84201ce0bb4ed.gradio.live