diff --git a/docs/source/serving/run_on_sky.rst b/docs/source/serving/run_on_sky.rst
index 2c88d24d..bd33c76c 100644
--- a/docs/source/serving/run_on_sky.rst
+++ b/docs/source/serving/run_on_sky.rst
@@ -1,7 +1,7 @@
 .. _on_cloud:
 
-Running on clouds with SkyPilot
-===============================
+Deploying and scaling up with SkyPilot
+================================================
 
 .. raw:: html
 
@@ -9,51 +9,75 @@ Running on clouds with SkyPilot
     <img ... alt="vLLM"/>

-vLLM can be run on the cloud to scale to multiple GPUs with `SkyPilot <https://github.com/skypilot-org/skypilot>`__, an open-source framework for running LLMs on any cloud.
+vLLM can be **run and scaled to multiple service replicas on clouds and Kubernetes** with `SkyPilot <https://github.com/skypilot-org/skypilot>`__, an open-source framework for running LLMs on any cloud. More examples for various open models, such as Llama-3, Mixtral, etc., can be found in the `SkyPilot AI gallery <https://skypilot.readthedocs.io/en/latest/gallery/index.html>`__.
 
-To install SkyPilot and setup your cloud credentials, run:
+
+Prerequisites
+-------------
+
+- Go to the `HuggingFace model page <https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct>`__ and request access to the model :code:`meta-llama/Meta-Llama-3-8B-Instruct`.
+- Check that you have installed SkyPilot (`docs <https://skypilot.readthedocs.io/en/latest/getting-started/installation.html>`__).
+- Check that :code:`sky check` shows clouds or Kubernetes are enabled.
 
 .. code-block:: console
 
-    $ pip install skypilot
-    $ sky check
+    pip install skypilot-nightly
+    sky check
+
+
+Run on a single instance
+------------------------
 
 See the vLLM SkyPilot YAML for serving, `serving.yaml <https://github.com/skypilot-org/skypilot/blob/master/llm/vllm/serve.yaml>`__.
 
 .. code-block:: yaml
 
     resources:
-      accelerators: A100
+      accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB}  # We can use cheaper accelerators for 8B model.
+      use_spot: True
+      disk_size: 512  # Ensure model checkpoints can fit.
+      disk_tier: best
+      ports: 8081  # Expose to internet traffic.
 
     envs:
-      MODEL_NAME: decapoda-research/llama-13b-hf
-      TOKENIZER: hf-internal-testing/llama-tokenizer
+      MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
+      HF_TOKEN: # Change to your own huggingface token, or use --env to pass.
 
     setup: |
-      conda create -n vllm python=3.9 -y
+      conda create -n vllm python=3.10 -y
       conda activate vllm
-      git clone https://github.com/vllm-project/vllm.git
-      cd vllm
-      pip install .
-      pip install gradio
+
+      pip install vllm==0.4.0.post1
+      # Install Gradio for web UI.
+      pip install gradio openai
+      pip install flash-attn==2.5.7
 
     run: |
       conda activate vllm
       echo 'Starting vllm api server...'
-      python -u -m vllm.entrypoints.api_server \
-        --model $MODEL_NAME \
-        --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
-        --tokenizer $TOKENIZER 2>&1 | tee api_server.log &
+      python -u -m vllm.entrypoints.openai.api_server \
+        --port 8081 \
+        --model $MODEL_NAME \
+        --trust-remote-code \
+        --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
+        2>&1 | tee api_server.log &
+
       echo 'Waiting for vllm api server to start...'
       while ! `cat api_server.log | grep -q 'Uvicorn running on'`; do sleep 1; done
-      echo 'Starting gradio server...'
-      python vllm/examples/gradio_webserver.py
-Start the serving the LLaMA-13B model on an A100 GPU:
+
+      echo 'Starting gradio server...'
+      git clone https://github.com/vllm-project/vllm.git || true
+      python vllm/examples/gradio_openai_chatbot_webserver.py \
+        -m $MODEL_NAME \
+        --port 8811 \
+        --model-url http://localhost:8081/v1 \
+        --stop-token-ids 128009,128001
+
+Start serving the Llama-3 8B model on any of the candidate GPUs listed (L4, A10g, ...):
 
 .. code-block:: console
 
-    $ sky launch serving.yaml
+    HF_TOKEN="your-huggingface-token" sky launch serving.yaml --env HF_TOKEN
 
 Check the output of the command. There will be a shareable gradio link (like the last line of the following). Open it in your browser to use the LLaMA model to do the text completion.
@@ -61,9 +85,226 @@ Check the output of the command. There will be a shareable gradio link (like the
 
     (task, pid=7431) Running on public URL: https://<gradio-hash>.gradio.live
 
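+Besides the Gradio UI, the task above also exposes an OpenAI-compatible API on port 8081 (opened via ``ports: 8081``). As a quick sanity check, you can query it directly with the ``openai`` Python client. This is a minimal sketch, not part of the recipe: ``VLLM_HOST`` is a placeholder environment variable that you set to the public IP of the launched instance.
+
+.. code-block:: python
+
+    import os
+
+    from openai import OpenAI
+
+    # Placeholder: export VLLM_HOST=<public IP of the launched instance>.
+    host = os.environ.get("VLLM_HOST", "localhost")
+
+    # vLLM's OpenAI-compatible server accepts any API key unless one is configured.
+    client = OpenAI(base_url=f"http://{host}:8081/v1", api_key="EMPTY")
+
+    response = client.chat.completions.create(
+        model="meta-llama/Meta-Llama-3-8B-Instruct",
+        messages=[{"role": "user", "content": "Who are you?"}],
+        max_tokens=64,
+    )
+    print(response.choices[0].message.content)
+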
-**Optional**: Serve the 65B model instead of the default 13B and use more GPU:
+**Optional**: Serve the 70B model instead of the default 8B and use more GPUs:
 
 .. code-block:: console
 
-    sky launch -c vllm-serve-new -s serve.yaml --gpus A100:8 --env MODEL_NAME=decapoda-research/llama-65b-hf
+    HF_TOKEN="your-huggingface-token" sky launch serving.yaml --gpus A100:8 --env HF_TOKEN --env MODEL_NAME=meta-llama/Meta-Llama-3-70B-Instruct
+
+
+Scale up to multiple replicas
+-----------------------------
+
+SkyPilot can scale the service up to multiple replicas with built-in autoscaling, load balancing, and fault tolerance. You can do this by adding a ``service`` section to the YAML file.
+
+.. code-block:: yaml
+
+    service:
+      replicas: 2
+      # An actual request for readiness probe.
+      readiness_probe:
+        path: /v1/chat/completions
+        post_data:
+          model: $MODEL_NAME
+          messages:
+            - role: user
+              content: Hello! What is your name?
+          max_tokens: 1
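+
+For reference, the readiness probe above is roughly equivalent to the request sketched below: SkyServe periodically sends it to each replica and considers the replica ready once it gets a successful response. This is an illustration only; ``REPLICA_HOST`` is a placeholder for a replica address.
+
+.. code-block:: python
+
+    import os
+
+    import requests
+
+    # Placeholder: the address of one replica; SkyServe normally issues this probe for you.
+    host = os.environ.get("REPLICA_HOST", "localhost")
+
+    resp = requests.post(
+        f"http://{host}:8081/v1/chat/completions",
+        json={
+            "model": "meta-llama/Meta-Llama-3-8B-Instruct",
+            "messages": [{"role": "user", "content": "Hello! What is your name?"}],
+            "max_tokens": 1,
+        },
+        timeout=30,
+    )
+    resp.raise_for_status()  # a 2xx response means the replica can serve requests
+    print(resp.json()["choices"][0]["message"]["content"])
+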
+.. raw:: html
+
+   <details>
+   <summary>Click to see the full recipe YAML</summary>
+
+.. code-block:: yaml
+
+    service:
+      replicas: 2
+      # An actual request for readiness probe.
+      readiness_probe:
+        path: /v1/chat/completions
+        post_data:
+          model: $MODEL_NAME
+          messages:
+            - role: user
+              content: Hello! What is your name?
+          max_tokens: 1
+
+    resources:
+      accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB}  # We can use cheaper accelerators for 8B model.
+      use_spot: True
+      disk_size: 512  # Ensure model checkpoints can fit.
+      disk_tier: best
+      ports: 8081  # Expose to internet traffic.
+
+    envs:
+      MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
+      HF_TOKEN: # Change to your own huggingface token, or use --env to pass.
+
+    setup: |
+      conda create -n vllm python=3.10 -y
+      conda activate vllm
+
+      pip install vllm==0.4.0.post1
+      # Install Gradio for web UI.
+      pip install gradio openai
+      pip install flash-attn==2.5.7
+
+    run: |
+      conda activate vllm
+      echo 'Starting vllm api server...'
+      python -u -m vllm.entrypoints.openai.api_server \
+        --port 8081 \
+        --model $MODEL_NAME \
+        --trust-remote-code \
+        --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
+        2>&1 | tee api_server.log &
+
+      echo 'Waiting for vllm api server to start...'
+      while ! `cat api_server.log | grep -q 'Uvicorn running on'`; do sleep 1; done
+
+      echo 'Starting gradio server...'
+      git clone https://github.com/vllm-project/vllm.git || true
+      python vllm/examples/gradio_openai_chatbot_webserver.py \
+        -m $MODEL_NAME \
+        --port 8811 \
+        --model-url http://localhost:8081/v1 \
+        --stop-token-ids 128009,128001
+
+.. raw:: html
+
+   </details>
+
+Start serving the Llama-3 8B model on multiple replicas:
+
+.. code-block:: console
+
+    HF_TOKEN="your-huggingface-token" sky serve up -n vllm serving.yaml --env HF_TOKEN
+
+
+Wait until the service is ready:
+
+.. code-block:: console
+
+    watch -n10 sky serve status vllm
+
+
+.. raw:: html
+
+   <details>
+   <summary>Example outputs:</summary>
+
+.. code-block:: console
+
+    Services
+    NAME  VERSION  UPTIME  STATUS  REPLICAS  ENDPOINT
+    vllm  1        35s     READY   2/2       xx.yy.zz.100:30001
+
+    Service Replicas
+    SERVICE_NAME  ID  VERSION  IP            LAUNCHED     RESOURCES          STATUS  REGION
+    vllm          1   1        xx.yy.zz.121  18 mins ago  1x GCP({'L4': 1})  READY   us-east4
+    vllm          2   1        xx.yy.zz.245  18 mins ago  1x GCP({'L4': 1})  READY   us-east4
+
+.. raw:: html
+
+   </details>
+
+After the service is READY, you can find a single endpoint for the service and access it with that endpoint:
+
+.. code-block:: console
+
+    ENDPOINT=$(sky serve status --endpoint 8081 vllm)
+    curl -L http://$ENDPOINT/v1/chat/completions \
+        -H "Content-Type: application/json" \
+        -d '{
+          "model": "meta-llama/Meta-Llama-3-8B-Instruct",
+          "messages": [
+            {
+              "role": "system",
+              "content": "You are a helpful assistant."
+            },
+            {
+              "role": "user",
+              "content": "Who are you?"
+            }
+          ],
+          "stop_token_ids": [128009, 128001]
+        }'
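+
+The same request can be sent from Python with the ``openai`` client. This is a minimal sketch: ``VLLM_ENDPOINT`` is a placeholder environment variable that you set to the value printed by ``sky serve status --endpoint 8081 vllm``.
+
+.. code-block:: python
+
+    import os
+
+    from openai import OpenAI
+
+    # Placeholder: export VLLM_ENDPOINT=<ip>:<port> from the command above.
+    endpoint = os.environ["VLLM_ENDPOINT"]
+
+    client = OpenAI(base_url=f"http://{endpoint}/v1", api_key="EMPTY")
+
+    response = client.chat.completions.create(
+        model="meta-llama/Meta-Llama-3-8B-Instruct",
+        messages=[
+            {"role": "system", "content": "You are a helpful assistant."},
+            {"role": "user", "content": "Who are you?"},
+        ],
+        # vLLM-specific parameters go through extra_body, mirroring the curl example.
+        extra_body={"stop_token_ids": [128009, 128001]},
+    )
+    print(response.choices[0].message.content)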
+
+To enable autoscaling, you can specify additional configs in the ``service`` section:
+
+.. code-block:: yaml
+
+    service:
+      replica_policy:
+        min_replicas: 0
+        max_replicas: 3
+        target_qps_per_replica: 2
+
+This will scale the service up when the QPS per replica exceeds 2, within the bounds set by ``min_replicas`` and ``max_replicas``.
+
+
+**Optional**: Connect a GUI to the endpoint
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+
+It is also possible to access the Llama-3 service with a separate GUI frontend, so that user requests sent to the GUI are load-balanced across replicas.
+
+.. raw:: html
+
+   <details>
+   <summary>Click to see the full GUI YAML</summary>
+
+.. code-block:: yaml
+
+    envs:
+      MODEL_NAME: meta-llama/Meta-Llama-3-70B-Instruct
+      ENDPOINT: x.x.x.x:3031 # Address of the API server running vllm.
+
+    resources:
+      cpus: 2
+
+    setup: |
+      conda activate vllm
+      if [ $? -ne 0 ]; then
+        conda create -n vllm python=3.10 -y
+        conda activate vllm
+      fi
+
+      # Install Gradio for web UI.
+      pip install gradio openai
+
+    run: |
+      conda activate vllm
+      export PATH=$PATH:/sbin
+      WORKER_IP=$(hostname -I | cut -d' ' -f1)
+      CONTROLLER_PORT=21001
+      WORKER_PORT=21002
+
+      echo 'Starting gradio server...'
+      git clone https://github.com/vllm-project/vllm.git || true
+      python vllm/examples/gradio_openai_chatbot_webserver.py \
+        -m $MODEL_NAME \
+        --port 8811 \
+        --model-url http://$ENDPOINT/v1 \
+        --stop-token-ids 128009,128001 | tee ~/gradio.log
+
+.. raw:: html
+
+   </details>
+
+1. Start the chat web UI:
+
+.. code-block:: console
+
+    sky launch -c gui ./gui.yaml --env ENDPOINT=$(sky serve status --endpoint vllm)
+
+
+2. Then, we can access the GUI at the returned gradio link:
+
+.. code-block:: console
+
+    | INFO | stdout | Running on public URL: https://6141e84201ce0bb4ed.gradio.live