diff --git a/docs/source/getting_started/quickstart.rst b/docs/source/getting_started/quickstart.rst
index 1a423b64..0a0f8f23 100644
--- a/docs/source/getting_started/quickstart.rst
+++ b/docs/source/getting_started/quickstart.rst
@@ -11,6 +11,14 @@ This guide shows how to use vLLM to:
 
 Be sure to complete the :ref:`installation instructions <installation>` before continuing with this guide.
 
+.. note::
+
+    By default, vLLM downloads models from `HuggingFace <https://huggingface.co/>`_. If you would like to use models from `ModelScope <https://www.modelscope.cn>`_ in the following examples, please set the environment variable:
+
+    .. code-block:: shell
+
+        export VLLM_USE_MODELSCOPE=True
+
 Offline Batched Inference
 -------------------------
 
@@ -40,16 +48,6 @@ Initialize vLLM's engine for offline inference with the ``LLM`` class and the `O
 
     llm = LLM(model="facebook/opt-125m")
 
-Use model from www.modelscope.cn
-
-.. code-block:: shell
-
-    export VLLM_USE_MODELSCOPE=True
-
-.. code-block:: python
-
-    llm = LLM(model="qwen/Qwen-7B-Chat", revision="v1.1.8", trust_remote_code=True)
-
 Call ``llm.generate`` to generate the outputs. It adds the input prompts to vLLM engine's waiting queue and executes the vLLM engine to generate the outputs with high throughput. The outputs are returned as a list of ``RequestOutput`` objects, which include all the output tokens.
 
 .. code-block:: python
@@ -77,16 +75,6 @@ Start the server:
 
     $ python -m vllm.entrypoints.api_server
 
-Use model from www.modelscope.cn
-
-.. code-block:: console
-
-    $ VLLM_USE_MODELSCOPE=True python -m vllm.entrypoints.api_server \
-    $ --model="qwen/Qwen-7B-Chat" \
-    $ --revision="v1.1.8" \
-    $ --trust-remote-code
-
-
 By default, this command starts the server at ``http://localhost:8000`` with the OPT-125M model.
 
 Query the model in shell:
@@ -116,13 +104,6 @@ Start the server:
     $ python -m vllm.entrypoints.openai.api_server \
     $ --model facebook/opt-125m
 
-Use model from www.modelscope.cn
-
-.. code-block:: console
-
-    $ VLLM_USE_MODELSCOPE=True python -m vllm.entrypoints.openai.api_server \
-    $ --model="qwen/Qwen-7B-Chat" --revision="v1.1.8" --trust-remote-code
-
 By default, the server uses a predefined chat template stored in the tokenizer. You can override this template by using the ``--chat-template`` argument:
 
 .. code-block:: console
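
For reference, a minimal sketch of the consolidated ModelScope flow that the new note describes, reusing the ``qwen/Qwen-7B-Chat`` model, ``v1.1.8`` revision, and ``trust_remote_code`` flag from the per-section examples removed above. Setting ``VLLM_USE_MODELSCOPE`` via ``os.environ`` before vLLM is imported is an assumption made here for a self-contained script; the documented path is the shell ``export`` shown in the note.

.. code-block:: python

    import os

    # Assumption: set the flag before importing vLLM so the ModelScope
    # download path is used (the documented way is `export` in the shell).
    os.environ["VLLM_USE_MODELSCOPE"] = "True"

    from vllm import LLM, SamplingParams

    # Same model, revision, and trust_remote_code flag as the removed examples.
    llm = LLM(model="qwen/Qwen-7B-Chat", revision="v1.1.8", trust_remote_code=True)

    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
    outputs = llm.generate(["Hello, my name is"], sampling_params)
    for output in outputs:
        print(output.outputs[0].text)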