[Doc] Added cerebrium as Integration option (#5553)
This commit is contained in:
parent
8a173382c8
commit
2bd231a7b7
109
docs/source/serving/deploying_with_cerebrium.rst
Normal file
@@ -0,0 +1,109 @@
.. _deploying_with_cerebrium:

Deploying with Cerebrium
========================
.. raw:: html

    <p align="center">
        <img src="https://i.ibb.co/hHcScTT/Screenshot-2024-06-13-at-10-14-54.png" alt="vLLM_plus_cerebrium"/>
    </p>

vLLM can be run on a cloud-based GPU machine with `Cerebrium <https://www.cerebrium.ai/>`__, a serverless AI infrastructure platform that makes it easier for companies to build and deploy AI-based applications.

To install the Cerebrium client, run:

.. code-block:: console

    $ pip install cerebrium
    $ cerebrium login

Next, create your Cerebrium project:

.. code-block:: console

    $ cerebrium init vllm-project

Next, to install the required packages, add the following to your cerebrium.toml:

.. code-block:: toml

    [cerebrium.dependencies.pip]
    vllm = "latest"

Next, let us add our code to handle inference for the LLM of your choice (`mistralai/Mistral-7B-Instruct-v0.1` for this example). Add the following code to your `main.py`:

.. code-block:: python

    from vllm import LLM, SamplingParams

    llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.1")

    def run(prompts: list[str], temperature: float = 0.8, top_p: float = 0.95):
        sampling_params = SamplingParams(temperature=temperature, top_p=top_p)
        outputs = llm.generate(prompts, sampling_params)

        # Collect the prompt and generated text for each output.
        results = []
        for output in outputs:
            prompt = output.prompt
            generated_text = output.outputs[0].text
            results.append({"prompt": prompt, "generated_text": generated_text})

        return {"results": results}
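
Before deploying, you can sanity-check the handler locally. A minimal sketch, assuming `main.py` is importable and you are on a machine with a GPU and vLLM installed:

.. code-block:: python

    # Local smoke test for the handler above (hypothetical; not part of the
    # Cerebrium deployment itself). Loading the model requires a GPU.
    from main import run

    response = run(["The capital of France is"])
    print(response["results"][0]["generated_text"])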

Then, run the following command to deploy it to the cloud:

.. code-block:: console

    $ cerebrium deploy

If the deployment is successful, you will be returned a curl command that you can use to run inference against your endpoint. Just remember to end the URL with the name of the function you are calling (in our case, `/run`):

.. code-block:: console

    curl -X POST https://api.cortex.cerebrium.ai/v4/p-xxxxxx/vllm/run \
       -H 'Content-Type: application/json' \
       -H 'Authorization: <JWT TOKEN>' \
       --data '{
         "prompts": [
           "Hello, my name is",
           "The president of the United States is",
           "The capital of France is",
           "The future of AI is"
         ]
       }'
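
Equivalently, you can call the endpoint from Python. A minimal sketch using the `requests` library; the URL and JWT token below are the placeholders from the curl command above, so substitute the values returned by `cerebrium deploy`:

.. code-block:: python

    import requests

    # Placeholders copied from the curl example above; substitute the real
    # endpoint URL and JWT token returned by `cerebrium deploy`.
    url = "https://api.cortex.cerebrium.ai/v4/p-xxxxxx/vllm/run"
    headers = {
        "Content-Type": "application/json",
        "Authorization": "<JWT TOKEN>",
    }
    payload = {"prompts": ["The capital of France is"]}

    response = requests.post(url, headers=headers, json=payload)
    print(response.json())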

You should get a response like this:

.. code-block:: json

    {
        "run_id": "52911756-3066-9ae8-bcc9-d9129d1bd262",
        "result": {
            "results": [
                {
                    "prompt": "Hello, my name is",
                    "generated_text": " Sarah, and I'm a teacher. I teach elementary school students. One of"
                },
                {
                    "prompt": "The president of the United States is",
                    "generated_text": " elected every four years. This is a democratic system.\n\n5. What"
                },
                {
                    "prompt": "The capital of France is",
                    "generated_text": " Paris.\n"
                },
                {
                    "prompt": "The future of AI is",
                    "generated_text": " bright, but it's important to approach it with a balanced and nuanced perspective."
                }
            ]
        },
        "run_time_ms": 152.53663063049316
    }
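
If you called the endpoint from Python as sketched earlier, you can pull the generated text out of this structure directly (assuming the response shape shown above):

.. code-block:: python

    # Assumes `response` is the requests.Response from the earlier sketch.
    data = response.json()
    for item in data["result"]["results"]:
        print(f"{item['prompt']!r} -> {item['generated_text']!r}")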

You now have an autoscaling endpoint where you only pay for the compute you use!
@@ -8,6 +8,7 @@ Integrations
deploying_with_kserve
deploying_with_triton
deploying_with_bentoml
deploying_with_cerebrium
deploying_with_lws
deploying_with_dstack
serving_with_langchain