Deploying LLMs

Deploying LLMs optimally on your own cloud infrastructure can be complicated for the following reasons:

  1. Choosing the right model server: There are multiple options for serving LLMs, ranging from a plain FastAPI app to optimized model servers such as vLLM and Triton.
  2. GPU provisioning and choosing the right GPU: Provisioning GPUs and figuring out the right GPU for your LLM requires deep infrastructure understanding and benchmarking.
  3. Configuring model caching: LLMs are large, ranging from 5 GB to 150 GB. Downloading such a model on every restart incurs a large networking cost and leads to slower startup. Hence, we need to configure model caching to avoid repeated downloads and make autoscaling faster.
  4. Configuring autoscaling: The most common metric that works for LLM autoscaling is requests per second. We need to set up autoscaling metrics to make sure the service can scale up as traffic increases.
  5. Hosting in multiple regions or clouds: Sometimes GPUs are not available in one cloud provider or one region, so you need to host the LLMs across multiple regions or cloud providers. This also lets you use spot instances effectively and can lower your LLM hosting cost by 70-80%.
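
To make the autoscaling item concrete, here is a minimal sketch of the arithmetic behind scaling on requests per second: the replica count is the observed traffic divided by the load one replica can sustain, rounded up and clamped to configured bounds. This is an illustration only, not TrueFoundry's autoscaler; the function name `desired_replicas` and all of its parameters and numbers are hypothetical.

```python
import math


def desired_replicas(observed_rps: float, target_rps_per_replica: float,
                     min_replicas: int = 1, max_replicas: int = 10) -> int:
    """Return how many replicas are needed to serve observed_rps."""
    if target_rps_per_replica <= 0:
        raise ValueError("target_rps_per_replica must be positive")
    # Round up so that a partially loaded replica still gets provisioned.
    needed = math.ceil(observed_rps / target_rps_per_replica)
    # Never go below the floor or above the ceiling set by the operator.
    return max(min_replicas, min(max_replicas, needed))


if __name__ == "__main__":
    # Hypothetical numbers: 23 requests/s, each replica handles about 5 requests/s.
    print(desired_replicas(23, 5))
```

In practice the per-replica target would come from benchmarking the chosen model on the chosen GPU.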

TrueFoundry simplifies LLM deployment by automatically figuring out the most suitable way to deploy any LLM and configuring the correct set of GPUs. We also enable model caching by default and make it easy to configure autoscaling based on your needs.

To deploy an LLM, you can either choose one of the models in the LLM catalogue or paste the HuggingFace URL of the model. TrueFoundry will try its best to figure out the details from the model's page and come up with the best deployment pattern for it. This should work for most LLMs, but you might have to tweak the configuration in cases where we cannot find all the information on the HuggingFace page.

Pre-requisites

Before you begin, ensure you have the following:

  • Workspace:
    To deploy your LLM, you'll need a workspace. If you don't have one, you can create it using this guide: Create a Workspace

Deploying an LLM

Let's deploy a Llama2-7B LLM through the Model Catalogue.

Sending Requests to your Deployed LLM

You can send requests to each LLM through either the "Completions" endpoint or the "Chat Completions" endpoint.

Note: The Chat Completions endpoint is available for models that have a chat prompt template defined.

You can send requests to your LLM using either the normal or the streaming method.

  • Normal method: Sends the request and receives the entire response at once as a JSON object.
  • Streaming method: Receives the response as a stream of events, allowing you to process the output as it arrives. This can be done with the streaming API directly or with the OpenAI client.

Making requests normally

You can get the code to send your request in any language using the OpenAPI tab. Follow the instructions below.

Request body for the Chat Completions endpoint:

{
  "model": "<YOUR_MODEL_NAME_HERE>",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "Hello!"
    }
  ]
}

Request body for the Completions endpoint:

{
  "model": "<YOUR_MODEL_NAME_HERE>",
  "prompt": "This is a test prompt"
}

You can find MODEL_NAME with a GET request on <YOUR_ENDPOINT_HERE>/v1/models.
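
As a rough sketch, the model name lookup could be scripted as follows. It assumes the endpoint returns an OpenAI-style model list (an object with a `data` array whose entries carry an `id`), which is an assumption rather than something stated on this page. `model_ids` and `list_models` are hypothetical helper names, and the standard library is used to keep the sketch dependency-free.

```python
import json
import urllib.request


def model_ids(payload: dict) -> list[str]:
    """Pull model names out of an OpenAI-style /v1/models response (assumed shape)."""
    return [entry["id"] for entry in payload.get("data", []) if "id" in entry]


def list_models(base_url: str, timeout: float = 30.0) -> list[str]:
    """GET <base_url>/v1/models and return the model names it reports."""
    url = f"{base_url.rstrip('/')}/v1/models"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return model_ids(json.load(resp))
```

Calling `list_models("<YOUR_ENDPOINT_HERE>")` against a live deployment would then return the names you can pass as `model` in the request bodies above.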

For example, here are Python code snippets to send a request to each endpoint.

Chat Completions:

# pip install requests
import requests

URL = "<YOUR_ENDPOINT_HERE>/v1/chat/completions"
# A non-streaming request returns a single JSON object, so ask for JSON.
headers = {"Content-Type": "application/json", "Accept": "application/json"}
payload = {
  "model": "<YOUR_MODEL_NAME_HERE>",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "Hello!"
    }
  ]
}

response = requests.post(URL, json=payload, headers=headers)
response.raise_for_status()  # fail loudly on HTTP errors

print(response.json())

Completions:

# pip install requests
import requests

URL = "<YOUR_ENDPOINT_HERE>/v1/completions"
# A non-streaming request returns a single JSON object, so ask for JSON.
headers = {"Content-Type": "application/json", "Accept": "application/json"}
payload = {
  "model": "<YOUR_MODEL_NAME_HERE>",
  "prompt": "This is a test prompt"
}

response = requests.post(URL, json=payload, headers=headers)
response.raise_for_status()  # fail loudly on HTTP errors

print(response.json())
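
The printed JSON still has to be unpacked to get at the generated text. Here is a minimal sketch, assuming the deployment returns OpenAI-style response bodies (`choices[0].message.content` for Chat Completions, `choices[0].text` for Completions). `extract_text` is a hypothetical helper, not part of any library.

```python
def extract_text(response_json: dict) -> str:
    """Return the generated text from an OpenAI-style response body (assumed shape)."""
    choices = response_json.get("choices") or []
    if not choices:
        raise ValueError("response contains no choices")
    first = choices[0]
    if "message" in first:
        # Chat Completions shape: the text lives in the assistant message.
        return first["message"].get("content") or ""
    # Completions shape: the text is a plain field on the choice.
    return first.get("text", "")
```

With either snippet above, `extract_text(response.json())` would then yield just the model's reply.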

Making requests via streaming client

Streaming requests allow you to receive the generated text from the LLM as it is being produced, without having to wait for the entire response to be generated. This method is useful for applications that require real-time processing of the generated text.


# pip install openai
from openai import OpenAI

client = OpenAI(
    api_key="TEST",  # placeholder value
    base_url="<YOUR_ENDPOINT_HERE>/v1",
)
stream = client.chat.completions.create(
    messages=[
        {"role": "user", "content": "Enter your prompt here"},
    ],
    model="<YOUR_MODEL_NAME_HERE>",
    stream=True,
    max_tokens=500,
)
for chunk in stream:
    # Some chunks carry no text (for example, the final one), so skip them.
    if chunk.choices and chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)
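
If you prefer not to depend on the OpenAI client, the stream can also be read as raw server-sent events. The sketch below assumes the endpoint emits OpenAI-style events: lines of the form `data: {json}`, terminated by `data: [DONE]`. The helper `delta_from_sse_line` is hypothetical, and the commented usage relies on `requests` with `stream=True`.

```python
import json
from typing import Optional


def delta_from_sse_line(line: str) -> Optional[str]:
    """Return the text fragment carried by one SSE line, or None if it has none."""
    line = line.strip()
    if not line.startswith("data:"):
        return None  # blank keep-alive lines, comments, or other SSE fields
    data = line[len("data:"):].strip()
    if data == "[DONE]":
        return None  # end-of-stream sentinel
    choices = json.loads(data).get("choices") or []
    if not choices:
        return None
    return choices[0].get("delta", {}).get("content")


# Usage sketch (needs a live endpoint, so it is left commented out):
# import requests
# body = {"model": "<YOUR_MODEL_NAME_HERE>", "stream": True,
#         "messages": [{"role": "user", "content": "Hello!"}]}
# with requests.post("<YOUR_ENDPOINT_HERE>/v1/chat/completions",
#                    json=body, stream=True) as r:
#     for raw in r.iter_lines():  # yields bytes, one line at a time
#         text = delta_from_sse_line(raw.decode("utf-8"))
#         if text:
#             print(text, end="", flush=True)
```

Keeping the line parsing in its own function makes it easy to check against recorded events without calling the model.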

Additional Configuration

Optionally, in the advanced options, you might want to