Batch Predictions with TrueFoundry LLM Gateway

This guide explains how to perform batch predictions using TrueFoundry's LLM Gateway with different providers.

Prerequisites

  • TrueFoundry API Key
  • Provider account configured in TrueFoundry (OpenAI or Vertex AI)
  • Python environment with openai library installed

Authentication

All API requests require authentication using your TrueFoundry API key and the name of your provider integration. The API key is set in the OpenAI client configuration; the provider integration name is passed as a request header (see Provider-Specific Extra Headers below):

from openai import OpenAI

BASE_URL = "https://internal.devtest.truefoundry.tech/api/llm"
API_KEY = "your-truefoundry-api-key"

# Configure OpenAI client with TrueFoundry settings
client = OpenAI(
    api_key=API_KEY,
    base_url=BASE_URL,
)

Provider-Specific Extra Headers

When making requests, you'll need to specify provider-specific headers based on which LLM provider you're using:

OpenAI Provider Headers

extra_headers = {
    "x-tfy-provider-name": "openai-provider-name"  # name of tfy provider integration
}

Vertex AI Provider Headers

extra_headers = {
    "x-tfy-provider-name": "google-provider-name",  # name of tfy provider integration
    "x-tfy-vertex-storage-bucket-name": "your-bucket-name",
    "x-tfy-vertex-region": "your-region",  # e.g., "europe-west4"
    "x-tfy-provider-model": "gemini-2-0-flash"  # or any other supported model
}
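
If your application switches between providers, it can help to keep this header logic in one place. Below is a minimal sketch of such a helper; the integration names, bucket, region, and model are the same placeholders used above and should be replaced with your own values:

def build_extra_headers(provider: str) -> dict:
    """Return the extra headers expected by the TrueFoundry LLM Gateway."""
    if provider == "openai":
        return {"x-tfy-provider-name": "openai-provider-name"}
    if provider == "vertex":
        return {
            "x-tfy-provider-name": "google-provider-name",
            "x-tfy-vertex-storage-bucket-name": "your-bucket-name",
            "x-tfy-vertex-region": "europe-west4",
            "x-tfy-provider-model": "gemini-2-0-flash",
        }
    raise ValueError(f"Unknown provider: {provider}")

extra_headers = build_extra_headers("openai")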

Input File Format

The batch prediction system requires input files in JSONL (JSON Lines) format. Each line in the file must be a valid JSON object representing a single request. The file should not contain any empty lines or comments.

JSONL Format Requirements

  1. Each line must be a valid JSON object
  2. No empty lines between JSON objects
  3. No trailing commas
  4. No comments
  5. Each line must end with a newline character

Request Format

Example of a valid JSONL file (request.jsonl):

{"custom_id": "request-1", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "gpt-3.5-turbo-0125", "messages": [{"role": "system", "content": "You are a helpful assistant."},{"role": "user", "content": "Hello world!"}],"max_tokens": 1000}}
{"custom_id": "request-2", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "gpt-3.5-turbo-0125", "messages": [{"role": "system", "content": "You are an unhelpful assistant."},{"role": "user", "content": "Hello world!"}],"max_tokens": 1000}}
{"custom_id": "request-3", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "gpt-4-vision-preview", "messages": [{"role": "user", "content": [{"type": "text", "text": "What's in this image?"}, {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}]}], "max_tokens": 1000}}

When using Vertex AI, you can omit the method, url, and body.model fields, since they are not used.
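
Rather than editing the file by hand, you can generate it with json.dumps, which guarantees that every line is a valid JSON object with no trailing commas or comments, satisfying the format requirements above. A minimal sketch (the prompts and model name are illustrative):

import json

prompts = ["Hello world!", "Summarize the plot of Hamlet."]

with open("request.jsonl", "w") as f:
    for i, prompt in enumerate(prompts, start=1):
        request = {
            "custom_id": f"request-{i}",  # used later to match results to requests
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-3.5-turbo-0125",
                "messages": [
                    {"role": "system", "content": "You are a helpful assistant."},
                    {"role": "user", "content": prompt},
                ],
                "max_tokens": 1000,
            },
        }
        f.write(json.dumps(request) + "\n")  # one JSON object per line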

Workflow

The batch prediction process involves four main steps:

  1. Upload input file
  2. Create batch job
  3. Check batch status
  4. Fetch results

1. Upload Input File

Upload your JSONL file using the OpenAI client:

# Upload the input file
file = client.files.create(
    file=open("request.jsonl", "rb"),
    purpose="batch",
    extra_headers=extra_headers
)

# The response will contain the file ID needed for creating the batch job
print(file.id)  # Example: file-PnFGrFLN5LjjcWr4eFsStK

2. Create Batch Job

Create a batch job using the file ID from the upload step:

batch_job = client.batches.create(
    input_file_id=file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
    extra_headers=extra_headers
)

# The response includes a batch ID for tracking
print(batch_job.id)  # Example: batch_67f7bfc50b288190893f242d9fa47c52

3. Check Batch Status

Monitor the batch job status:

batch_status = client.batches.retrieve(
    batch_job.id,
    extra_headers=extra_headers
)

print(batch_status.status)  # Example: completed, validating, in_progress, etc.

The status can be one of the following (among others):

  • validating: Initial validation of the batch input
  • in_progress: Requests are being processed
  • completed: The batch has finished processing and results are available
  • failed: Batch processing failed

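Because a batch usually takes some time to move from validating to completed, you will typically poll the status at a fixed interval rather than checking it once. A minimal polling sketch (the 30-second interval is arbitrary; pick one that suits your workload):

import time

terminal_states = {"completed", "failed"}

while True:
    batch_status = client.batches.retrieve(
        batch_job.id,
        extra_headers=extra_headers
    )
    print(f"Batch {batch_job.id}: {batch_status.status}")
    if batch_status.status in terminal_states:
        break
    time.sleep(30)  # wait before polling again
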
4. Fetch Results

Once the batch is completed, fetch the results:

if batch_status.status == "completed":
    output_content = client.files.content(
        batch_status.output_file_id,
        extra_headers=extra_headers
    )
    print(output_content.content)
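
The output is itself a JSONL file, one result per line. For the OpenAI provider, each line typically echoes the custom_id of the original request along with the response body, so you can map results back to the requests you submitted. A minimal sketch under that assumption:

import json

results = {}
for line in output_content.content.decode("utf-8").splitlines():
    if not line.strip():
        continue
    result = json.loads(line)
    # Each result carries the custom_id from the corresponding request
    results[result["custom_id"]] = result

print(results.get("request-1"))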

Complete Example

Here's a complete example that puts it all together:

from openai import OpenAI

# Initialize client
client = OpenAI(
    api_key="your-api-key",
    base_url="https://internal.devtest.truefoundry.tech/api/llm",
)

extra_headers = {"x-tfy-provider-name": "openai-main"}

# 1. Upload file
file = client.files.create(
    file=open("request.jsonl", "rb"),
    purpose="batch",
    extra_headers=extra_headers
)

# 2. Create batch job
batch_job = client.batches.create(
    input_file_id=file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
    extra_headers=extra_headers
)

# 3. Check status (in practice, poll until the batch reaches a terminal state)
batch_status = client.batches.retrieve(
    batch_job.id,
    extra_headers=extra_headers
)

# 4. Fetch results when completed
if batch_status.status == "completed":
    output_content = client.files.content(
        batch_status.output_file_id,
        extra_headers=extra_headers
    )
    print(output_content.content)

Best Practices

  1. Use meaningful custom_id values in your JSONL requests to track individual requests
  2. Implement proper error handling around API calls (see the sketch after this list)
  3. Monitor batch status regularly with appropriate polling intervals
  4. For Vertex AI:
    • Ensure proper bucket permissions
    • Use appropriate region settings
    • Handle URL encoding for file IDs
  5. For OpenAI:
    • Follow OpenAI's rate limits
    • Use appropriate model parameters
  6. Store API keys securely and never hardcode them in your application
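
For example, instead of hardcoding the API key you can read it from an environment variable and wrap gateway calls in basic error handling. A minimal sketch; the TFY_API_KEY variable name and the openai-main integration name are only conventions for this example:

import os
from openai import OpenAI, APIError

api_key = os.environ["TFY_API_KEY"]  # fail fast if the key is not set

client = OpenAI(
    api_key=api_key,
    base_url="https://internal.devtest.truefoundry.tech/api/llm",
)

extra_headers = {"x-tfy-provider-name": "openai-main"}

try:
    file = client.files.create(
        file=open("request.jsonl", "rb"),
        purpose="batch",
        extra_headers=extra_headers,
    )
except APIError as e:
    # Log and handle gateway/provider errors instead of crashing
    print(f"Upload failed: {e}")
    raise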