TrueFoundry AI Gateway provides a universal API for all supported models through the standard OpenAI /chat/completions endpoint, so you can work with models from different providers via a single, consistent interface.

Contents

Section | Description
Getting Started | Basic setup and configuration
Input Controls | System prompts and request parameters
Working with Media | Images, audio, and video support
Function Calling | Enabling models to invoke functions
Response Format | Structured JSON outputs
Prompt Caching | Optimize API usage with caching
Reasoning Models | Access model reasoning processes

Getting Started

You can use the standard OpenAI client to send requests to the gateway:
from openai import OpenAI

client = OpenAI(
    api_key="your_truefoundry_api_key",
    base_url="<truefoundry-base-url>/api/llm/api/inference/openai" # e.g. https://my-company.truefoundry.cloud/api/llm/api/inference/openai
)

response = client.chat.completions.create(
    model="openai-main/gpt-4o-mini", # this is the truefoundry model id
    messages=[{"role": "user", "content": "Hello, how are you?"}]
)

print(response.choices[0].message.content)

Configuration

You will need to configure the following:
  1. base_url: Your TrueFoundry control plane URL followed by /api/llm/api/inference/openai (as shown in the example above)
  2. api_key: API key generated from Personal Access Tokens
  3. model: TrueFoundry model ID in the format provider_account/model_name (available in the LLM playground UI)

Input Controls

System Prompts

System prompts set the behavior and context for the model by defining the assistant’s role, tone, and constraints:
response = client.chat.completions.create(
    model="openai-main/gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a helpful assistant that specializes in Python programming."},
        {"role": "user", "content": "How do I write a function to calculate factorial?"}
    ]
)

Request Parameters

Fine-tune model behavior with these common parameters:
response = client.chat.completions.create(
    model="openai-main/gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello, how are you?"}],
    temperature=0.7,       # Controls randomness (0.0 to 1.0)
    max_tokens=100,        # Maximum tokens to generate
    top_p=0.9,             # Nucleus sampling parameter
    frequency_penalty=0.0, # Reduces repetition
    presence_penalty=0.0,  # Encourages new topics
    stop=["\n", "Human:"]  # Stop sequences
)
Some models don’t support all parameters; for example, temperature is not supported by o-series models such as o3-mini.

Working with Media

The API supports various media types including images, audio, and video.
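For example, an image can be passed as an image_url content part in a user message. A minimal sketch, assuming the selected model (here openai-main/gpt-4o-mini) accepts image inputs; the image URL is a hypothetical placeholder:
response = client.chat.completions.create(
    model="openai-main/gpt-4o-mini",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe what is in this image."},
            # Hypothetical placeholder URL; use any publicly reachable image
            {"type": "image_url", "image_url": {"url": "https://example.com/sample.jpg"}}
        ]
    }]
)

print(response.choices[0].message.content)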

Function Calling

Function calling allows models to invoke defined functions during conversations, enabling them to perform specific actions or retrieve external information.

Basic Usage

Define functions that the model can call:
from openai import OpenAI
import json

client = OpenAI(
    api_key="your_truefoundry_api_key",
    base_url="<truefoundry-base-url>/api/llm/api/inference/openai"
)

# Define a function
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "The city and state, e.g. San Francisco, CA"
                },
                "unit": {
                    "type": "string",
                    "enum": ["celsius", "fahrenheit"]
                }
            },
            "required": ["location"]
        }
    }
}]

# Make the request
response = client.chat.completions.create(
    model="openai-main/gpt-4o-mini",
    messages=[{"role": "user", "content": "What's the weather in New York?"}],
    tools=tools
)

# Check if the model wants to call a function
if response.choices[0].message.tool_calls:
    tool_call = response.choices[0].message.tool_calls[0]
    function_name = tool_call.function.name
    function_args = json.loads(tool_call.function.arguments)

    print(f"Function called: {function_name}")
    print(f"Arguments: {function_args}")

Function Definition Reference

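Each entry in tools uses the standard OpenAI function-definition format shown in the example above:
Field | Description
type | Always "function"
function.name | The name the model uses when it calls the function
function.description | Explains what the function does, helping the model decide when to call it
function.parameters | A JSON Schema object describing the accepted arguments; list mandatory arguments under required
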
Implementation Workflows

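A typical round trip is sketched below: execute the requested function locally (get_weather here is a hypothetical stub), append its result as a tool message, and send a follow-up request so the model can compose the final answer. This reuses the client and tools defined above:
# Hypothetical local implementation of the function defined above
def get_weather(location, unit="celsius"):
    return {"location": location, "temperature": 22, "unit": unit}

messages = [{"role": "user", "content": "What's the weather in New York?"}]

response = client.chat.completions.create(
    model="openai-main/gpt-4o-mini",
    messages=messages,
    tools=tools
)

message = response.choices[0].message
if message.tool_calls:
    tool_call = message.tool_calls[0]
    args = json.loads(tool_call.function.arguments)

    # Run the function locally and append its result as a tool message
    result = get_weather(**args)
    messages.append(message)
    messages.append({
        "role": "tool",
        "tool_call_id": tool_call.id,
        "content": json.dumps(result)
    })

    # Second request: the model now answers using the tool result
    final = client.chat.completions.create(
        model="openai-main/gpt-4o-mini",
        messages=messages,
        tools=tools
    )
    print(final.choices[0].message.content)
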
Response Format

The chat completions API supports structured response formats, enabling you to receive consistent, predictable outputs in JSON format. This is useful for parsing responses programmatically.

JSON Response Options

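A minimal sketch of JSON mode, assuming the selected model supports the OpenAI response_format parameter. With "type": "json_object", the prompt itself must also ask for JSON:
response = client.chat.completions.create(
    model="openai-main/gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a helpful assistant. Always respond with a valid JSON object."},
        {"role": "user", "content": "List three primary colors with a one-line description of each."}
    ],
    response_format={"type": "json_object"}
)

print(response.choices[0].message.content)  # a JSON string; parse it with json.loads()
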
Advanced Schema Integration

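For stricter guarantees you can supply a JSON Schema. A sketch assuming the underlying model supports OpenAI structured outputs; the person schema below is a hypothetical example:
response = client.chat.completions.create(
    model="openai-main/gpt-4o-mini",
    messages=[{"role": "user", "content": "Extract the name and age from: 'Alice is 30 years old.'"}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "person",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "age": {"type": "integer"}
                },
                "required": ["name", "age"],
                "additionalProperties": False
            }
        }
    }
)

print(response.choices[0].message.content)  # JSON matching the schema
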
Prompt Caching

Prompt caching optimizes API usage by allowing resumption from specific prefixes in your prompts. This significantly reduces processing time and costs for repetitive tasks or prompts with consistent elements.
Currently, only Anthropic models support this caching feature.

Basic Usage

Add the cache_control parameter to any message content you want to cache:
import requests
import json

URL = "https://{controlPlaneUrl}/api/llm/chat/completions"
API_KEY = "TFY_API_KEY"

headers = {
    "Authorization": f"Bearer {API_KEY}",
    "X-TFY-METADATA": '{"tfy_log_request":"true"}'
}

payload = {
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "<TEXT_TO_CACHE>",
                    "cache_control": {
                        "type": "ephemeral"
                    }
                }
            ]
        }
    ],
    "model": "anthropic-main/claude-3-opus",
    "stream": True
}

response = requests.post(URL, headers=headers, json=payload)

Minimum Cacheable Length

Model | Minimum Token Length
Claude Opus 4, Claude Sonnet 4, Claude Sonnet 3.7, Claude Sonnet 3.5, Claude Opus 3 | 1024 tokens
Claude Haiku 3.5, Claude Haiku 3 | 2048 tokens
This feature is only available through direct REST API calls. The OpenAI SDK doesn’t recognize the cache_control field.

Reasoning Models

TrueFoundry AI Gateway provides access to model reasoning processes through thinking/reasoning tokens, currently available for Claude 3.7 Sonnet (via Anthropic, AWS Bedrock, and Google Vertex AI). These models expose their internal reasoning process, allowing you to see how they arrive at conclusions. The thinking/reasoning tokens provide step-by-step insights into the model’s cognitive process.

Enabling Reasoning Tokens

To enable thinking/reasoning tokens, your request must include:
  1. The header: X-TFY-STRICT-OPENAI: false
  2. A thinking field in the request body
import requests
import json

url = "https://{controlPlaneUrl}/api/llm/chat/completions"
API_KEY = "TFY_API_KEY"

headers = {
    "Authorization": f"Bearer {API_KEY}",
    "X-TFY-STRICT-OPENAI": "false"
}

payload = {
    "messages": [
        {"role": "user", "content": "How to compute 3^3^3?"}
    ],
    "model": "anthropic/claude-3-7",
    "thinking": {
        "type": "enabled",
        "budget_tokens": 16000
    },
    "max_tokens": 18000
}

response = requests.post(url, headers=headers, json=payload)
When the X-TFY-STRICT-OPENAI header is set to false, the response is no longer OpenAI-compliant, because it includes additional reasoning content that the OpenAI response schema does not define.

Response Format

When reasoning tokens are enabled, the response includes both thinking and content sections:
{
  "id": "1742890579083",
  "object": "chat.completion",
  "created": 1742890579,
  "model": "",
  "provider": "aws",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": [
          {
            "type": "thinking",
            "thinking": "The user has asked a complex question about quantum mechanics. To provide a useful answer, I should first break down the core concepts and then explain them in simple terms before diving into advanced details."
          },
          {
            "type": "text",
            "text": "Quantum mechanics is a branch of physics that explains how particles behave at very small scales. Unlike classical physics, where objects have definite positions and velocities, quantum particles exist in a superposition of states until measured. Would you like a more detailed explanation or examples?"
          }
        ]
      },
      "finish_reason": "end_turn"
    }
  ],
  "usage": {
    "prompt_tokens": 45,
    "completion_tokens": 180,
    "total_tokens": 225
  }
}

Streaming with Reasoning Tokens

For streaming responses, the thinking section is always sent before the content section.
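A sketch of consuming the stream with requests, reusing the url, headers, and payload from the example above with streaming enabled, and assuming the gateway emits OpenAI-style server-sent events. The exact delta layout of each chunk can vary by provider, so this sketch simply parses and prints each event:
payload["stream"] = True  # enable streaming on the reasoning request above

with requests.post(url, headers=headers, json=payload, stream=True) as response:
    for line in response.iter_lines():
        if not line:
            continue
        decoded = line.decode("utf-8")
        if decoded.startswith("data: "):
            data = decoded[len("data: "):]
            if data == "[DONE]":
                break
            chunk = json.loads(data)
            # thinking deltas arrive before the text content deltas
            print(chunk)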

API Reference

For detailed API specifications, parameters, and response schemas, see the Chat Completions API Reference.