TrueFoundry AI Gateway provides a universal API for all supported models via the standard OpenAI /chat/completions endpoint. This unified interface allows you to seamlessly work with models from different providers through a consistent API.

Contents

Section | Description
Getting Started | Basic setup and configuration
Input Controls | System prompts and request parameters
Working with Media | Images, audio, and video support
Function Calling | Enabling models to invoke functions
Response Format | Structured JSON outputs
Prompt Caching | Optimize API usage with caching
Reasoning Models | Access model reasoning processes

Getting Started

You can use the standard OpenAI client to send requests to the gateway:
from openai import OpenAI

client = OpenAI(
    api_key="your_truefoundry_api_key",
    base_url="<truefoundry-base-url>/api/llm/api/inference/openai" # e.g. https://my-company.truefoundry.cloud/api/llm/api/inference/openai
)

response = client.chat.completions.create(
    model="openai-main/gpt-4o-mini", # this is the truefoundry model id
    messages=[{"role": "user", "content": "Hello, how are you?"}]
)

print(response.choices[0].message.content)

Configuration

You will need to configure the following:
  1. base_url: Your TrueFoundry control plane URL with the /api/llm/api/inference/openai path appended, as in the example above (see the environment-variable sketch below)
  2. api_key: An API key generated from Personal Access Tokens
  3. model: The TrueFoundry model ID in the format provider_account/model_name (available in the LLM playground UI)
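If you prefer not to hard-code credentials, here is a minimal sketch that reads both values from environment variables (the variable names TFY_API_KEY and TFY_BASE_URL are illustrative, not a TrueFoundry convention):
import os

from openai import OpenAI

# Hypothetical environment variables -- use whatever names fit your deployment
client = OpenAI(
    api_key=os.environ["TFY_API_KEY"],
    base_url=os.environ["TFY_BASE_URL"]  # e.g. https://my-company.truefoundry.cloud/api/llm/api/inference/openai
)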

Input Controls

System Prompts

System prompts set the behavior and context for the model by defining the assistant’s role, tone, and constraints:
response = client.chat.completions.create(
    model="openai-main/gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a helpful assistant that specializes in Python programming."},
        {"role": "user", "content": "How do I write a function to calculate factorial?"}
    ]
)

Request Parameters

Fine-tune model behavior with these common parameters:
response = client.chat.completions.create(
    model="openai-main/gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello, how are you?"}],
    temperature=0.7,       # Controls randomness (0.0 to 2.0; lower is more deterministic)
    max_tokens=100,        # Maximum tokens to generate
    verbosity="high",      # Controls output verbosity: low, medium, high (newer models such as GPT-5 only)
    top_p=0.9,             # Nucleus sampling parameter
    frequency_penalty=0.0, # Reduces repetition
    presence_penalty=0.0,  # Encourages new topics
    stop=["\n", "Human:"]  # Stop sequences
)
Some models don’t support all parameters. For example, temperature is not supported by o series models like o3-mini.
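If you are unsure whether a model accepts a given parameter, one option is to catch the provider error and retry without the optional sampling parameters. A minimal sketch using the OpenAI SDK's BadRequestError (the retry helper is illustrative, not a gateway feature):
import openai

def safe_completion(client, **kwargs):
    """Retry once without optional sampling parameters if the model rejects them."""
    try:
        return client.chat.completions.create(**kwargs)
    except openai.BadRequestError:
        # Reasoning models such as o3-mini may reject these parameters
        for param in ("temperature", "top_p", "frequency_penalty", "presence_penalty"):
            kwargs.pop(param, None)
        return client.chat.completions.create(**kwargs)

response = safe_completion(
    client,
    model="openai-main/gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello, how are you?"}],
    temperature=0.7
)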

Working with Media

The API supports various media types, including images, audio, video, and PDF documents.
Supported Models: GPT-4o, GPT-4 Vision, Claude 3, Gemini Pro Vision
Send images as part of your chat completion requests using either URLs or base64 encoding:

Using Image URLs

from openai import OpenAI

client = OpenAI(
    api_key="your_truefoundry_api_key",
    base_url="<truefoundry-base-url>/api/llm"
)

response = client.chat.completions.create(
    model="openai-main/gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/image.jpg"
                    }
                }
            ]
        }
    ]
)

Using Base64 Encoded Images

import base64

def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

response = client.chat.completions.create(
    model="openai-main/gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{encode_image('image.jpeg')}"
                    }
                }
            ]
        }
    ]
)
Supported Models: Google Gemini models (Gemini 2.0 Flash, etc.)
Send audio files in supported formats (MP3, WAV, etc.). Audio input is currently supported for Google Gemini models:

Using Audio URLs

response = client.chat.completions.create(
    model="internal-google/gemini-2-0-flash",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Transcribe this audio"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/audio.wav",
                        "mime_type": "audio/wav" # required for gemini models
                    }
                }
            ]
        }
    ]
)

Using Base64 Encoded Audio

import base64

def encode_audio(audio_path):
    with open(audio_path, "rb") as audio_file:
        return base64.b64encode(audio_file.read()).decode('utf-8')

response = client.chat.completions.create(
    model="internal-google/gemini-2-0-flash",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Transcribe this audio"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:audio/wav;base64,{encode_audio('audio.wav')}"
                    }
                }
            ]
        }
    ]
)
Supported Models: Google Gemini models (Gemini 2.0 Flash, etc.)
Video processing is natively supported for Google Gemini models:

Using Video URLs

response = client.chat.completions.create(
    model="internal-google/gemini-2-0-flash",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe what's happening in this video"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://www.youtube.com/watch?v=example",
                        "mime_type": "video/mp4" # required for gemini models
                    }
                }
            ]
        }
    ]
)

Using Base64 Encoded Video

import base64

def encode_video(video_path):
    with open(video_path, "rb") as video_file:
        return base64.b64encode(video_file.read()).decode('utf-8')

response = client.chat.completions.create(
    model="internal-google/gemini-2-0-flash",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe what's happening in this video"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:video/mp4;base64,{encode_video('video.mp4')}",
                        "mime_type": "video/mp4" # required for gemini models
                    }
                }
            ]
        }
    ]
)
Supported Models: Google Gemini models (Gemini 2.5 Flash, etc.)
PDF document processing allows models to analyze and extract information from PDF files:

Using Base64 Encoded PDF

from openai import OpenAI

client = OpenAI(
    api_key="your_truefoundry_api_key",
    base_url="<truefoundry-base-url>/api/llm"
)

import base64

def encode_pdf(pdf_path):
    with open(pdf_path, "rb") as pdf_file:
        return base64.b64encode(pdf_file.read()).decode('utf-8')

response = client.chat.completions.create(
    model="internal-google/gemini-2-5-flash",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this pdf?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:application/pdf;base64,{encode_image('sample.pdf')}"
                    }
                }
            ]
        }
    ]
)
print(response.choices[0].message.content)

Using PDF URLs

response = client.chat.completions.create(
    model="internal-google/gemini-2-5-flash",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Summarize this PDF document"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/document.pdf",
                        "mime_type": "application/pdf"
                    }
                }
            ]
        }
    ]
)
print(response.choices[0].message.content)

Vision

TrueFoundry supports vision models from all integrated providers as they become available. These models can analyze and interpret images alongside text, enabling multimodal AI applications.
Provider | Models
OpenAI | gpt-4-vision-preview, gpt-4o, gpt-4o-mini
Anthropic | claude-3-sonnet, claude-3-haiku, claude-3-opus, claude-3.5-sonnet, claude-3.5-haiku, claude-4-opus, claude-4-sonnet, claude-3-7-sonnet
Gemini | gemini-1.0-pro-vision, gemini-1.5-flash, gemini-1.5-flash-8b, gemini-1.5-pro, gemini-2.5-pro, gemini-2.5-flash
AWS Bedrock | anthropic.claude-3-5-sonnet, anthropic.claude-3-5-haiku, anthropic.claude-3-5-sonnet-20240620-v1:0
Azure OpenAI | gpt-4-vision-preview, gpt-4o, gpt-4o-mini

Using Vision Models with OpenAI SDK

from openai import OpenAI

client = OpenAI(
    api_key="your_truefoundry_api_key",
    base_url="http://{controlPlaneURL}/api/llm"
)

response = client.chat.completions.create(
    model="openai-main/gpt-4o-mini",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
                    }
                }
            ]
        }
    ]
)

print(response.choices[0].message)

Function Calling

Function calling allows models to invoke defined functions during conversations, enabling them to perform specific actions or retrieve external information.

Basic Usage

Define functions that the model can call:
from openai import OpenAI
import json

client = OpenAI(
    api_key="your_truefoundry_api_key",
    base_url="<truefoundry-base-url>/api/llm/api/inference/openai"
)

# Define a function
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "The city and state, e.g. San Francisco, CA"
                },
                "unit": {
                    "type": "string",
                    "enum": ["celsius", "fahrenheit"]
                }
            },
            "required": ["location"]
        }
    }
}]

# Make the request
response = client.chat.completions.create(
    model="openai-main/gpt-4o-mini",
    messages=[{"role": "user", "content": "What's the weather in New York?"}],
    tools=tools
)

# Check if the model wants to call a function
if response.choices[0].message.tool_calls:
    tool_call = response.choices[0].message.tool_calls[0]
    function_name = tool_call.function.name
    function_args = json.loads(tool_call.function.arguments)

    print(f"Function called: {function_name}")
    print(f"Arguments: {function_args}")

Function Definition Reference

When defining functions, you need to provide:
  • name: The function name
  • description: What the function does
  • parameters: JSON Schema object describing the parameters
function_schema = {
    "name": "get_weather",
    "description": "Get current weather information",
    "parameters": {
        "type": "object",
        "properties": {
            "location": {
                "type": "string",
                "description": "City name"
            }
        },
        "required": ["location"]
    }
}
Functions support various parameter types:
function_schema = {
    "name": "process_data",
    "description": "Process data with various parameters",
    "parameters": {
        "type": "object",
        "properties": {
            "text": {
                "type": "string",
                "description": "Text to process"
            },
            "count": {
                "type": "integer",
                "description": "Number of items"
            },
            "confidence": {
                "type": "number",
                "description": "Threshold (0.0 to 1.0)"
            },
            "enabled": {
                "type": "boolean",
                "description": "Whether processing is enabled"
            },
            "categories": {
                "type": "array",
                "items": {"type": "string"},
                "description": "List of categories"
            }
        },
        "required": ["text"]
    }
}

Implementation Workflows

Define multiple functions for the model to choose from:
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather information",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
                },
                "required": ["location"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "search_web",
            "description": "Search the web for information",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string"},
                    "max_results": {"type": "integer", "default": 5}
                },
                "required": ["query"]
            }
        }
    }
]
Process function calls and continue the conversation:
# Initial request
messages = [{"role": "user", "content": "What's the weather in Tokyo?"}]
response = client.chat.completions.create(
    model="openai-main/gpt-4o-mini",
    messages=messages,
    tools=tools
)

# Handle function call
if response.choices[0].message.tool_calls:
    messages.append(response.choices[0].message)

    for tool_call in response.choices[0].message.tool_calls:
        function_name = tool_call.function.name
        function_args = json.loads(tool_call.function.arguments)

        # Execute your function (simulated here)
        if function_name == "get_weather":
            result = f"The weather in {function_args['location']} is 22°C and sunny"
        else:
            result = f"No handler implemented for {function_name}"

        # Add the function result to the conversation
        messages.append({
            "role": "tool",
            "tool_call_id": tool_call.id,
            "content": result
        })

    # Continue the conversation
    final_response = client.chat.completions.create(
        model="openai-main/gpt-4o-mini",
        messages=messages
    )

    print(final_response.choices[0].message.content)
Control when and how functions are called:
# Force a specific function call
response = client.chat.completions.create(
    model="openai-main/gpt-4o-mini",
    messages=[{"role": "user", "content": "What's the weather?"}],
    tools=tools,
    tool_choice={"type": "function", "function": {"name": "get_weather"}}
)

# Allow automatic function calling (default)
response = client.chat.completions.create(
    model="openai-main/gpt-4o-mini",
    messages=[{"role": "user", "content": "What's the weather?"}],
    tools=tools,
    tool_choice="auto"
)

# Prevent function calling
response = client.chat.completions.create(
    model="openai-main/gpt-4o-mini",
    messages=[{"role": "user", "content": "What's the weather?"}],
    tools=tools,
    tool_choice="none"
)

# Force any function call
response = client.chat.completions.create(
    model="openai-main/gpt-4o-mini",
    messages=[{"role": "user", "content": "What's the weather?"}],
    tools=tools,
    tool_choice="required"
)

Response Format

The chat completions API supports structured response formats, enabling you to receive consistent, predictable outputs in JSON format. This is useful for parsing responses programmatically.

JSON Response Options

JSON mode ensures the model’s output is valid JSON without enforcing a specific structure:
from openai import OpenAI

client = OpenAI(
    api_key="your_truefoundry_api_key",
    base_url="<truefoundry-base-url>/api/llm/api/inference/openai"
)

response = client.chat.completions.create(
    model="openai-main/gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant designed to output JSON."},
        {"role": "user", "content": "Extract information about the 2020 World Series winner"}
    ],
    response_format={"type": "json_object"}
)

print(response.choices[0].message.content)
Output:
{
  "team": "Los Angeles Dodgers",
  "year": 2020,
  "opponent": "Tampa Bay Rays",
  "games_played": 6,
  "series_result": "4-2"
}
JSON Schema mode provides strict structure validation using predefined schemas:
from openai import OpenAI
import json

client = OpenAI(
    api_key="your_truefoundry_api_key",
    base_url="<truefoundry-base-url>/api/llm/api/inference/openai"
)

# Define JSON schema
user_info_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer", "minimum": 0},
        "occupation": {"type": "string"},
        "location": {"type": "string"},
        "skills": {
            "type": "array",
            "items": {"type": "string"}
        }
    },
    "required": ["name", "age", "occupation", "location", "skills"],
    "additionalProperties": False
}

response = client.chat.completions.create(
    model="openai-main/gpt-4o",
    messages=[
        {
            "role": "system",
            "content": "Extract user information and respond according to the provided JSON schema."
        },
        {
            "role": "user",
            "content": "My name is Sarah Johnson, I'm 28 years old, and I work as a data scientist in New York. I'm skilled in Python, SQL, and machine learning."
        }
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "user_info",
            "schema": user_info_schema,
            "strict": True
        }
    }
)

# Parse response
result = json.loads(response.choices[0].message.content)
When using JSON Schema with strict mode set to true, every property defined in the schema must be listed in the required array. If any property is defined but not marked as required, the API returns a 400 Bad Request error.
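If you need a field to be effectively optional under strict mode, a common pattern (not specific to TrueFoundry) is to keep the property in the required array but allow null as a value:
# Sketch: "location" stays in "required" but the model may return null for it
user_info_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "location": {"type": ["string", "null"]}  # nullable instead of optional
    },
    "required": ["name", "location"],
    "additionalProperties": False
}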

Advanced Schema Integration

Pydantic provides automatic validation, serialization, and type hints for structured data:
from openai import OpenAI
from pydantic import BaseModel, Field
from typing import List

client = OpenAI(
    api_key="your_truefoundry_api_key",
    base_url="<truefoundry-base-url>/api/llm/api/inference/openai"
)

# Define Pydantic model
class UserInfo(BaseModel):
    name: str = Field(description="Full name of the user")
    age: int = Field(ge=0, description="Age in years")
    occupation: str = Field(description="Job title or profession")
    location: str = Field(description="City or location")
    skills: List[str] = Field(description="List of professional skills")

    class Config:
        extra = "forbid"  # Prevent additional fields

response = client.chat.completions.create(
    model="openai-main/gpt-4o",
    messages=[
        {
            "role": "system",
            "content": "Extract user information and respond according to the provided schema."
        },
        {
            "role": "user",
            "content": "Hi, I'm Mike Chen, a 32-year-old software architect from Seattle. I specialize in cloud computing, microservices, and Kubernetes."
        }
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "user_info",
            "schema": UserInfo.model_json_schema(),
            "strict": True
        }
    }
)

# Parse and validate with Pydantic
user_data = UserInfo.model_validate_json(response.choices[0].message.content)
When using OpenAI models with Pydantic models and strict mode set to true, the model should not contain optional fields, because any field with a default value is omitted from the "required" section of the generated JSON schema. If a field may legitimately be absent, declare it as required but nullable, as sketched below.
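A minimal sketch of a nullable-but-required field, assuming Pydantic v2's model_json_schema output:
from typing import Optional

from pydantic import BaseModel, Field

class UserInfo(BaseModel):
    name: str = Field(description="Full name of the user")
    # No default value, so the field stays in "required"; the model may still return null
    location: Optional[str] = Field(..., description="City or location, or null if unknown")

    class Config:
        extra = "forbid"  # Prevent additional fields

# "location" appears in the schema's "required" list and allows null via anyOf
print(UserInfo.model_json_schema())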
The beta parse client provides the most streamlined approach for Pydantic integration:
from openai import OpenAI
from pydantic import BaseModel, Field
from typing import List, Optional

class UserInfo(BaseModel):
    name: str = Field(description="Full name of the user")
    age: int = Field(ge=0, description="Age in years")
    occupation: str = Field(description="Job title or profession")
    location: Optional[str] = Field(None, description="City or location")
    skills: List[str] = Field(default=[], description="List of professional skills")

client = OpenAI(
    api_key="your_truefoundry_api_key",
    base_url="<truefoundry-base-url>/api/llm/api/inference/openai"
)

completion = client.beta.chat.completions.parse(
    model="openai-main/gpt-4o",
    messages=[
        {
            "role": "system",
            "content": "Extract user information from the provided text."
        },
        {
            "role": "user",
            "content": "Hello, I'm Alex Rodriguez, a 29-year-old product manager from Austin. I have experience in agile methodologies, data analysis, and team leadership."
        }
    ],
    response_format=UserInfo,
)

user_result = completion.choices[0].message.parsed
This approach allows for optional fields in your Pydantic model and provides a cleaner API for structured responses.

Prompt Caching

Prompt caching optimizes API usage by allowing resumption from specific prefixes in your prompts. This significantly reduces processing time and costs for repetitive tasks or prompts with consistent elements.
Prompt caching is supported by multiple providers, each with their own implementation.

Supported Providers

Provider | Implementation | Documentation
OpenAI | Automatic prompt caching (KV cache) | OpenAI Prompt Caching
Anthropic | Requires explicit cache_control parameter | Anthropic Prompt Caching
Azure OpenAI | Automatic (inherited from OpenAI) | Azure OpenAI Prompt Caching
Groq | Automatic (similar to OpenAI) | Groq Prompt Caching

Supported Models

Supported models: all recent OpenAI models (gpt-4o and newer).
Prompt caching is enabled automatically for these models. You can use the prompt_cache_key parameter to improve cache hit rates when requests share common prefixes:
from openai import OpenAI

client = OpenAI(
    api_key="your_truefoundry_api_key",
    base_url="https://{controlPlaneURL}/api/llm"
)

response = client.chat.completions.create(
    model="openai-main/gpt-4o",
    messages=[
        {
            "role": "user",
            "content": "Your prompt here"
        }
    ],
    stream=True,
    prompt_cache_key="optional-custom-key"  # OpenAI-specific parameter to improve cache hit rates
)
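To verify that caching is taking effect, you can inspect the cached token count in the usage block. A sketch assuming OpenAI's usage schema (prompt_tokens_details.cached_tokens), with streaming disabled for simplicity:
response = client.chat.completions.create(
    model="openai-main/gpt-4o",
    messages=[{"role": "user", "content": "Your prompt here"}],
    prompt_cache_key="optional-custom-key"
)

usage = response.usage
details = getattr(usage, "prompt_tokens_details", None)
print("prompt tokens:", usage.prompt_tokens)
# cached_tokens > 0 indicates part of the prompt was served from the cache
print("cached tokens:", getattr(details, "cached_tokens", 0) if details else 0)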
Supported models: Claude Opus 4.1, Claude Opus 4, Claude Sonnet 4, Claude Sonnet 3.7, Claude Sonnet 3.5 (deprecated), Claude Haiku 3.5, Claude Haiku 3, Claude Opus 3 (deprecated)
For Anthropic models, you must explicitly add the cache_control parameter to any message content you want to cache:
from openai import OpenAI

client = OpenAI(
    api_key="your_truefoundry_api_key",
    base_url="https://{controlPlaneURL}/api/llm"
)

response = client.chat.completions.create(
    model="anthropic-main/claude-3-7-sonnet-latest",
    messages=[
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": "<TEXT_TO_CACHE>",
                    "cache_control": {"type": "ephemeral", "ttl": "5m"},
                },
            ],
        },
        {
            "role": "user",
            "content": "Enter your prompt here",
        },
    ]
)

Minimum Cacheable Length for Anthropic

Model | Minimum Token Length
Claude Opus 4, Claude Sonnet 4, Claude Sonnet 3.7, Claude Sonnet 3.5, Claude Opus 3 | 1024 tokens
Claude Haiku 3.5, Claude Haiku 3 | 2048 tokens
Supported models: gpt-4o, gpt-4o-mini, gpt-4o-realtime-preview (version 2024-12-17), gpt-4o-mini-realtime-preview (version 2024-12-17), o1 (version 2024-12-17), o3-mini (version 2025-01-31)
from openai import OpenAI

client = OpenAI(
    api_key="your_truefoundry_api_key",
    base_url="https://{controlPlaneURL}/api/llm"
)

response = client.chat.completions.create(
    model="azure-openai-main/gpt-4o",
    messages=[
        {
            "role": "user",
            "content": "Your prompt here"
        }
    ],
    stream=True,
    prompt_cache_key="optional-custom-key"  # OpenAI-specific parameter to improve cache hit rates
)
Supported models: moonshotai/kimi-k2-instruct
from openai import OpenAI

client = OpenAI(
    api_key="your_truefoundry_api_key",
    base_url="https://{controlPlaneURL}/api/llm"
)

response = client.chat.completions.create(
    model="groq-main/moonshotai-kimi-k2-instruct",
    messages=[
        {
            "role": "user",
            "content": "Your prompt here"
        }
    ],
    stream=True
)

Reasoning Models

TrueFoundry AI Gateway provides access to model reasoning processes through thinking/reasoning tokens, available for models from multiple providers including Anthropic, OpenAI, Azure OpenAI, Groq, and Vertex AI. These models expose their internal reasoning process, allowing you to see how they arrive at conclusions. The thinking/reasoning tokens provide step-by-step insight into the model's reasoning.

Supported Reasoning Models

Supported models: o4-mini, o4-preview, o3 model family, o1 model family, gpt-5-mini, gpt-5-nano, gpt-5
from openai import OpenAI

client = OpenAI(
    api_key="TFY_API_KEY",
    base_url="https://{controlPlaneURL}/api/llm"
)

response = client.chat.completions.create(
    model="openai-main/o4-mini",
    messages=[
        {"role": "user", "content": "How to compute 3^3^3?"}
    ],
    reasoning_effort="high",  # Options: "high", "medium", "low"
    max_tokens=8000
)

print(response.choices[0].message.content)
Supported models: gpt-5, gpt-5-mini, gpt-5-nano, o3-pro, codex-mini, o4-mini, o3, o3-mini, o1, o1-mini
from openai import OpenAI

client = OpenAI(
    api_key="TFY_API_KEY",
    base_url="https://{controlPlaneURL}/api/llm"
)

response = client.chat.completions.create(
    model="azure-openai-main/o3-mini",
    messages=[
        {"role": "user", "content": "How to compute 3^3^3?"}
    ],
    reasoning_effort="high",  # Options: "high", "medium", "low"
    max_tokens=8000
)

print(response.choices[0].message.content)
Supported models: Claude Opus 4.1 (claude-opus-4-1-20250805), Claude Opus 4 (claude-opus-4-20250514), Claude Sonnet 4 (claude-sonnet-4-20250514), and Claude Sonnet 3.7 (claude-3-7-sonnet-20250219), available via Anthropic, AWS Bedrock, and Google Vertex AI.

Using OpenAI SDK

from openai import OpenAI

client = OpenAI(
    api_key="TFY_API_KEY",
    base_url="https://{controlPlaneURL}/api/llm"
)

response = client.chat.completions.create(
    model="anthropic-main/claude-3-7-sonnet",
    messages=[
        {"role": "user", "content": "How to compute 3^3^3?"}
    ],
    reasoning_effort="high",  # Options: "high", "medium", "low"
    max_tokens=8000
)

print(response.choices[0].message.content)
For Anthropic models (from Anthropic, Google Vertex AI, or AWS Bedrock), TrueFoundry automatically translates the reasoning_effort parameter into Anthropic's native thinking parameter, since Anthropic does not support reasoning_effort directly. The translation derives the thinking budget from the max_tokens parameter using the following ratios (see the sketch after this list):
  • low: 30% of max_tokens
  • medium: 60% of max_tokens
  • high: 90% of max_tokens
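A rough sketch of the equivalent calculation, purely to illustrate the ratios above (the gateway performs this translation internally; the function name is hypothetical):
def reasoning_effort_to_budget(reasoning_effort: str, max_tokens: int) -> int:
    """Approximate the thinking budget derived from reasoning_effort."""
    ratios = {"low": 0.30, "medium": 0.60, "high": 0.90}
    return int(max_tokens * ratios[reasoning_effort])

# e.g. reasoning_effort="high" with max_tokens=8000 gives a budget of roughly 7200 tokens
print(reasoning_effort_to_budget("high", 8000))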

Using Direct API Calls with Native thinking Parameter

For more precise control with Anthropic models, you can use the native thinking parameter directly:
import requests
import json

url = "https://{controlPlaneUrl}/api/llm/chat/completions"
headers = {
    "Authorization": f"Bearer {TFY_API_KEY}",
}

payload = {
    "messages": [
        {"role": "user", "content": "How to compute 3^3^3?"}
    ],
    "model": "anthropic-main/claude-3-7-sonnet",
    "thinking": {
        "type": "enabled",
        "budget_tokens": 6000
    },
    "max_tokens": 8000
}

response = requests.post(url, headers=headers, json=payload)
result = response.json()
print(result)
Supported models: OpenAI GPT-OSS 20B (openai/gpt-oss-20b), OpenAI GPT-OSS 120B (openai/gpt-oss-120b), Qwen 3 32B (qwen/qwen3-32b), DeepSeek R1 Distill Llama 70B (deepseek-r1-distill-llama-70b)
from openai import OpenAI

client = OpenAI(
    api_key="TFY_API_KEY",
    base_url="https://{controlPlaneURL}/api/llm"
)

response = client.chat.completions.create(
    model="groq-main/deepseek-r1-distill-llama-70b",
    messages=[
        {"role": "user", "content": "How to compute 3^3^3?"}
    ],
    reasoning_effort="high",  # Options: "high", "medium", "low"
    max_tokens=8000
)

print(response.choices[0].message.content)
Supported models: all Gemini 2.5 series models.
These models can be accessed via the Google Vertex AI or Google Gemini providers:
from openai import OpenAI

client = OpenAI(
    api_key="TFY_API_KEY",
    base_url="https://{controlPlaneURL}/api/llm"
)

response = client.chat.completions.create(
    model="vertex-main/gemini-2-5-pro",
    messages=[
        {"role": "user", "content": "How to compute 3^3^3?"}
    ],
    reasoning_effort="high",  # Options: "high", "medium", "low"
    max_tokens=8000
)

print(response.choices[0].message.content)
For Gemini models (from Google Vertex AI or Google Gemini), TrueFoundry automatically translates the reasoning_effort parameter into Gemini's native thinking parameter, since Gemini does not support reasoning_effort directly. The translation uses the max_tokens parameter with the following ratios:
  • low: 30% of max_tokens
  • medium: 60% of max_tokens
  • high: 90% of max_tokens
Note: Gemini 2.5 Pro and Gemini 2.5 Flash come with reasoning enabled by default.

Using Direct API Calls with Native thinking Parameter

For more precise control with Gemini models, you can use the native thinking parameter directly:
import requests
import json

url = "https://{controlPlaneUrl}/api/llm/chat/completions"
headers = {
    "Authorization": f"Bearer {TFY_API_KEY}",
}

payload = {
    "messages": [
        {"role": "user", "content": "How to compute 3^3^3?"}
    ],
    "model": "vertex-main/gemini-2-5-pro",
    "thinking": {
        "type": "enabled",
        "budget_tokens": 6000
    },
    "max_tokens": 8000
}

response = requests.post(url, headers=headers, json=payload)
result = response.json()
print(result)

Response Format

When reasoning tokens are enabled, the response includes both thinking and content sections:
{
  "id": "1742890579083",
  "object": "chat.completion",
  "created": 1742890579,
  "model": "",
  "provider": "aws",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "To compute 3^3^3, you need to understand that exponentiation is evaluated from right to left (right-associative). So:\n\n3^3^3 = 3^(3^3), not (3^3)^3\n\nHere's how to calculate it step by step:\n\n**Step 1:** Calculate the rightmost exponent first\n- 3^3 = 27\n\n**Step 2:** Now calculate 3^27\n- 3^27 = 7,625,597,484,987\n\nTherefore: **3^3^3 = 7,625,597,484,987**\n\n---\n\n**Note:** If you meant (3^3)^3 instead, that would be:\n- (3^3)^3 = 27^3 = 19,683\n\nThis is much smaller than 3^3^3 because:\n- (3^3)^3 = 3^(3×3) = 3^9\n- While 3^3^3 = 3^27\n\nThe difference between 3^9 and 3^27 is enormous!",
        "reasoning_content": "The user is asking how to compute 3^3^3. This is a question about exponentiation and specifically about the order of operations when dealing with repeated exponentiation (also known as tetration or power towers).\n\nThe key thing to understand here is that exponentiation is right-associative, meaning 3^3^3 = 3^(3^3), not (3^3)^3.\n\nSo we need to:\n1. First compute 3^3 = 27\n2. Then compute 3^27\n\nLet me work through this:\n- 3^3 = 27\n- 3^27 = 3^27\n\nNow 3^27 is a large number. Let me think about how to compute it:\n3^27 = 3^(3×9) = (3^3)^9 = 27^9\n\nOr we could compute it directly:\n3^1 = 3\n3^2 = 9\n3^3 = 27\n3^4 = 81\n3^5 = 243\n3^6 = 729\n3^7 = 2,187\n3^8 = 6,561\n3^9 = 19,683\n3^10 = 59,049\n...\n\nActually, let me just state that 3^27 = 7,625,597,484,987\n\nSo 3^3^3 = 3^(3^3) = 3^27 = 7,625,597,484,987"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 45,
    "completion_tokens": 180,
    "total_tokens": 225
  }
}

Streaming with Reasoning Tokens

For streaming responses, the thinking section is always sent before the content section.
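A minimal streaming sketch, assuming the streamed deltas expose a reasoning_content field that mirrors the non-streaming response above (field availability may vary by provider):
stream = client.chat.completions.create(
    model="anthropic-main/claude-3-7-sonnet",
    messages=[{"role": "user", "content": "How to compute 3^3^3?"}],
    reasoning_effort="high",
    max_tokens=8000,
    stream=True
)

for chunk in stream:
    delta = chunk.choices[0].delta
    # Reasoning tokens arrive first, followed by the final answer
    reasoning = getattr(delta, "reasoning_content", None)
    if reasoning:
        print(reasoning, end="", flush=True)
    if delta.content:
        print(delta.content, end="", flush=True)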

API Reference

For detailed API specifications, parameters, and response schemas, see the Chat Completions API Reference.