Overview

When you interact with a prompt in the AI Gateway playground, the playground UI renders tool calls, their arguments, tool results, and the LLM responses as they are streamed back from the gateway. If you want to do the same in your own application, or in a UI other than the TrueFoundry playground, you can use the Agent API described below.

Quickstart

Get started with the Agent API in 3 simple steps:

Set your API token and base URL

export TFY_API_TOKEN=your-token-here
export TFY_CONTROL_PLANE_BASE_URL=https://your-truefoundry-instance.com
See Authentication for details on getting your token.

Make your first request

http --stream POST ${TFY_CONTROL_PLANE_BASE_URL}/api/llm/agent/chat/completions \
  Authorization:"Bearer ${TFY_API_TOKEN}" \
  Content-Type:application/json \
  model=openai/gpt-4o stream:=true \
  messages:='[{"role":"user","content":"Help me find a model that can generate images"}]' \
  mcp_servers:='[{"integration_fqn":"common-tools","enable_all_tools":false,"tools":[{"name":"web_search"}]}]'

Understand the response

You’ll receive a streaming response with:
  • Assistant content: The LLM’s text response
  • Tool calls: When the assistant decides to use tools (like web search)
  • Tool results: Output from executed tools
  • Follow-up: The assistant processes tool results and continues
The agent will automatically use the web_search tool to find image generation models and provide recommendations.

Request examples

Call with registered MCP servers

When you have MCP servers already registered in your TrueFoundry AI Gateway, you can reference them using their integration_fqn:
http --stream POST ${TFY_CONTROL_PLANE_BASE_URL}/api/llm/agent/chat/completions \
  Authorization:"Bearer ${TFY_API_TOKEN}" \
  Content-Type:application/json \
  model=openai/gpt-4o stream:=true \
  messages:='[{"role":"user","content":"Search for Python tutorials and run a simple code example"}]' \
  mcp_servers:='[{"integration_fqn":"common-tools","enable_all_tools":false,"tools":[{"name":"web_search"},{"name":"code_executor"}]}]'

Use external MCP servers

You can connect to any accessible MCP server without pre-registering it in the gateway:
http --stream POST ${TFY_CONTROL_PLANE_BASE_URL}/api/llm/agent/chat/completions \
  Authorization:"Bearer ${TFY_API_TOKEN}" \
  Content-Type:application/json \
  model=openai/gpt-4o stream:=true \
  messages:='[{"role":"user","content":"Search for machine learning models and packages"}]' \
  mcp_servers:='[{"url":"https://huggingface.co/mcp","enable_all_tools":false,"headers":{"Authorization":"Bearer <huggingface-token>"},"tools":[{"name":"model_search"}]}]'

Override auth headers

You can override authentication per MCP server entry using the headers field. This works for both registered servers (integration_fqn) and external servers (url).
http --stream POST ${TFY_CONTROL_PLANE_BASE_URL}/api/llm/agent/chat/completions \
  Authorization:"Bearer ${TFY_API_TOKEN}" \
  Content-Type:application/json \
  model=openai/gpt-4o stream:=true \
  messages:='[{"role":"user","content":"Think step by step and search for information about Python"}]' \
  mcp_servers:='[
    {
      "integration_fqn":"common-tools",
      "enable_all_tools": false,
      "tools": [{"name":"sequential_thinking"},{"name":"web_search"}],
      "headers": {"X-API-Key":"<custom-api-key>", "X-Custom-Header":"custom-value"}
    }
  ]'

API Reference

Request parameters

Parameter        Type     Required  Default  Description
model            string   ✓         -        The LLM model to use (e.g., "openai/gpt-4o")
messages         array    ✓         -        Array of message objects with role and content
mcp_servers      array    ✓         -        Array of MCP Server configurations (see below)
max_tokens       number   ✗         -        Maximum number of tokens to generate
temperature      number   ✗         -        Controls randomness in the response (0.0 to 2.0)
top_p            number   ✗         -        Nucleus sampling parameter (0.0 to 1.0)
top_k            number   ✗         -        Top-k sampling parameter
stream           boolean  ✓         -        Whether to stream responses (only true is supported)
iteration_limit  number   ✗         5        Maximum tool call iterations (1-20)
About tool call iterations: An iteration represents a full loop of user → model → tool call → tool result → model. The iteration_limit sets the maximum number of such loops per request to prevent runaway chains.
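
Putting these parameters together, a complete request body (values are illustrative) might look like:

{
  "model": "openai/gpt-4o",
  "stream": true,
  "messages": [{"role": "user", "content": "Find recent Python releases"}],
  "mcp_servers": [{"integration_fqn": "common-tools", "enable_all_tools": false, "tools": [{"name": "web_search"}]}],
  "temperature": 0.2,
  "max_tokens": 1024,
  "iteration_limit": 3
}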

MCP server configuration

Each entry in the mcp_servers array should include:

Parameter         Type     Required  Default  Description
integration_fqn   string   ✗*        -        Fully qualified name of the MCP Server integration
url               string   ✗*        -        URL of the MCP server (must be a valid URL)
headers           object   ✗         -        HTTP headers to send to the MCP server
enable_all_tools  boolean  ✗         true     Whether to enable all tools for this server
tools             array    ✗         -        Array of specific tools to enable
*Note: Either integration_fqn or url must be provided, but not both.
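
For example, a registered server entry and an external server entry (the URL and token below are placeholders) would look like:

"mcp_servers": [
  {
    "integration_fqn": "common-tools",
    "enable_all_tools": false,
    "tools": [{"name": "web_search"}]
  },
  {
    "url": "https://<your-mcp-server>/mcp",
    "headers": {"Authorization": "Bearer <token>"},
    "enable_all_tools": true
  }
]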

Tool configuration

Each entry in the tools array should include:

Parameter  Type    Required  Description
name       string  ✓         The name of the tool as it appears in the MCP server

Streaming Response

The Chat API uses Server-Sent Events (SSE) to stream responses in real-time. This includes assistant text, tool calls (function names and their arguments), and tool results.
Both assistant content and tool call arguments are streamed incrementally across multiple chunks. You must accumulate these fragments to build complete responses.
Compatibility: The streaming format follows OpenAI Chat Completions streaming semantics. See the official guide: OpenAI streaming responses. In addition, the Gateway emits tool result chunks as extra delta events (with role: "tool", tool_call_id, and content) to carry tool outputs.

Quick Reference

Event              Relevant Fields                                            Description
Content            delta.role (first or every chunk), delta.content           Assistant text streamed over multiple chunks
Tool Call (start)  delta.tool_calls[].function.name, delta.tool_calls[].id    Announces a function call and its id
Tool Call (args)   delta.tool_calls[].function.arguments                      Arguments streamed in multiple chunks; concatenate
Tool Result        delta.role == "tool", delta.tool_call_id, delta.content    Tool output tied to a tool call id
Done               choices[].finish_reason == "stop"                          Signals end of a message

SSE Envelope

Each SSE line delivers a JSON payload:
data: {"id": "event_id", "object": "chat.completion.chunk", "choices": [...]}

Event Types
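
The payloads below are illustrative (ids and values are made up) and show the delta shapes described in the quick reference above:

Assistant content (text streamed across chunks):
data: {"id": "msg_1", "object": "chat.completion.chunk", "choices": [{"delta": {"role": "assistant", "content": "Searching for "}, "finish_reason": null}]}

Tool call start (function name and id announced):
data: {"id": "msg_1", "object": "chat.completion.chunk", "choices": [{"delta": {"tool_calls": [{"index": 0, "id": "call_abc", "type": "function", "function": {"name": "web_search", "arguments": ""}}]}, "finish_reason": null}]}

Tool call arguments (fragments to concatenate):
data: {"id": "msg_1", "object": "chat.completion.chunk", "choices": [{"delta": {"tool_calls": [{"index": 0, "function": {"arguments": "{\"query\": \"Python"}}]}, "finish_reason": null}]}

Tool result (extra delta emitted by the gateway):
data: {"id": "msg_2", "object": "chat.completion.chunk", "choices": [{"delta": {"role": "tool", "tool_call_id": "call_abc", "content": "1. Stable Diffusion ..."}, "finish_reason": null}]}

Done (end of a message):
data: {"id": "msg_2", "object": "chat.completion.chunk", "choices": [{"delta": {}, "finish_reason": "stop"}]}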

Processing streaming events

How to Get the Code Snippet

You can generate a ready-to-use code snippet directly from the AI Gateway web UI:
  1. Go to the Playground or your MCP Server group in the AI Gateway.
  2. Click the API Code Snippet button.
  3. Copy the generated code and use it in your application.
Note that the code snippet generated from the playground only includes the last assistant message; it does not include the tool calls and tool results from that conversation.
[Image: TrueFoundry AI Gateway interface showing the API Code Snippet button (Agent API Code Snippet - Button)]

[Image: Generated code snippet for using the Agent API with MCP servers (Agent API Code Snippet - Example)]

Process streaming in code

OpenAI Client example

You can use the OpenAI client library with a custom base URL to handle the streaming response:

Configure client

from openai import OpenAI

client = OpenAI(
    base_url="https://<your-control-plane-url>/api/llm/agent",
    api_key="****",
)
  • Base URL: Point this at your Gateway URL with the /api/llm/agent path, which targets the Agent API directly.

Define common Agent configuration

AGENT_CONFIG = {
    "model": "openai/gpt-4o",
    "stream": True,
    "extra_body": {
        "mcp_servers": [{
            "integration_fqn": "common-tools",
                "enable_all_tools": False,
            "tools": [{"name": "web_search"}, {"name": "code_executor"}]
        }],
        "iteration_limit": 10
    }
}
  • model: Provider/model routed via Gateway.
  • mcp_servers: Select specific tools from an MCP server.
  • iteration_limit: Max agent tool-call iterations.

Collect streamed chunks into full messages

The get_messages function processes the streaming response to reconstruct complete messages. Let’s break it down:
1. Initialize and detect new messages
def get_messages(chat_stream):
    messages = []
    previous_chunk_id = None

    for chunk in chat_stream:
        if not chunk.choices:
            continue

        delta = chunk.choices[0].delta

        # Detect new message when chunk ID changes
        if chunk.id != previous_chunk_id:
            previous_chunk_id = chunk.id
            messages.append({"role": delta.role, "content": ""})

        current_message = messages[-1]
What’s happening: Each streaming chunk has an ID. When the ID changes, it signals a new message starting (assistant, tool result, etc.). We create a new message object with the role and empty content.
2. Handle tool result messages
        # Set tool_call_id for tool result messages
        if delta.role == 'tool':
            current_message["tool_call_id"] = delta.tool_call_id
What’s happening: Tool result messages have role: "tool" and include a tool_call_id that links the result back to the specific tool call that generated it.
3. Accumulate message content
        # Accumulate content for all message types (assistant text, tool results)
        current_message["content"] += delta.content if delta.content else ""
What’s happening: Both assistant responses and tool results stream their content incrementally. We concatenate each chunk’s content to build the complete message.
4. Handle tool calls (function name and arguments)
        # Process tool calls - function names and arguments are streamed
        for tool_call in delta.tool_calls or []:
            # Initialize tool_calls array if needed
            if "tool_calls" not in current_message:
                current_message["tool_calls"] = []

            # Add new tool call if index exceeds current array length
            if tool_call.index >= len(current_message["tool_calls"]):
                current_message["tool_calls"].append({
                    "id": tool_call.id,
                    "type": "function",
        "function": {
                        "name": tool_call.function.name,
                        "arguments": tool_call.function.arguments or ""
                    }
                })

            # Accumulate function arguments (streamed incrementally)
            current_message["tool_calls"][tool_call.index]["function"]["arguments"] += tool_call.function.arguments or ""
What’s happening:
  • Tool calls are streamed with function names first, then arguments in chunks
  • Each tool call has an index to handle multiple simultaneous tool calls
  • We accumulate the arguments string as it streams in (like {"query": "Python tutorials"})
5. Apply Anthropic fix for empty arguments
    # Anthropic fix: normalize empty argument strings to valid JSON
    for msg in messages:
        if msg["role"] == "assistant" and len(msg.get("tool_calls", [])) > 0:
            for tool_call in msg["tool_calls"]:
                if not tool_call["function"]["arguments"].strip():
                    tool_call["function"]["arguments"] = "{}"

    return messages
What’s happening: Anthropic models sometimes send empty strings "" for tool arguments, but the OpenAI format expects "{}" for empty JSON objects. We normalize this.

Helper to send/merge a turn

def chat_with_agent(messages):
    stream = client.chat.completions.create(messages=messages, **AGENT_CONFIG)
    messages += get_messages(stream)
    return messages

Run a conversation and print outputs

conversation = [
    {"role": "user", "content": "Search for Python tutorials and run a simple hello world example"}
]

conversation = chat_with_agent(conversation)

conversation.append(
    {"role": "user", "content": "Now scrape a Python documentation page and extract key concepts"}
)

conversation = chat_with_agent(conversation)

for message in conversation:
    if message["role"] in ["user", "assistant"]:
        print(f'{message["role"].title()}: {message["content"]}')
    elif message["role"] == "tool":
        print(f'Tool Result ({message["tool_call_id"]}):\n{message["content"]}')
    if message["role"] == "assistant" and len(message.get("tool_calls", [])) > 0:
        for tool_call in message["tool_calls"]:
            print(f'Tool call id: {tool_call["id"]}')
            print(f'Tool call function: {tool_call["function"]["name"]}')
            print(f'Tool call arguments: {tool_call["function"]["arguments"]}\n\n')
    print()

Tool call flow

The streaming API follows this flow when tools are involved:
  1. Assistant Response Start: Initial content from the LLM (streamed)
  2. Tool Call Event: Function name, then arguments streamed incrementally
  3. Tool Execution: The gateway executes the complete tool call
  4. Tool Result Event: Results are streamed back
  5. Assistant Follow-up: The assistant processes results and continues
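
As a rough sketch using the client and AGENT_CONFIG defined in the walkthrough above, these stages can be observed directly from the streamed deltas (tool execution itself, step 3, happens inside the gateway and does not appear as a chunk):

# Illustrative: label each streamed chunk with the stage of the tool call flow it belongs to
stream = client.chat.completions.create(
    messages=[{"role": "user", "content": "Search for Python tutorials"}],
    **AGENT_CONFIG,
)
for chunk in stream:
    if not chunk.choices:
        continue
    choice = chunk.choices[0]
    delta = choice.delta
    if delta.tool_calls:  # step 2: function name and argument fragments
        fragment = delta.tool_calls[0]
        print("tool call:", fragment.function.name or "", fragment.function.arguments or "")
    elif delta.role == "tool":  # step 4: tool result streamed back
        print("tool result:", delta.content or "")
    elif delta.content:  # steps 1 and 5: assistant text before and after tool use
        print("assistant:", delta.content)
    if choice.finish_reason == "stop":  # end of one message in the flow
        print("-- message complete --")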

Stream termination

The stream ends with either:
  • A [DONE] message indicating completion
  • An error event if something goes wrong
  • Client disconnection