TrueFoundry AI Gateway provides access to model reasoning processes through thinking/reasoning tokens, currently available for Claude 3.7 Sonnet (via Anthropic, AWS Bedrock, and Google Vertex AI).
These models expose their internal reasoning, so you can follow step by step how they arrive at a conclusion.
Enabling Reasoning Tokens
To enable thinking/reasoning tokens, your request must include:
- The header: X-TFY-STRICT-OPENAI: false
- A thinking field in the request body
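import requests

API_KEY = "your-api-key"  # placeholder: replace with your own API key

# Replace {controlPlaneUrl} with your control plane host
url = "https://{controlPlaneUrl}/api/llm/chat/completions"
headers = {
    "Authorization": f"Bearer {API_KEY}",
    # Required for thinking tokens to be returned
    "X-TFY-STRICT-OPENAI": "false",
}
payload = {
    "messages": [
        {"role": "user", "content": "How to compute 3^3^3?"}
    ],
    "model": "anthropic/claude-3-7",
    "thinking": {
        "type": "enabled",
        "budget_tokens": 16000,
    },
    # Larger than budget_tokens so there is room left for the final answer
    "max_tokens": 18000,
}
response = requests.post(url, headers=headers, json=payload)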
When the X-TFY-STRICT-OPENAI header is set to false, the response is no longer OpenAI-compliant, because it carries thinking content that the OpenAI response format does not define.
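The two request requirements above (the header and the thinking field) can be wrapped in a small helper. This is only a sketch: the function name `build_thinking_request` is hypothetical, and the check that `budget_tokens` stays below `max_tokens` is an assumption based on the sample request, not a documented gateway rule.

```python
def build_thinking_request(prompt, model, budget_tokens=16000, max_tokens=18000):
    """Build the extra header and the JSON body for a request with thinking enabled.

    Hypothetical helper; the Authorization header must still be added by the caller.
    """
    # Assumption: the thinking budget must leave room for the final answer.
    if budget_tokens >= max_tokens:
        raise ValueError("budget_tokens must be smaller than max_tokens")

    headers = {"X-TFY-STRICT-OPENAI": "false"}
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "thinking": {"type": "enabled", "budget_tokens": budget_tokens},
        "max_tokens": max_tokens,
    }
    return headers, payload
```

The returned `headers` can be merged with the Authorization header and passed to `requests.post` together with `json=payload`.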
When reasoning tokens are enabled, the response includes both thinking and content sections:
{
  "id": "1742890579083",
  "object": "chat.completion",
  "created": 1742890579,
  "model": "",
  "provider": "aws",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": [
          {
            "type": "thinking",
            "thinking": "The user has asked a complex question about quantum mechanics. To provide a useful answer, I should first break down the core concepts and then explain them in simple terms before diving into advanced details."
          },
          {
            "type": "text",
            "text": "Quantum mechanics is a branch of physics that explains how particles behave at very small scales. Unlike classical physics, where objects have definite positions and velocities, quantum particles exist in a superposition of states until measured. Would you like a more detailed explanation or examples?"
          }
        ]
      },
      "finish_reason": "end_turn"
    }
  ],
  "usage": {
    "prompt_tokens": 45,
    "completion_tokens": 180,
    "total_tokens": 225
  }
}
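Because the message content is a list of typed blocks rather than a single string, client code has to separate the reasoning from the answer. A minimal sketch, assuming the response has already been parsed into a dict shaped like the example above (the function name `split_thinking_and_text` is hypothetical):

```python
def split_thinking_and_text(response):
    """Return (thinking, text) from the first choice of a parsed response dict."""
    content = response["choices"][0]["message"]["content"]

    # Fall back gracefully if the content is a plain string.
    if isinstance(content, str):
        return "", content

    thinking_parts = []
    text_parts = []
    for block in content:
        if block.get("type") == "thinking":
            thinking_parts.append(block.get("thinking", ""))
        elif block.get("type") == "text":
            text_parts.append(block.get("text", ""))
    return "".join(thinking_parts), "".join(text_parts)
```

With the request from earlier, this would be called as `split_thinking_and_text(response.json())`.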
Streaming with Reasoning Tokens
For streaming responses, the thinking section is always sent before the content section.
Thinking Token Chunk
{
  "id": "aws-1742890615621",
  "object": "chat.completion.chunk",
  "created": 1742890615,
  "model": "",
  "provider": "aws",
  "choices": [
    {
      "index": 0,
      "delta": {
        "role": "assistant",
        "thinking": "The user is asking about the differences between AI and machine learning. I should start by defining AI in general and then narrow down to how ML fits into it."
      }
    }
  ]
}
Content Token Chunk
{
  "id": "aws-1742890615621",
  "object": "chat.completion.chunk",
  "created": 1742890615,
  "model": "",
  "provider": "aws",
  "choices": [
    {
      "index": 0,
      "delta": {
        "role": "assistant",
        "content": "Artificial Intelligence (AI) is a broad field of computer science focused on building systems that can perform tasks requiring human intelligence. Machine Learning (ML) is a subset of AI that enables computers to learn patterns from data and improve performance over time without explicit programming."
      }
    }
  ]
}
In streaming responses, the thinking chunk typically arrives first, followed by the content chunks.
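A client usually needs to reassemble the streamed pieces into one reasoning string and one answer string. The sketch below assumes each chunk has already been decoded into a dict shaped like the two examples above; how the chunks are read off the wire is not covered here, and the function name `collect_stream` is hypothetical.

```python
def collect_stream(chunks):
    """Accumulate thinking and content deltas from an iterable of chunk dicts."""
    thinking_parts = []
    content_parts = []
    for chunk in chunks:
        for choice in chunk.get("choices", []):
            delta = choice.get("delta", {})
            # A delta carries either a "thinking" or a "content" fragment.
            if delta.get("thinking"):
                thinking_parts.append(delta["thinking"])
            if delta.get("content"):
                content_parts.append(delta["content"])
    return "".join(thinking_parts), "".join(content_parts)
```

Keeping the two kinds of delta in separate buffers means the result does not depend on the order in which the chunks arrive.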