Reasoning Models (Claude Only)
Thinking/reasoning tokens will be available via TrueFoundry LLM Gateway for Claude 3.7 Sonnet (accessible through Anthropic, AWS Bedrock, and Google Vertex AI).
These models expose their internal reasoning: the thinking/reasoning tokens show, step by step, how the model arrives at its answer.
OpenAI Compliance in Responses
When the X-TFY-STRICT-OPENAI header is not included in the request, the response remains fully OpenAI-compliant. Enabling thinking/reasoning tokens makes the response no longer OpenAI-compliant, because it adds reasoning content that the OpenAI response format does not define.
A standard OpenAI-compliant response structure looks like this:
{
  "id": "1742890436073",
  "object": "chat.completion",
  "created": 1742890436,
  "model": "",
  "provider": "aws",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! How can I assist you today? I'm here to provide information, answer questions, or help with tasks. Feel free to ask anything!"
      },
      "finish_reason": "end_turn"
    }
  ],
  "usage": {
    "prompt_tokens": 20,
    "completion_tokens": 120,
    "total_tokens": 140
  }
}
Enabling Thinking/Reasoning Tokens
To enable thinking/reasoning tokens, the request must include:
- The header: X-TFY-STRICT-OPENAI: false
- A thinking field in the request body
Example Request Body:
{
  "messages": [
    {
      "role": "user",
      "content": "How to compute 3^3^3?"
    }
  ],
  "model": "anthropic/claude-3-7",
  "thinking": {
    "type": "enabled",
    "budget_tokens": 16000
  },
  "max_tokens": 18000,
  "stream": false
}
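The two requirements above (the header and the thinking field) can be assembled in a small helper. This is a minimal sketch, not an official client: the function name `build_thinking_request` is hypothetical, authentication headers and the gateway URL are deliberately left out because this page does not specify them, and the check that `budget_tokens` stays below `max_tokens` is an assumption based on Anthropic's extended-thinking rules rather than on anything stated here.

```python
import json


def build_thinking_request(prompt, model="anthropic/claude-3-7",
                           budget_tokens=16000, max_tokens=18000):
    """Return (headers, body) for a chat completion request with thinking enabled.

    Hypothetical helper: it only builds the payload and sends nothing.
    """
    # Assumption: the thinking budget must leave room for the final answer.
    if budget_tokens >= max_tokens:
        raise ValueError("budget_tokens must be smaller than max_tokens")

    headers = {
        "Content-Type": "application/json",
        # Opt out of strict OpenAI compliance so thinking blocks are returned.
        "X-TFY-STRICT-OPENAI": "false",
    }
    body = {
        "messages": [{"role": "user", "content": prompt}],
        "model": model,
        "thinking": {"type": "enabled", "budget_tokens": budget_tokens},
        "max_tokens": max_tokens,
        "stream": False,
    }
    return headers, body


headers, body = build_thinking_request("How to compute 3^3^3?")
print(json.dumps(body, indent=2))
```

The returned pair can be passed to any HTTP client; add whatever authorization header your gateway deployment requires.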
Example Response (the text is illustrative and does not answer the request above):
{
  "id": "1742890579083",
  "object": "chat.completion",
  "created": 1742890579,
  "model": "",
  "provider": "aws",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": [
          {
            "type": "thinking",
            "thinking": "The user has asked a complex question about quantum mechanics. To provide a useful answer, I should first break down the core concepts and then explain them in simple terms before diving into advanced details."
          },
          {
            "type": "text",
            "text": "Quantum mechanics is a branch of physics that explains how particles behave at very small scales. Unlike classical physics, where objects have definite positions and velocities, quantum particles exist in a superposition of states until measured. Would you like a more detailed explanation or examples?"
          }
        ]
      },
      "finish_reason": "end_turn"
    }
  ],
  "usage": {
    "prompt_tokens": 45,
    "completion_tokens": 180,
    "total_tokens": 225
  }
}
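Because `message.content` is a plain string in the OpenAI-compliant response but a list of typed blocks once thinking is enabled, client code has to handle both shapes. The following is a minimal sketch of one way to do that; `split_message_content` is a hypothetical name, and the block types it recognizes ("thinking" and "text") are simply the ones shown in the example response.

```python
def split_message_content(message):
    """Return (thinking, text) from a chat completion message dict.

    Handles both response shapes: a plain string (OpenAI-compliant)
    and a list of typed blocks (thinking enabled).
    """
    content = message.get("content")
    if isinstance(content, str):
        # OpenAI-compliant shape: no thinking is present.
        return "", content

    thinking_parts, text_parts = [], []
    for block in content or []:
        if block.get("type") == "thinking":
            thinking_parts.append(block.get("thinking", ""))
        elif block.get("type") == "text":
            text_parts.append(block.get("text", ""))
        # Any other block type is ignored in this sketch.
    return "".join(thinking_parts), "".join(text_parts)


# Usage with a trimmed-down message in the thinking-enabled shape.
message = {
    "role": "assistant",
    "content": [
        {"type": "thinking", "thinking": "Break the problem down first."},
        {"type": "text", "text": "Here is the answer."},
    ],
}
thinking, text = split_message_content(message)
print("thinking:", thinking)
print("text:", text)
```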
Streaming Responses with Thinking Tokens
For streaming responses, the thinking section is always sent before the content section.
Example Chunk - Thinking Token:
{
  "id": "aws-1742890615621",
  "object": "chat.completion.chunk",
  "created": 1742890615,
  "model": "",
  "provider": "aws",
  "choices": [
    {
      "index": 0,
      "delta": {
        "role": "assistant",
        "thinking": "The user is asking about the differences between AI and machine learning. I should start by defining AI in general and then narrow down to how ML fits into it."
      }
    }
  ]
}
Example Chunk - Content Token:
{
  "id": "aws-1742890615621",
  "object": "chat.completion.chunk",
  "created": 1742890615,
  "model": "",
  "provider": "aws",
  "choices": [
    {
      "index": 0,
      "delta": {
        "role": "assistant",
        "content": "Artificial Intelligence (AI) is a broad field of computer science focused on building systems that can perform tasks requiring human intelligence. Machine Learning (ML) is a subset of AI that enables computers to learn patterns from data and improve performance over time without explicit programming."
      }
    }
  ]
}
In a streaming response, the thinking chunk typically arrives first, followed by the content.
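A client consuming the stream can route each chunk by checking which key its `delta` carries. The sketch below assumes the chunks have already been parsed into dicts shaped like the two examples above; how they arrive over the wire is not covered on this page, and `collect_stream` is a hypothetical helper name.

```python
def collect_stream(chunks):
    """Accumulate thinking and content text from parsed streaming chunks.

    `chunks` is any iterable of dicts shaped like the example chunks:
    each choice has a `delta` holding either `thinking` or `content`.
    """
    thinking_parts, content_parts = [], []
    for chunk in chunks:
        for choice in chunk.get("choices", []):
            delta = choice.get("delta", {})
            if delta.get("thinking"):
                thinking_parts.append(delta["thinking"])
            if delta.get("content"):
                content_parts.append(delta["content"])
    return "".join(thinking_parts), "".join(content_parts)


# Usage with two hand-written chunks (thinking first, then content).
chunks = [
    {"choices": [{"index": 0, "delta": {"role": "assistant", "thinking": "Define AI, then ML."}}]},
    {"choices": [{"index": 0, "delta": {"role": "assistant", "content": "ML is a subset of AI."}}]},
]
thinking, content = collect_stream(chunks)
print("thinking:", thinking)
print("content:", content)
```

Keeping the two buffers separate makes it straightforward to display or log the reasoning independently of the final answer.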