Prompt caching is a powerful feature that optimizes your API usage by letting you resume from specific prefixes in your prompts. This approach significantly reduces processing time and costs for repetitive tasks or prompts with consistent elements.
Currently, only Anthropic models support prompt caching. See the Anthropic documentation for more details.

Minimum Cacheable Length

  • 1024 tokens: Claude Opus 4, Claude Sonnet 4, Claude Sonnet 3.7, Claude Sonnet 3.5, Claude Opus 3
  • 2048 tokens: Claude Haiku 3.5, Claude Haiku 3
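
Prompts shorter than these minimums are processed normally but are not written to the cache. As a rough pre-check before adding cache_control, you can estimate the token count locally. The sketch below assumes roughly 4 characters per token and a simple model-name check; both are approximations for illustration, not the model's real tokenizer or an official rule.

def estimate_tokens(text: str) -> int:
    # Rough heuristic: about 4 characters per token for English text.
    return max(1, len(text) // 4)

def meets_cache_minimum(text: str, model_name: str) -> bool:
    # 2048-token minimum for the Haiku models, 1024 for the others (see list above).
    minimum = 2048 if "haiku" in model_name.lower() else 1024
    return estimate_tokens(text) >= minimum

# Example with the placeholders used in the request below.
print(meets_cache_minimum("<TEXT_TO_CACHE>", "MODEL_NAME"))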

Usage

This feature is only available through direct REST API calls. The OpenAI SDK doesn’t recognize the cache_control field, so SDK-based implementations cannot utilize this feature.
Add the cache_control parameter to any message content you want to cache:
import requests
import json

USE_STREAM = True
URL = "https://{controlPlaneUrl}/api/llm/chat/completions"
API_KEY = "TFY_API_KEY"

headers = {
    "Authorization": f"Bearer {API_KEY}",
    "X-TFY-METADATA": '{"tfy_log_request":"true"}'
}

payload = {
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "<TEXT_TO_CACHE>",
                    "cache_control": {
                        "type": "ephemeral"
                    }
                }
            ]
        }
    ],
    "model": "MODEL_NAME",
    "stream": USE_STREAM
}

try:
    # Stream the HTTP response when requesting server-sent events so chunks arrive incrementally
    response = requests.post(URL, headers=headers, json=payload, stream=USE_STREAM)
    response.raise_for_status()

    if USE_STREAM:
        # Server-sent events: each non-empty line is prefixed with "data: "
        # and carries one JSON chunk; the stream ends with "[DONE]".
        for line in response.iter_lines(decode_unicode=True):
            if line.startswith("data: "):
                line = line[6:]
            if line.strip() == "" or line.strip() == "[DONE]":
                continue
            content = json.loads(line).get("choices", [{}])[0].get("delta", {}).get("content", "")
            print(content or "", end="", flush=True)
    else:
        print(response.json()["choices"][0]["message"]["content"])

except requests.exceptions.RequestException as e:
    print("Request error:", e)

except (KeyError, json.JSONDecodeError) as e:
    print("Parsing error:", e)

Monitoring Cache Performance

Monitor cache performance using the following fields, returned in the usage object of the response (or in the message_start event when streaming); a short example of reading them follows the list:
  • cache_creation_input_tokens: Tokens written to the cache when creating a new entry
  • cache_read_input_tokens: Tokens retrieved from the cache for this request
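
As a minimal sketch, assuming the non-streaming response object from the example above, the two fields can be read from the usage object of the JSON body:

# Inspect cache metrics on a non-streaming response.
usage = response.json().get("usage", {})
print("Cache writes:", usage.get("cache_creation_input_tokens", 0))
print("Cache reads:", usage.get("cache_read_input_tokens", 0))

# On the first request with a new prefix, expect cache_creation_input_tokens > 0;
# on repeat requests with the identical prefix, expect cache_read_input_tokens > 0.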