Sticky Routing

Pin user requests using consistent hash based routing

Imagine a situation where a user is performing some expensive calculation in a request and wants to do so repeatedly over a bunch of API calls. Some of these expensive calculations might have some common computation that you might want to cache between requests. In such a case, you would like to route all the requests from a specific user to a specific replica.

One such example is Prefix Caching with model servers like vLLM and SGLang where generating the next reply in a converstation can benefit from the cached computation of the conversation history so far and result in significantly lower latencies.

User 1's requests are always routed to Pod 1 while User 2's requests are always routed to Pod 2

🚧

Pods can be transitory and sticky routing is best-effort basis

It is important to remember that Kubernetes pods and nodes can be short lived and can be re-scheduled due to external events. It is not advised to put any critical stateful logic in Services.

Considering this, sticky routing is best effort basis. In the event of a pinned pod getting deleted / evicted, all sessions pinned to that pod will be re-assigned randomly to other pods. Similarly, in the event of a scale up or scale down, the sessions might be re-assigned to evenly distribute the load.

Enabling Sticky Routing

To enable sticky routing, first in the Service labels section add a special label tfy_sticky_session_header_nameand add a header name against it. For e.g. x-truefoundry-sticky-session-id

Sticky Routing from UI

Passing the header when making requests

Now, your clients can send this header with a unique value for a "session". For e.g. this value can be a conversation id or user id, anything that identifies a unique session that can benefit from stickyness.

For e.g.

curl -X POST https://my-service.example.com/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -H 'x-truefoundry-sticky-session-id: session-id-qnfjk' \
    -d '{"messages": [{"role": "user", "content": "Hi"}]}'
import requests

response = requests.post(
    "https://my-service.example.com/v1/chat/completions",
    headers={"x-truefoundry-sticky-session-id": "session-id-qnfjk"},
    json={"messages": [{"role": "user", "content": "Hi"}]}
}
response.raise_for_status()
print(response.json())
from openai import OpenAI

client = OpenAI(
    base_url="https://my-service.example.com/v1",
    api_key="<YOUR API KEY>",
    default_headers={"x-truefoundry-sticky-session-id": "session-id-qnfjk"},
)

After the first request, any request that has header x-truefoundry-sticky-session-id: session-id-qnfjk will be routed to the same pod the first request was routed to.

📘

What happens if header is not added?

The request will be routed according to default load balancing policy (random / round robin / least load)