Imagine a user who runs an expensive calculation in a request and repeats it across many API calls. These calculations often share some common computation that you would want to cache between requests. In that case, you would like to route all requests from a given user to the same replica, so that the cache built by earlier requests can be reused.

One such example is prefix caching with model servers like vLLM and SGLang, where generating the next reply in a conversation can reuse the cached computation of the conversation history so far, resulting in significantly lower latency.

User 1’s requests are always routed to Pod 1 while User 2’s requests are always routed to Pod 2

Pods can be transitory and sticky routing is best-effort

It is important to remember that Kubernetes pods and nodes can be short-lived and can be rescheduled due to external events. Avoid putting any critical stateful logic in Services.

Because of this, sticky routing works on a best-effort basis. If a pinned pod is deleted or evicted, all sessions pinned to that pod are re-assigned randomly to other pods. Similarly, when the Service scales up or down, sessions might be re-assigned to distribute the load evenly.
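The best-effort behaviour described above can be pictured with a small toy model. This is only an illustrative sketch, not the platform's actual routing logic: the StickyRouter class, its methods, and the pod names are hypothetical, and real load balancing is modelled here as a random pick.

```python
import random


class StickyRouter:
    """Toy model of best-effort session pinning (hypothetical, illustrative only)."""

    def __init__(self, pods):
        self.pods = list(pods)
        self.pinned = {}  # session id -> pod name

    def route(self, session_id=None):
        # Requests without a session id use ordinary load balancing
        # (modelled here as a random pick).
        if session_id is None:
            return random.choice(self.pods)
        pod = self.pinned.get(session_id)
        # Pin on first use, or re-pin if the previous pod no longer exists.
        if pod not in self.pods:
            pod = random.choice(self.pods)
            self.pinned[session_id] = pod
        return pod

    def remove_pod(self, pod):
        # Simulates a deletion or eviction; affected sessions are re-pinned
        # lazily on their next request.
        self.pods.remove(pod)


router = StickyRouter(["pod-1", "pod-2", "pod-3"])
first = router.route("conversation-42")
assert router.route("conversation-42") == first  # same pod while it is alive

router.remove_pod(first)
second = router.route("conversation-42")
assert second != first  # re-assigned to a surviving pod
```

The practical consequence is that anything cached on the pinned pod should be treated as an optimisation: after a re-assignment the new pod simply starts with a cold cache.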

Enabling Sticky Routing

To enable sticky routing, go to the Labels section of the Service and add the special label tfy_sticky_session_header_name, with the header name as its value, e.g. x-truefoundry-sticky-session-id

Sticky Routing from UI

Passing the header when making requests

Now your clients can send this header with a value that is unique to a “session”. For example, the value can be a conversation id or a user id: anything that identifies a session that can benefit from stickiness.

For example:

curl -X POST https://my-service.example.com/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -H 'x-truefoundry-sticky-session-id: session-id-qnfjk' \
    -d '{"messages": [{"role": "user", "content": "Hi"}]}'

After the first request, any request with the header x-truefoundry-sticky-session-id: session-id-qnfjk is routed to the same pod that served the first request.
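The same request can be built from application code. The following is a minimal sketch using only the Python standard library; the build_chat_request helper is hypothetical, the URL is the placeholder from the curl example, and the header name is assumed to match the label value configured on the Service.

```python
import json
import urllib.request

# Assumed to match the value of the tfy_sticky_session_header_name label.
STICKY_HEADER = "x-truefoundry-sticky-session-id"
# Placeholder endpoint taken from the curl example.
URL = "https://my-service.example.com/v1/chat/completions"


def build_chat_request(session_id, messages):
    """Build a POST request that carries the sticky session header."""
    body = json.dumps({"messages": messages}).encode("utf-8")
    return urllib.request.Request(
        URL,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            STICKY_HEADER: session_id,
        },
    )


# Reuse one session id for every turn of the same conversation.
session_id = "session-id-qnfjk"
history = [{"role": "user", "content": "Hi"}]
request = build_chat_request(session_id, history)

# To actually send it (requires a reachable endpoint):
# with urllib.request.urlopen(request) as response:
#     reply = json.load(response)
```

Keeping the session id stable for the lifetime of a conversation, and different across conversations, is what lets each conversation benefit from the cache on its pinned pod.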

What happens if the header is not added?

The request is routed according to the default load balancing policy (random / round robin / least load).