Consider a user who runs an expensive computation on every request and repeats it across many API calls. Part of that computation may be common across requests, so it is worth caching between them. To benefit from the cache, you want all requests from a given user to be routed to the same replica.
One example is prefix caching with model servers such as vLLM and SGLang: generating the next reply in a conversation can reuse the cached computation for the conversation history so far, which results in significantly lower latency.
User 1’s requests are always routed to Pod 1 while User 2’s requests are always routed to Pod 2
Pods can be transitory and sticky routing is best effort
Remember that Kubernetes pods and nodes can be short-lived and can be rescheduled due to external events, so it is not advisable to put critical stateful logic in Services. For this reason, sticky routing is best effort. If a pinned pod is deleted or evicted, all sessions pinned to that pod are reassigned randomly to other pods. Similarly, after a scale-up or scale-down, sessions might be reassigned to distribute the load evenly.
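The best-effort behaviour described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the platform's actual implementation: the `StickyRouter` class, its method names, the pod names, and the choice of a random default policy are all hypothetical, and re-pinning is done lazily on the next request.

```python
import random


class StickyRouter:
    """Toy model of best-effort sticky routing (hypothetical, for illustration only)."""

    def __init__(self, pods, seed=None):
        self.pods = list(pods)
        self.pinned = {}  # session id -> pod name
        self._rng = random.Random(seed)

    def route(self, session_id=None):
        # Assumption: requests without a session id use the default policy (random here).
        if session_id is None:
            return self._rng.choice(self.pods)
        pod = self.pinned.get(session_id)
        if pod not in self.pods:
            # First request for this session, or its pinned pod is gone: pin a new pod.
            pod = self._rng.choice(self.pods)
            self.pinned[session_id] = pod
        return pod

    def remove_pod(self, pod):
        # Simulates a deletion or eviction; affected sessions are re-pinned on next use.
        self.pods.remove(pod)


router = StickyRouter(["pod-1", "pod-2", "pod-3"], seed=0)
first = router.route("conversation-a")
# Repeated requests for the same session land on the same pod.
assert all(router.route("conversation-a") == first for _ in range(10))

router.remove_pod(first)
reassigned = router.route("conversation-a")
assert reassigned != first  # the session moved to a surviving pod
```

The point of the sketch is that the session-to-pod mapping is only a hint: once the pinned pod disappears, any cached state on it is lost and the session starts fresh on another pod, so clients should not rely on stickiness for correctness.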
Enabling Sticky Routing
To enable sticky routing, first add the special label tfy_sticky_session_header_name in the Service labels section and set its value to a header name, for example x-truefoundry-sticky-session-id.
Your clients can now send this header with a value that is unique to a “session”. This value can be, for example, a conversation ID or a user ID: anything that identifies a session that can benefit from stickiness.
For example:
curl -X POST https://my-service.example.com/v1/chat/completions \
-H 'Content-Type: application/json' \
-H 'x-truefoundry-sticky-session-id: session-id-qnfjk' \
-d '{"messages": [{"role": "user", "content": "Hi"}]}'
After the first request, any request that carries the header x-truefoundry-sticky-session-id: session-id-qnfjk is routed to the same pod that served the first request.
The request will be routed according to the default load balancing policy (random, round robin, or least load).
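On the client side, the main requirement is that every request belonging to one session carries the same header value. A minimal Python sketch of this, using only the standard library, might look as follows. The `build_chat_request` helper and the way the session ID is generated are hypothetical, the URL is the placeholder from the curl example, and the header name assumes the label value x-truefoundry-sticky-session-id shown earlier.

```python
import json
import urllib.request
import uuid

# Must match the value set for the tfy_sticky_session_header_name label.
STICKY_HEADER = "x-truefoundry-sticky-session-id"


def build_chat_request(url, session_id, messages):
    """Build (but do not send) a POST request tagged with a sticky session id."""
    body = json.dumps({"messages": messages}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json", STICKY_HEADER: session_id},
    )


# One id per conversation; every turn of that conversation reuses it.
conversation_id = f"session-id-{uuid.uuid4().hex[:8]}"
url = "https://my-service.example.com/v1/chat/completions"  # placeholder URL

history = [{"role": "user", "content": "Hi"}]
turn_1 = build_chat_request(url, conversation_id, history)

history = history + [{"role": "user", "content": "Tell me more"}]
turn_2 = build_chat_request(url, conversation_id, history)

# To actually send a request: urllib.request.urlopen(turn_1)
```

Generating the ID once per conversation, rather than once per request, is what lets later turns reach the pod that already holds the cached conversation prefix.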