Sticky Routing
Pin user requests using consistent hash based routing
Imagine a user performing some expensive computation in a request and repeating it over many API calls. Some of these expensive computations may share intermediate work that you want to cache between requests. In such a case, you would like to route all requests from a specific user to a specific replica.
One such example is Prefix Caching with model servers like vLLM and SGLang, where generating the next reply in a conversation can reuse the cached computation of the conversation history so far, resulting in significantly lower latencies.
Pods can be transitory and sticky routing is best effort
It is important to remember that Kubernetes pods and nodes can be short-lived and can be rescheduled due to external events. It is not advised to put any critical stateful logic in Services.
Considering this, sticky routing is best effort. If a pinned pod gets deleted or evicted, all sessions pinned to that pod are re-assigned randomly to other pods. Similarly, on a scale up or scale down, sessions might be re-assigned to distribute the load evenly.
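To build intuition for why routing is deterministic for a fixed set of pods but can change when pods come and go, here is a minimal, purely illustrative Python sketch of consistent-hash routing. It is not TrueFoundry's actual implementation; the pod names and virtual-point count are arbitrary.
import bisect
import hashlib

def _hash(key: str) -> int:
    return int(hashlib.sha256(key.encode()).hexdigest(), 16)

def build_ring(pods: list[str], points_per_pod: int = 100) -> list[tuple[int, str]]:
    # Place several virtual points per pod on a hash ring.
    return sorted((_hash(f"{pod}#{i}"), pod) for pod in pods for i in range(points_per_pod))

def pick_pod(ring: list[tuple[int, str]], session_id: str) -> str:
    # Walk clockwise to the first virtual point at or after the session's hash.
    hashes = [h for h, _ in ring]
    idx = bisect.bisect(hashes, _hash(session_id)) % len(ring)
    return ring[idx][1]

ring = build_ring(["pod-a", "pod-b", "pod-c"])
print(pick_pod(ring, "session-id-qnfjk"))  # same pod every time for a fixed pod set

# If a pod is evicted or the service scales down, the sessions mapped to the removed
# pod move to other pods; this re-assignment is why stickiness is best effort.
print(pick_pod(build_ring(["pod-a", "pod-c"]), "session-id-qnfjk"))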
Enabling Sticky Routing
To enable sticky routing, add the special label tfy_sticky_session_header_name in the Service labels section and set its value to the header name you want to pin on, for example x-truefoundry-sticky-session-id.
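Conceptually, this label is just a key/value pair on the Service: the key is the special label name and the value is the header to pin on. The snippet below is only an illustration written as a plain Python mapping; it is not the actual Service spec format, which depends on how you define the Service.
labels = {
    # the value is the request header whose value identifies a session
    "tfy_sticky_session_header_name": "x-truefoundry-sticky-session-id",
}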
Passing the header when making requests
Now, your clients can send this header with a unique value per "session". For example, this value can be a conversation ID or a user ID: anything that identifies a unique session that can benefit from stickiness.
For instance, with curl:
curl -X POST https://my-service.example.com/v1/chat/completions \
-H 'Content-Type: application/json' \
-H 'x-truefoundry-sticky-session-id: session-id-qnfjk' \
-d '{"messages": [{"role": "user", "content": "Hi"}]}'
With the requests library in Python:
import requests

response = requests.post(
    "https://my-service.example.com/v1/chat/completions",
    headers={"x-truefoundry-sticky-session-id": "session-id-qnfjk"},
    json={"messages": [{"role": "user", "content": "Hi"}]},
)
response.raise_for_status()
print(response.json())
Or with the OpenAI Python SDK, set the header as a default so every request made through the client carries it (the model name below is a placeholder for whichever model your service serves):
from openai import OpenAI

client = OpenAI(
    base_url="https://my-service.example.com/v1",
    api_key="<YOUR API KEY>",
    default_headers={"x-truefoundry-sticky-session-id": "session-id-qnfjk"},
)
response = client.chat.completions.create(
    model="<YOUR MODEL NAME>",
    messages=[{"role": "user", "content": "Hi"}],
)
print(response.choices[0].message.content)
After the first request, any request that carries the header x-truefoundry-sticky-session-id: session-id-qnfjk will be routed to the same pod that served the first request.
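For instance, a chat client can generate one session ID per conversation and reuse it for every turn, so the whole conversation is served by the same pod and benefits from its cache. A minimal sketch, assuming the same OpenAI-compatible endpoint and URL as the examples above:
import uuid
import requests

# One sticky session per conversation; any stable, unique string works as the value.
session = requests.Session()
session.headers["x-truefoundry-sticky-session-id"] = f"conversation-{uuid.uuid4()}"

messages = []
for user_input in ["Hi", "Tell me more about sticky routing"]:
    messages.append({"role": "user", "content": user_input})
    response = session.post(
        "https://my-service.example.com/v1/chat/completions",
        json={"messages": messages},
    )
    response.raise_for_status()
    reply = response.json()["choices"][0]["message"]["content"]
    messages.append({"role": "assistant", "content": reply})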
What happens if the header is not added?
The request will be routed according to the default load balancing policy (random, round robin, or least load).