Rollout Strategy

Configure how to rollout a new version of your application

Rolling out a new version of your application shouldn't incur downtime for your application - specially for production applications. In some cases, you might also want to test the new version of your application on a small segment of traffic before rolling out to everyone to catch unexpected issues. This is where the rollout strategy comes in.

Let's imagine you have 5 replicas running for v1 of your service. We want to roll out our new shiny version v2. There are a few ways we can roll out v2:

  1. Stop all 5 replicas of v1 and then bring up 5 replicas of v2. - This will cause a downtime of the application while the 5 replicas of v2 are starting up and hence is not a great idea.
  2. Start 1 replica of v2 - when 1 replica of v2 is up, stop 1 replica of v1. Continue the process again for the remaining 4 replicas till all 5 replicas are swapped - This makes sure there are always 5 replicas of v2 that are running and there is no disruption in quality of the service. However, there is a small duration when 6 replicas will be live since we only stop v1 replica after v2 replica is live. This strategy is called Rolling Update Strategy.
  3. Stop 1 replica of v1, then start 1 replica of v2. Continue the process till all 5 replicas of v1 are swapped. If we do this, there is a small period of time when 4 replicas are serving the traffic and hence this can cause a slight disruption in traffic, specially if all 5 replicas are fully required to serve the traffic. This strategy is called Rolling Update Strategy.
  4. Bring up 1 replica of v2 - once it is up, divert 20% traffic to v2. Then stop one replica of v1, bring up another replica of v2 and divert additional 20% traffic - hence making 40% traffic served by v2 and 60% traffic served by v1. You can continue this process till 100% traffic is served by v2 and 0% is served by v1 - This is what is called Canary deployment. . This might seem similar to the Rolling Update strategy defined above - however, in case of Canary, you can shift traffic smaller than the traffic handled by one pod, pause in between the rollout and control the traffic shift increments in a more finegrained way.
  5. Bring up 5 replicas completely of v2, shift 100% traffic to v2 and then stop the 5 replicas of v1. - This is what is called BlueGreen Deployment.

In the approaches #2 and #3, we might want to swap more then 1 replica at a time to increase the speed of rollout. Since otherwise, if there are a 1000 replicas, rolling out one replica at a time can cause the deployment to take multiple hours.

Using Truefoundry, you can configure the rollout strategy according to your needs. The different rollout strategies and their configuration are described below.

Rolling Update Strategy

Rolling Update is a deployment strategy that seamlessly replaces the old version of your application with the new one, ensuring a gradual transition without disruptions. In this case, we want to be able to configure how many replicas of the old version get replaced with the new version and whether we create new replicas of the new version before shutting off the replicas of the older version. This can be configured using the following parameters:

  • Max Surge(%): This defines the max number of replicas of the older and newer version of the application that can be live at any point during the rollout. This is expressed as a percentage of the current number of the replicas. If the current number of replicas is 10 and we set max surge to 20%, this means that 12 replicas can at max be live any point. What this implies is the system will bring up 2 replicas of v2 before shutting down 2 v1 replicas.
  • Max Unavailable(%): This defines how many minimum replicas (old + new version combined) should be available during the rollout. It is expressed as a percentage of the current number of the replicas.. For example, if replica count is set to 12 for the deployment and max unavailable is set to 25%, then minimum of 9 replicas need to be live at any point during the rollout. What this means is the system can bring down 3 replicas of v1 before starting the 3 replicas of v2.

Let's assume we are upgrading the deployment from v1 to v2 assuming both Max surge and max unavailable are set to 25% and initial replicas are 12.

Stepv1 Replicasv2 ReplicasTotal Replicas
Before rollout12012
It will bring down 3 v1 replicas because minimum available replica(max unavailable) should be 9.909
It will bring up 6 v2 replicas because maximum available replicas(max surge) can be 159615
Once all v2 replicas are available then it will again bring down 6 v1 replicas369
It will bring up 3 v2 replicas because the total should be 12 at rollout completion01212
Rollout Complete01212

A good default startegy is to keep max_unavailable at 10% and max_surge at 10%. If you are working with GPU machines and don't want additional machines to come up during rollout, you can set max_surge to 0% and max_unavailable to 20%

Configuring Rolling Update for an Application

Via the User Interface (UI)

Via the Python SDK

In your Service deployment code deploy.py, include the following:

from truefoundry.deploy import Build, Service, DockerFileBuild, Rolling

service = Service(
    name="my-service",
    image=Build(build_spec=DockerFileBuild()),
    ports=[
      Port(
        host="your_host",
        port=8501
      )
    ]
+    rollout_strategy=Rolling(
+        max_surge_percentage=10,
+        max_unavailable_percentage=25,
+    )
)
service.deploy(workspace_fqn="YOUR_WORKSPACE_FQN")
name: my-service
type: service
image:
  type: build
  build_source:
    type: local
  build_spec:
    type: dockerfile
ports:
 - port: 8501
 	 host: your_host
+rollout_strategy:
+  type: rolling_update
+  max_surge_percentage: 10
+  max_unavailable_percentage: 25

One limitation that we have in rolling update is that we cannot control how much traffic goes to v1 and v2. This is where Canary rollout comes into play.

Canary

Canary deployment is a cautious approach to rolling out updates where a small subset of users gets to try the new version first before it's made available to everyone. It is ideal for minimizing risks associated with major updates to test the new deployment before a full rollout.

Some key terms to get comfortable with before we can understand this strategy.

  1. Canary Steps: You can have multiple stages/steps in your canary rollout. You get absolute control control at each step regarding how much traffic should be redirected to the new version and should it automatically move to the next step after a certain time interval or a manual trigger is needed to move it to the next step.
  2. Set traffic weight to: This is the percentage of traffic that should be redirected to the new version at each stage.
  3. Pause for: This helps to define if the rollout should move to next step after a certain time or a manual trigger is needed.

Let's assume we are upgrading the deployment from v1 to v2, assuming we want to:

  1. Deploy total 10% traffic to new version and wait for 60 seconds.
  2. Deploy total 30% traffic to new version and wait until we manually trigger to next step.
  3. Deploy total 80% traffic to new version and wait for 30 seconds.
Stepsv1 Traffic(%)v2 Traffic(%)Pause For(seconds)
Before rollout100%NA
1 - Automatic90%10%60
2 - Automatic70%30%Pause Indefintely
3- Manual trigger next step20%80%30
Rollout complete0%100%NA

Configuring Canary for an Application

Via the User Interface (UI)

Via the Python SDK

In your Service deployment code deploy.py, include the following:

from truefoundry.deploy import Build, Service, DockerFileBuild, Canary, CanaryStep

service = Service(
    name="my-service",
    image=Build(build_spec=DockerFileBuild()),
    ports=[
      Port(
        host="your_host",
        port=8501
      )
    ]
+    rollout_strategy=Canary(
+        steps=[
+            CanaryStep(weight_percentage=25, pause_duration=30),
+            CanaryStep(weight_percentage=50, pause_duration=30),
+            CanaryStep(weight_percentage=75, pause_duration=30)
+        ]
+    )
)
service.deploy(workspace_fqn="YOUR_WORKSPACE_FQN")

name: my-service
type: service
image:
  type: build
  build_source:
    type: local
  build_spec:
    type: dockerfile
ports:
 - port: 8501
 	 host: your_host
+rollout_strategy:
+  type: canary
+  steps:
+    - weight_percentage: 25
+      pause_duration: 30
+    - weight_percentage: 50
+      pause_duration: 30
+    - weight_percentage: 75
+      pause_duration: 30

Blue-Green:

Blue-Green is a strategy that ensures a seamless transition between different versions by maintaining two identical environments: one for the current version (blue) and one for the new version (green). It minimises downtime and provides an instant fallback mechanism in case issues arise in the new version.

Let's assume we are upgrading the deployment from v1(blue) to v2(green):

  1. We will create v2 as a full clone of v1. If there were 12 replicas in v1 then we create exact 12 replicas in v2.
  2. Once all the pods are ready in v2, we move full traffic to v2.
  3. This moving traffic from v1 to v2 can be automatic or it can be manual depending on what we configure.

Blue Green Configuration

For configuring Blue Green, you can adjust the following settings:

  • Auto-promotion: Toggle this setting to automatically promote the new version to active once it passes all tests and checks.
  • Auto-promotion seconds: Specify the duration, in seconds, to wait before automatically promoting the new version to active.

Configuring Rolling Update for an Application

Via the User Interface (UI)

Via the Python SDK

In your Service deployment code deploy.py, include the following:

from truefoundry.deploy import Build, Service, DockerFileBuild, BlueGreen

service = Service(
    name="my-service",
    image=Build(build_spec=DockerFileBuild()),
    ports=[
      Port(
        host="your_host",
        port=8501
      )
    ]
+    rollout_strategy=BlueGreen(
+        enable_auto_promotion=True,
+        auto_promotion_seconds=30
+    )

  
)
service.deploy(workspace_fqn="YOUR_WORKSPACE_FQN")
name: my-service
type: service
image:
  type: build
  build_source:
    type: local
  build_spec:
    type: dockerfile
ports:
 - port: 8501
 	 host: your_host
+rollout_strategy:
+  type: blue_green
+  enable_auto_promotion: true
+  auto_promotion_seconds: 30

One drawback of using this strategy is heavy cost. If you had 12 replicas in v1 then 12 v2 replicas are brought up so you have to pay cost for a total of 24 replicas until the rollout is complete.