Model Deployment FAQs

How do I get custom metrics of my deployed model?

To achieve this, you need to log custom metrics in your code, which can then be plotted in Grafana. TrueFoundry uses Prometheus by default to scrape all metrics exposed at the /metrics endpoint of your server.

You can export metrics to Grafana and add a new Visualization to your service dashboard.

For detailed steps, refer to the documentation here
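
As a rough, dependency-free sketch of what "exposing custom metrics" can look like, the snippet below keeps a counter in memory and renders it in the Prometheus text exposition format for a /metrics endpoint. The metric name inference_requests_total, the model label, the helper names, and port 8000 are all illustrative assumptions; in a real service a client library such as prometheus_client is the more usual choice.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical custom metric: inference requests served, keyed by model name.
_lock = threading.Lock()
_request_counts = {}


def record_request(model):
    """Increment the in-memory counter for one model."""
    with _lock:
        _request_counts[model] = _request_counts.get(model, 0) + 1


def render_metrics():
    """Render the counters in the Prometheus text exposition format."""
    lines = [
        "# HELP inference_requests_total Total inference requests served.",
        "# TYPE inference_requests_total counter",
    ]
    with _lock:
        for model, count in sorted(_request_counts.items()):
            lines.append(f'inference_requests_total{{model="{model}"}} {count}')
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serves the rendered metrics at /metrics and 404s everything else."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block forever serving /metrics (the port is an assumption)."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

Calling record_request("my-model") from the prediction path and running serve() in a background thread would make the counter available for scraping.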

Can I deploy a model trained on SageMaker to TrueFoundry?

TrueFoundry offers a straightforward migration path for PyTorch endpoints currently running on SageMaker. You start from your existing custom inference script and follow a short workflow to make the necessary changes. Follow the guide here

How do I promote an application from dev/staging to the production environment?

Promotion allows you to transfer changes from lower-level environments, such as development or staging, to higher-level environments like production.

Ensure you have the necessary permissions for both the dev/staging workspace and the target production workspace.

Click on the ellipsis next to your application and select "Promote." Choose your target workspace and application. Then review all differences in code, environment variables, and deployment configurations, and click "Submit" to initiate the promotion process. Documentation here

How to perform A/B testing of a new model version?

A/B testing can be done with a feature called 'Intercepts'. It provides a way to attach ad hoc routing rules to a service deployed on the TrueFoundry platform.

Use it to redirect traffic arriving at a service to another service based on a request header. This is useful for testing a new release: you intercept traffic on the main service and redirect the matching requests to a copy of the service running the new version. Read in detail about the implementation here.
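
To make the idea of header-based redirection concrete, here is a small conceptual sketch in Python. It is not TrueFoundry's configuration format or implementation; the InterceptRule class, the x-model-version header, and the service names are hypothetical and only illustrate how a rule of the form "if header equals value, send to another service" behaves.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InterceptRule:
    header: str   # header name to inspect (hypothetical example)
    value: str    # exact value that triggers the redirect
    target: str   # service that receives matching requests


def pick_service(headers, rules, default):
    """Return the target of the first matching rule, else the default service."""
    # HTTP header names are case-insensitive, so normalise before comparing.
    normalised = {name.lower(): value for name, value in headers.items()}
    for rule in rules:
        if normalised.get(rule.header.lower()) == rule.value:
            return rule.target
    return default


# Requests carrying "x-model-version: v2" go to the copied service.
rules = [InterceptRule(header="x-model-version", value="v2", target="model-v2")]
```

With these rules, only clients that opt in by sending the header reach the new version, while all other traffic stays on the main service.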

How to do load testing for your service?

To perform load testing on your deployed service, you can use Locust, an open-source tool for benchmarking services. For step-by-step instructions, follow this guide.

How to run a batch job/inference on a trained model?

'Jobs' can be used to run batch inference for a trained model. They scale and manage resources efficiently, which lets you process large datasets and run workloads in parallel.

My traffic outside business hours is really low. Can I autoscale accordingly?

Yes, time-based autoscaling is supported. It is particularly useful for optimizing resource utilization and cost, because replicas can be scaled up or down according to anticipated workload patterns throughout the day or week.

For example, you can use cron schedules to scale up replicas from Monday to Friday between 9 AM and 9 PM. Autoscaling can also be driven by CPU usage or RPS (requests per second). You can configure autoscaling via the UI or the Python SDK.
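
The Monday to Friday, 9 AM to 9 PM window can be sketched as follows. This is only an illustration of the schedule logic, not the platform's autoscaling API; the replica counts (peak and off_peak) are hypothetical, and the cron expressions in the comments are the standard five-field form of the same window.

```python
from datetime import datetime

# Assumed schedule: Monday-Friday, 9 AM to 9 PM.
# As cron expressions: scale up at "0 9 * * 1-5", scale down at "0 21 * * 1-5".
BUSINESS_DAYS = range(0, 5)   # datetime.weekday(): Monday is 0, Friday is 4
START_HOUR, END_HOUR = 9, 21


def desired_replicas(now, peak=3, off_peak=1):
    """Return the replica count for a timestamp (peak/off_peak are hypothetical)."""
    in_window = now.weekday() in BUSINESS_DAYS and START_HOUR <= now.hour < END_HOUR
    return peak if in_window else off_peak
```

For instance, desired_replicas(datetime(2024, 1, 1, 10, 0)) falls on a Monday morning and returns the peak count, while any timestamp on a Saturday returns the off-peak count.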

For a deployment using spot (with fallback to on-demand), will the service be unavailable while the fallback happens?

If a service is initially scheduled on a spot instance and that spot instance becomes unavailable, the service will attempt to launch another spot instance. If no spot instance is available, the deployment falls back to bringing up an on-demand instance. During this process, the service might be temporarily unavailable. For critical production services, it is advisable to deploy multiple replicas, which reduces the likelihood of all instances failing simultaneously. Also, please configure liveness and readiness probes:

Liveness probe: Checks whether the service's replica is currently healthy. If the service is not healthy, the replica will be terminated and another one will be started.
Readiness probe: Checks whether the service's replica is ready to receive traffic. Until the readiness probe succeeds, no incoming traffic will be routed to this replica.
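
A minimal sketch of the two probe endpoints on the service side, using only the Python standard library, might look like the following. The paths /livez and /readyz, the model_loaded flag, and port 8000 are assumptions chosen for illustration; the probe configuration on the platform would need to point at whatever paths your service actually exposes.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Set once startup work (for example, loading model weights) has finished.
model_loaded = threading.Event()


def probe_status(path):
    """Map a probe path to an HTTP status code."""
    if path == "/livez":
        return 200  # the process is up and able to respond
    if path == "/readyz":
        # Not ready until the model is loaded, so no traffic is routed yet.
        return 200 if model_loaded.is_set() else 503
    return 404


class ProbeHandler(BaseHTTPRequestHandler):
    """Answers probe requests with the status computed above."""

    def do_GET(self):
        self.send_response(probe_status(self.path))
        self.end_headers()


def serve(port=8000):
    """Block forever serving the probe endpoints (the port is an assumption)."""
    HTTPServer(("0.0.0.0", port), ProbeHandler).serve_forever()
```

Keeping the readiness check separate from the liveness check means a replica that is still loading its model is held out of rotation rather than restarted.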

What's the expected time to scale an additional replica?

New CPU machines typically start within 10-30 seconds, while GPU machines may take around 1 minute. On top of that comes the time to download the Docker image and the service's initialization period before it becomes ready. If the new replica is placed on a machine that already runs the service, the Docker image might already be cached, which removes the download time. A more precise estimate depends on the specifics of the service, such as its image size and startup work.
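
The arithmetic behind such an estimate is a simple sum of phases, sketched below. The function name and its parameters are hypothetical, and the values you pass in for image pull and initialization have to come from measuring your own service.

```python
def estimate_scale_up_seconds(node_start, image_pull, service_init, image_cached=False):
    """Sum the phases a new replica goes through before it can serve traffic.

    node_start:   seconds to bring up a new machine (0 if an existing one is reused)
    image_pull:   seconds to download the Docker image
    service_init: seconds until the service reports ready
    image_cached: skip the pull when the image is already on the machine
    """
    pull = 0 if image_cached else image_pull
    return node_start + pull + service_init
```

For example, a 30-second node start, a 45-second image pull, and a 20-second initialization (the last two being made-up numbers) add up to 95 seconds, or 20 seconds if the replica lands on an existing machine with the image cached.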

What are the best practices for choosing between RPS (Requests Per Second) and CPU-based autoscaling?

When deciding between RPS (Requests Per Second) and CPU-based autoscaling, the key is understanding how your service reacts to increased load:

RPS-Based Autoscaling: If higher traffic doesn’t significantly impact CPU usage, RPS is the preferred metric, as it’s easier to manage and reason about. This scenario is common in applications that are I/O bound, where the service can handle more requests without needing additional CPU resources.
CPU-Based Autoscaling: If increasing RPS leads to a proportional increase in CPU usage, CPU-based scaling becomes more effective.
The best practice is to observe traffic patterns and CPU usage to choose the most efficient autoscaling strategy.
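
One way to turn that observation into a rough check is to measure how closely CPU usage tracks RPS over a set of samples. The sketch below uses the Pearson correlation as a stand-in for "CPU rises in proportion to RPS"; the 0.8 threshold and the function names are assumptions for illustration, not a platform recommendation.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series carries no linear relationship
    return cov / math.sqrt(var_x * var_y)


def suggest_metric(rps, cpu, threshold=0.8):
    """Suggest 'cpu' when CPU usage tracks RPS closely, otherwise 'rps'."""
    return "cpu" if pearson(rps, cpu) >= threshold else "rps"
```

Feeding it paired samples taken at the same timestamps (RPS and CPU percentage) gives a first hint; a flat CPU series under rising RPS, typical of I/O-bound services, yields "rps".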