Download Models and Artifacts

When you deploy a model as an API, the model file must be available to the service so that it can be loaded and served. There are three common ways to do this:

Bake the model file into the docker image

This works, but it has the following disadvantages:

  1. The docker image size can become huge if the model file is big.
  2. We will need to rebuild the docker image if the model file changes.
  3. Every time a pod comes up on a new machine, the image has to be pulled there, so the model file is downloaded again.

Download the model file from S3 at the start of the server

  1. In this case, every time a pod comes up, it has to download the model first, which leads to a high startup time.
  2. It also means more networking costs every time the models are downloaded.

Create a volume, download the model file on the volume, and mount the volume to your service

This approach has a lot of benefits:

  1. A smaller docker image leads to faster startup.
  2. Loading the model from the volume is faster, which also shortens startup time.
  3. Costs are lower because the model is not downloaded repeatedly.
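The volume approach boils down to "download once, reuse on every start". The sketch below illustrates that idea only. The function name `ensure_model_on_volume` and the `download` callable are hypothetical, and the real download (for example from S3) is left to whatever client you already use.

```python
import os
from typing import Callable


def ensure_model_on_volume(volume_dir: str, filename: str,
                           download: Callable[[str], None]) -> str:
    """Return the model path on the volume, downloading only if it is missing."""
    target = os.path.join(volume_dir, filename)
    if not os.path.exists(target):
        os.makedirs(volume_dir, exist_ok=True)
        # Write to a temporary name first so that a half-written file is
        # never mistaken for a complete model.
        partial = target + ".partial"
        download(partial)  # hypothetical callable: writes the model to this path
        os.replace(partial, target)
    return target
```

Calling this at service startup with the mounted volume path means only the first pod pays the download cost; later pods find the file and load it directly.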

However, this is cumbersome to set up because every step has to be done manually. You also need to be careful when upgrading from one model version to another. The key issues are:

  1. Upgrading requires manual intervention of having to download the new version of the model on the volume.
  2. If we swap the older model on the volume with the newer model, and a pod restarts just after that, it will lead to one pod using the newer version of the model, even though we haven't started the deployment process.

To overcome the above issues, you can use the Artifacts Download feature in Truefoundry which automatically provisions a volume, downloads the model onto the volume, mounts the volume and also handles upgrades safely. The way this works during the first deployment is:

As you can see in the diagram above, the user defines the link to the model to be downloaded and the name of an environment variable in which Truefoundry provides the path where the model is present. Passing the path through an environment variable is what makes a smooth and failsafe upgrade process possible. When you deploy again, the following flow is followed:

In this scenario, the model is not downloaded if there is no change in the artifact version. However, if there is a change, Truefoundry downloads the new model at a different path and then restarts all the servers with the new path in the environment variable. It also cleans up the older model once the rollout is fully complete. This makes it easy to switch back to the older version during a rollout if anything goes wrong.
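The upgrade flow described above can be sketched as follows. This is only an illustration of the "one directory per version" idea, not Truefoundry's actual implementation; all function names and the directory layout are assumptions made for the example.

```python
import os
import shutil
from typing import Callable, Optional


def path_for_version(cache_root: str, version: str) -> str:
    """Each artifact version gets its own directory on the volume (assumed layout)."""
    return os.path.join(cache_root, version)


def prepare_version(cache_root: str, current: Optional[str], new: str,
                    download: Callable[[str], None]) -> str:
    """Return the path to publish in the environment variable for `new`."""
    new_path = path_for_version(cache_root, new)
    if new == current and os.path.isdir(new_path):
        return new_path  # unchanged version: nothing is downloaded
    os.makedirs(new_path, exist_ok=True)
    download(new_path)  # the old version's directory is left untouched
    return new_path


def cleanup_after_rollout(cache_root: str, old: Optional[str], new: str) -> None:
    """Remove the previous version only once the rollout has fully completed."""
    if old is not None and old != new:
        shutil.rmtree(path_for_version(cache_root, old), ignore_errors=True)
```

Because the old directory survives until `cleanup_after_rollout` runs, pods that restart mid-rollout with the old environment variable value still find the old model, and rolling back only requires pointing the variable at the old path again.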

Download and Cache Models through the User Interface

Download and Cache Models using the Python SDK

If you use the Python SDK to manage deployments, add the artifacts_download section shown below to the Service definition in your deploy.py file:

from servicefoundry import Build, Service, DockerFileBuild, Port, ArtifactsDownload, ArtifactsCacheVolume, HuggingfaceArtifactSource, TruefoundryArtifactSource

service = Service(
    name="my-service",
    image=Build(build_spec=DockerFileBuild()),
    ports=[
      Port(
        host="your_host",
        port=8501
      )
    ],
    # Added for artifact download and caching: each artifact is downloaded to
    # the cache volume and its path is exposed in the named environment variable.
    artifacts_download=ArtifactsDownload(
        artifacts=[
            HuggingfaceArtifactSource(
                model_id="NousResearch/Llama-2-7b-hf",
                revision="dacdfcde31297e34b19ee0e7532f29586d2c17bc",
                download_path_env_variable="MODEL_ID_1",
            ),
            TruefoundryArtifactSource(
                artifact_version_fqn="artifact:truefoundry/your-ml-repo/your-artifact:1",
                download_path_env_variable="MODEL_ID_2",
            ),
        ],
        cache_volume=ArtifactsCacheVolume(
            storage_class="azureblob-nfs-premium",
            cache_size=200,
        ),
    ),
)
service.deploy(workspace_fqn="YOUR_WORKSPACE_FQN")

Using the Cached Model

To use the cached model in your service code, read the Download Path Environment Variable that you specified when configuring artifact download and caching.

You can do this in one of two ways: read it directly in your code using os.environ.get(), or pass it as a command-line argument if your service expects it that way.

Direct Reference

For example, if you specify MODEL_ID_1 as the Download Path Environment Variable and the downloaded artifact contains a model.joblib file, you can load the cached model like this:

import os
import joblib

# Get the model_path using the environment variable
model_path_1 = os.environ.get("MODEL_ID_1")

# Load the model using the specified path
model_1 = joblib.load(os.path.join(model_path_1, "model.joblib"))
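Note that os.environ.get() returns None when the variable is not set, and os.path.join() then fails with a less obvious TypeError. A slightly more defensive variant, shown here as a sketch with a hypothetical helper name `resolve_model_file`, checks both the variable and the file before loading:

```python
import os


def resolve_model_file(env_var: str, filename: str) -> str:
    """Build the full path to a cached model file, failing with a clear message."""
    model_dir = os.environ.get(env_var)
    if not model_dir:
        raise RuntimeError(f"Environment variable {env_var} is not set")
    path = os.path.join(model_dir, filename)
    if not os.path.isfile(path):
        raise FileNotFoundError(f"{filename} not found under {model_dir}")
    return path
```

The result can be passed straight to the loader, for example `joblib.load(resolve_model_file("MODEL_ID_1", "model.joblib"))`.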

Command Line Argument

For example, suppose you are deploying an LLM on top of the vLLM model server using the vllm/vllm-openai:v0.2.6 docker image. This server takes the model and tokenizer as command-line arguments.

You can pass the Download Path Environment Variable in the arguments.
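As a rough illustration, the argument list could be assembled from the environment variable like this. The sketch assumes the server is started through the `vllm.entrypoints.openai.api_server` module with its `--model` and `--tokenizer` flags, and that MODEL_ID_1 is the Download Path Environment Variable; the helper name `build_vllm_args` is hypothetical, and the exact command for your image may differ.

```python
from typing import List, Mapping


def build_vllm_args(env: Mapping[str, str], env_var: str = "MODEL_ID_1") -> List[str]:
    """Assemble a vLLM server command that points at the cached model path."""
    model_path = env[env_var]  # a KeyError here means the variable was not injected
    return [
        "python3", "-m", "vllm.entrypoints.openai.api_server",
        "--model", model_path,
        "--tokenizer", model_path,  # tokenizer files live alongside the model
    ]
```

Passing `os.environ` as `env` yields a list that can be handed to `subprocess.run` or joined into the command field of the deployment.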

Download Private Hugging Face Hub Models

To download a non-public Hugging Face Hub model, such as https://huggingface.co/meta-llama/Llama-2-7b-chat-hf, make sure to include your Hugging Face Hub token in the environment section.

You can obtain your token by visiting https://huggingface.co/settings/tokens.

Through User Interface

Locate the Environment Variables section in the deployment form, and add the token under the key HUGGING_FACE_HUB_TOKEN.

Through Python SDK

from servicefoundry import Build, Service, DockerFileBuild, Port, ArtifactsDownload, HuggingfaceArtifactSource

service = Service(
    name="my-service",
    image=Build(build_spec=DockerFileBuild()),
    ports=[
      Port(
        host="your_host",
        port=8501
      )
    ],
    artifacts_download=ArtifactsDownload(
        artifacts=[
            HuggingfaceArtifactSource(
                model_id="meta-llama/Llama-2-7b-chat-hf",
                revision="<revision id>",
                download_path_env_variable="MODEL_ID",
            ),
        ],
    ),
    # Added so that the private model can be downloaded with your token.
    env={
        "HUGGING_FACE_HUB_TOKEN": "<paste your token here>",
    },
)
service.deploy(workspace_fqn="YOUR_WORKSPACE_FQN")
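Pasting the token into deploy.py makes it easy to commit by accident. One alternative, sketched here under the assumption that the token is exported in your local shell as HF_TOKEN (a name chosen for this example, as is the helper `hub_token_env`), is to read it at deploy time and pass the resulting dictionary as the env argument:

```python
from typing import Dict, Mapping


def hub_token_env(local_env: Mapping[str, str],
                  source_var: str = "HF_TOKEN") -> Dict[str, str]:
    """Build the service env mapping from a token held in the local environment."""
    token = local_env.get(source_var, "").strip()
    if not token:
        raise RuntimeError(f"Set {source_var} in your shell before deploying")
    # HUGGING_FACE_HUB_TOKEN is the key used in the Service definition above.
    return {"HUGGING_FACE_HUB_TOKEN": token}
```

In deploy.py this would be used as `env=hub_token_env(os.environ)`.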