LitServe is a lightweight, fast inference server for machine learning models. It is a good alternative to plain FastAPI if you need dynamic batching, and it adds higher-level abstractions for authentication, middleware, an OpenAI-compatible spec, streaming and more, all built on top of FastAPI.

In this example, we will deploy a simple Whisper model (speech-to-text) using faster-whisper and LitServe. You can find the code for this example here.

Live Demo

You can view this example deployed here.

The key files are:

  • whisper_server.py: Contains the WhisperLitAPI that implements the LitAPI interface.
  • requirements.txt: Contains the dependencies.

How to write the inference function in LitServe

The whisper_server.py file contains the WhisperLitAPI class that implements the LitAPI interface.

import os

import litserve as ls
from fastapi import UploadFile
from faster_whisper import WhisperModel

def _get_model_dir():
    if "MODEL_DIR" not in os.environ:
        raise Exception(
            "MODEL_DIR environment variable is not set. Please set it to the directory containing the model."
        )
    return os.environ["MODEL_DIR"]

MODEL_DIR = _get_model_dir()

class WhisperLitAPI(ls.LitAPI):
    def setup(self, device):
        # Load the faster-whisper model from the directory (or model ID) in MODEL_DIR
        self.model = WhisperModel(MODEL_DIR, device="cpu")

    def decode_request(self, request: UploadFile):
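        # Return the file-like object of the uploaded audio so predict can stream it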
        return request.file

    def predict(self, audio_stream):
        # Process the audio stream bytes and return the transcription result
        output = []
        segments, _ = self.model.transcribe(audio_stream)
        for segment in segments:
            output.append({"start": segment.start, "end": segment.end, "text": segment.text})
        return output

    def encode_response(self, output):
        # Output is already a list of dicts. So we can return it as is.
        return output

In essence, we inherit from the LitAPI class and implement the setup, decode_request, predict and encode_response methods:

  • setup: Load the model.
  • decode_request: Decodes and transforms the request body to the input format expected by the model.
  • predict: Processes the output of decode_request and runs model inference.
  • encode_response: Formats the response. Can perform any postprocessing on the response.

See LitServe Docs for reference and more examples.
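The rest of whisper_server.py (elided in the excerpt above) wires the API into a server. The exact launch code isn't shown here, but a minimal sketch, assuming the standard litserve.LitServer entrypoint, looks like this:

if __name__ == "__main__":
    # Sketch of the server launch, not the exact code from the repository.
    api = WhisperLitAPI()
    server = ls.LitServer(api, accelerator="cpu")
    # LitServe's dynamic batching can be enabled by additionally passing
    # max_batch_size / batch_timeout here (see the LitServe docs).
    server.run(port=8000)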

Load the model from the path specified in the MODEL_DIR environment variable

We use the MODEL_DIR environment variable to read the model path. This is useful because the model can be downloaded and cached separately from the code, and the same code can load it from whichever path it lands in. See the Cache Models and Artifacts guide for more details.
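For example, during local development you can pre-download the model into a local directory and point MODEL_DIR at it. The sketch below uses huggingface_hub's snapshot_download for illustration (it assumes huggingface_hub is installed; the model ID matches the one used later in this guide):

import os

from huggingface_hub import snapshot_download

# Download (or reuse a cached copy of) the model files into a local directory.
local_dir = snapshot_download(repo_id="Systran/faster-whisper-tiny")

# Point the server at the downloaded files.
os.environ["MODEL_DIR"] = local_dir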

Running the server locally

  1. Install the dependencies

pip install -r requirements.txt

  2. Run the server

export MODEL_DIR="Systran/faster-whisper-tiny"
python whisper_server.py

  3. Test the server

curl -X POST http://0.0.0.0:8000/predict -F "request=@./audio.mp3"

We’ll see something like this:

[
    {"start":0.0,"end":5.0,"text":" Oh, you think darkness is your ally."},
    {"start":5.0,"end":8.0,"text":" Are you merely adopted the dark?"},
    {"start":8.0,"end":11.0,"text":" I was born in it."},
    {"start":11.0,"end":14.0,"text":" More lit by it."},
    {"start":14.0,"end":17.0,"text":" I didn't see the light until I was already a man,"},
    {"start":17.0,"end":20.0,"text":" but then it was nothing to me but brightened."}
]
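
You can also call the endpoint from Python instead of curl. Here is a minimal sketch using the requests library, assuming the server is running locally on port 8000 and audio.mp3 exists in the current directory:

import requests

# Send the audio file as multipart form data under the "request" field,
# matching the curl command above.
with open("audio.mp3", "rb") as f:
    response = requests.post(
        "http://0.0.0.0:8000/predict",
        files={"request": f},
    )

print(response.json())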

Deploying to TrueFoundry

Since the model is pulled from the HuggingFace Hub, we can deploy the code directly to TrueFoundry and use the Artifacts Download feature to automatically download the model.

You can also log the model to the TrueFoundry Model Registry and use it as the source for automatic model download.

1. Push the code to a Git repository or directly deploy from your local machine

Once you have tested your code locally, we highly recommend pushing the code to a Git repository. This allows you to version control the code and also makes the deployment process much easier. However, if you don’t have access to a Git repository, or your Git repositories are not integrated with TrueFoundry, you can deploy directly from your local machine. You can follow the guide here to deploy your code.

Configure PythonBuild

2. Download Model from HuggingFace Hub in the deployment configuration

Add the model ID and revision from the HuggingFace Hub in the Artifacts Download section.

3. View the deployment, logs and metrics

Once the deployment goes through, you can view the deployment, the pods, logs, metrics and events to debug any issues.