Deploying any model with FastAPI
- iris_classifier.joblib: The model file
- server.py: The main FastAPI code that loads the model and provides a REST API to serve the model
- requirements.txt: Contains the dependencies

server.py
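A minimal sketch of what server.py could look like, assuming a scikit-learn classifier saved with joblib and an illustrative /predict endpoint (the actual route and payload shape may differ):

```python
# Run locally with, e.g.: uvicorn server:app --host 0.0.0.0 --port 8000
import os

import joblib
import numpy as np
from fastapi import FastAPI
from pydantic import BaseModel

# MODEL_DIR is read from the environment so the model can be
# downloaded and cached to any path before the server starts.
MODEL_DIR = os.environ.get("MODEL_DIR", ".")
model = joblib.load(os.path.join(MODEL_DIR, "iris_classifier.joblib"))

app = FastAPI()


class Instance(BaseModel):
    sepal_length: float
    sepal_width: float
    petal_length: float
    petal_width: float


@app.post("/predict")
def predict(instance: Instance):
    # Plain `def` (not `async def`): inference is compute bound, and
    # Starlette runs sync endpoints in a thread pool so the event loop
    # is not blocked.
    features = np.array(
        [[instance.sepal_length, instance.sepal_width,
          instance.petal_length, instance.petal_width]]
    )
    prediction = model.predict(features)[0]
    return {"prediction": int(prediction)}
```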
MODEL_DIR environment variable

We are using an environment variable MODEL_DIR to read the model path. This is useful when we want to load a model from a different path that can be downloaded and cached separately. See the Cache Models and Artifacts guide for more details.

async def vs def
FastAPI is designed to be an async web framework. This means that the server can handle multiple requests concurrently.
When using async def, it is the developer's responsibility to make sure the function does not block the event loop. However, most ML inference is compute bound and blocks the event loop, so use def instead of async def for inference functions. Generally, if you are not sure, just use def. FastAPI (mainly Starlette, which is the foundation of FastAPI) automatically runs sync functions in a thread pool so they do not block the event loop.
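As a small illustration (the endpoint names here are hypothetical), blocking work belongs in a def endpoint, while async def is reserved for non-blocking code:

```python
import time

from fastapi import FastAPI

app = FastAPI()


@app.get("/healthz")
async def healthz():
    # Safe as async def: returns immediately and never blocks the event loop.
    return {"status": "ok"}


@app.post("/slow-inference")
def slow_inference():
    # Compute-bound / blocking work: declared with plain `def`, so
    # Starlette executes it in the thread pool instead of the event loop.
    time.sleep(2)  # stand-in for a heavy model.predict(...) call
    return {"done": True}
```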
Once the server is running, open http://localhost:8000/docs to try out our API.
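You can also call the endpoint directly; the snippet below assumes the /predict route and payload from the sketch above:

```python
import requests

# Hypothetical payload matching the /predict sketch above.
payload = {
    "sepal_length": 5.1,
    "sepal_width": 3.5,
    "petal_length": 1.4,
    "petal_width": 0.2,
}

response = requests.post("http://localhost:8000/predict", json=payload)
response.raise_for_status()
print(response.json())  # e.g. {"prediction": 0}
```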
Log the Model To Model Registry
Push the code to a Git repository or directly deploy from local machine
0.0.0.0 in command

Please make sure you put --bind 0.0.0.0:8000 in the command. By default, gunicorn binds to 127.0.0.1. To make it accessible from outside the environment, we need to bind to 0.0.0.0.

gunicorn is being told to bind on 8000 in command, hence we use 8000 in ports.
Download Model from Model Registry in the deployment configuration
We configure the deployment to download the model from the Model Registry and pass the MODEL_DIR environment variable to the deployed service. To enable this, you can modify the deployment configuration accordingly.

View the deployment, logs and metrics
How do I decide the resources for the deployment?
How many Gunicorn workers should I use?
Can the service handle more than 1 request at a time?
Yes. Since the inference endpoint is a plain def function, Starlette runs it in a thread pool, which allows 40 threads by default. While this is useful for running many requests concurrently, ML inference is compute bound; if our resources are not well tuned, we might hit CPU throttling or OOM.
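If you want to tune that concurrency to match your CPU and memory, one option (a sketch, not something this guide prescribes) is to resize Starlette's default thread limiter at startup; the value 8 below is arbitrary:

```python
from contextlib import asynccontextmanager

from anyio import to_thread
from fastapi import FastAPI


@asynccontextmanager
async def lifespan(app: FastAPI):
    # Starlette runs sync endpoints through AnyIO's default thread
    # limiter, which allows 40 threads by default. Lowering it caps how
    # many inferences run at once; 8 is an arbitrary illustrative value.
    to_thread.current_default_thread_limiter().total_tokens = 8
    yield


app = FastAPI(lifespan=lifespan)
```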
Can I use gevent or eventlet?

We recommend not using gevent or eventlet, as they are not compatible with asyncio. Several machine learning libraries also use thread pools and process pools, which can cause issues with gevent or eventlet.

Can I deploy the same service using Flask + Gunicorn?
Yes. Use the sync worker class (it is the default worker class) instead of gevent or eventlet, and use the --workers argument to increase parallelism per container. Keep --workers at least 2 so that the health check and inference can run concurrently. For scaling beyond that, use Service replicas.
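For reference, a minimal Flask equivalent might look like the sketch below (the route, payload, and module name are assumptions mirroring the FastAPI example above); with Gunicorn's default sync workers, each worker handles one request at a time.

```python
# Run with, e.g.: gunicorn --workers 2 --bind 0.0.0.0:8000 flask_server:app
import os

import joblib
import numpy as np
from flask import Flask, jsonify, request

MODEL_DIR = os.environ.get("MODEL_DIR", ".")
model = joblib.load(os.path.join(MODEL_DIR, "iris_classifier.joblib"))

app = Flask(__name__)


@app.post("/predict")
def predict():
    # Gunicorn's default sync worker handles one request per worker at a
    # time, so --workers controls parallelism within the container.
    data = request.get_json()
    features = np.array([[data["sepal_length"], data["sepal_width"],
                          data["petal_length"], data["petal_width"]]])
    return jsonify({"prediction": int(model.predict(features)[0])})
```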