Deploy an ML Model as a Serverless API with Modal

Use the Modal Python client to package a pre-trained Hugging Face model into a scalable, serverless API endpoint. This action pack shows how to define dependencies, load a model efficiently, and create a web endpoint for inference.

Beginner | 15 min | 5 steps

The play

Install and Authenticate Modal
First, install the Modal client library with `pip install modal`. Then run `modal setup` (or `python -m modal setup`) to create an authentication token that connects your local environment to your Modal account. This is a one-time setup.
Define App and Dependencies
Create a Python file (for example, `deploy_model.py`). In it, define a Modal App and list the Python libraries your model needs, such as `torch` and `transformers`, plus `fastapi[standard]`, which web endpoints require. When you deploy, Modal builds a container image with these dependencies installed.
Load the Model on Startup
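To avoid reloading the model on every request, load it once when the container starts. Put the model in a class decorated with `@app.cls()` and mark the loading method with `@modal.enter()`. Modal runs that method once per container, before the container serves any requests, so the model weights are downloaded and in memory when inference requests arrive. This does not eliminate cold starts (the first request to a new container still waits for the load), but every later request served by that container skips it.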
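To see why the enter hook matters, here is a minimal local simulation of the load-once pattern. It does not use Modal or transformers: `fake_classifier`, `FakeSentimentModel`, and `simulate_container` are hypothetical stand-ins, and the counter simply records how often the expensive load runs while one container serves several requests.

```python
import time


def fake_classifier(text):
    """Hypothetical stand-in for the transformers pipeline: returns a list of dicts."""
    label = "POSITIVE" if "good" in text.lower() else "NEGATIVE"
    return [{"label": label, "score": 0.9}]


class FakeSentimentModel:
    """Hypothetical stand-in for the Modal class, runnable without Modal."""

    load_calls = 0  # counts how many times the expensive load has run

    def load_model(self):
        # Plays the role of the @modal.enter() method: runs once per container.
        FakeSentimentModel.load_calls += 1
        time.sleep(0.05)  # pretend to download and initialize the model
        self.sentiment_pipeline = fake_classifier

    def predict(self, text):
        # Plays the role of the endpoint method: runs once per request
        # and reuses the pipeline loaded above.
        return {"prediction": self.sentiment_pipeline(text)[0]}


def simulate_container(texts):
    """Hypothetical model of one container's lifetime: load once, then serve each request."""
    model = FakeSentimentModel()
    model.load_model()  # Modal calls the enter hook before the first request
    return [model.predict(text) for text in texts]
```

Running `simulate_container` on three texts performs one load and three predictions; loading inside `predict` instead would repeat the download for every request.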
Create the Inference Endpoint
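Add a method to the class and decorate it with `@modal.web_endpoint(method="POST")`, which exposes it as a public HTTP endpoint (recent Modal releases rename this decorator to `@modal.fastapi_endpoint`). The method receives a JSON body with a `text` field, runs the text through the loaded model, and returns the prediction as JSON.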
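The validation and response shaping inside `predict` can be pulled into a plain function so they are testable without deploying. This sketch assumes a hypothetical `handle_predict` helper that returns a `(status_code, body)` pair and takes the classifier as a parameter; it also rejects a non-string or empty `text`, a stricter rule than the starter code applies.

```python
from typing import Any, Callable, Dict, List, Tuple


def handle_predict(
    data: Any, classify: Callable[[str], List[dict]]
) -> Tuple[int, Dict[str, Any]]:
    """Hypothetical, framework-free version of the endpoint logic.

    Returns (status_code, json_body) so the rules can be tested
    without Modal, FastAPI, or a real model.
    """
    # Reject bodies that are not JSON objects or that lack a 'text' field.
    if not isinstance(data, dict) or "text" not in data:
        return 400, {"error": "'text' field not found in request body"}
    text = data["text"]
    # Stricter than the starter code: require a non-empty string.
    if not isinstance(text, str) or not text.strip():
        return 400, {"error": "'text' must be a non-empty string"}
    # The transformers pipeline returns a list with one dict per input string.
    result = classify(text)
    return 200, {"prediction": result[0]}
```

Inside the Modal method, a 400 result would be raised as `fastapi.HTTPException(status_code=400, detail=...)`, because FastAPI does not treat a returned `(body, status)` tuple as a status code.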
Deploy and Test the API
Deploy the application from your terminal with `modal deploy deploy_model.py`. Modal prints a public URL for the new endpoint, which you can test with `curl` or any HTTP client.
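Besides `curl`, a short Python client works too. This sketch uses only the standard library (`urllib.request` and `json`); `build_request` and `classify` are hypothetical helper names, and `ENDPOINT_URL` is the placeholder from the starter code, which you must replace with the URL printed by `modal deploy`.

```python
import json
import urllib.request

# Placeholder from the starter code; replace with the URL printed by `modal deploy`.
ENDPOINT_URL = "https://your-modal-url.modal.run"


def build_request(url: str, text: str) -> urllib.request.Request:
    """Build a POST request whose JSON body matches what predict() expects."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def classify(text: str, url: str = ENDPOINT_URL, timeout: float = 60.0) -> dict:
    """Send one text to the deployed endpoint and return the decoded JSON reply."""
    # A generous timeout leaves room for a cold start, when the container
    # must download and load the model before answering.
    with urllib.request.urlopen(build_request(url, text), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

After setting the real URL, `classify("Modal is an amazing tool for deploying models.")` should return a body of the form `{"prediction": {"label": ..., "score": ...}}`. The long default timeout is deliberate: the first request to a cold container waits for the model to load.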

Starter code

import modal

# Define the container image with the Python libraries the app needs.
# FastAPI must be installed in the image for web endpoints to work.
image = modal.Image.debian_slim().pip_install("transformers", "torch", "fastapi[standard]")

# Create a Modal App. The name is used for deployment.
app = modal.App("sentiment-analysis-api", image=image)

@app.cls()
class SentimentModel:
    """
    A class that loads a sentiment analysis model and provides a method for inference.
    """
    @modal.enter()
    def load_model(self):
        """
        This method is run once when the container starts up.
        It downloads the model from Hugging Face and initializes the pipeline.
        """
        from transformers import pipeline
        print("\nDownloading and loading sentiment analysis model...\n")
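        # Pin the checkpoint explicitly; this is the model transformers
        # falls back to by default for the sentiment-analysis task.
        self.sentiment_pipeline = pipeline(
            "sentiment-analysis",
            model="distilbert/distilbert-base-uncased-finetuned-sst-2-english",
        )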
        print("\nModel loaded successfully!\n")

    @modal.web_endpoint(method="POST")
    def predict(self, data: dict):
        """
        This method is exposed as a web endpoint. It takes a JSON object
        with a 'text' field and returns the model's sentiment prediction.
        """
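        if "text" not in data:
            # Return a 400 Bad Request if the 'text' key is missing.
            # FastAPI would serialize a (body, status) tuple as a JSON array
            # with status 200, so raise HTTPException instead.
            from fastapi import HTTPException
            raise HTTPException(status_code=400, detail="'text' field not found in request body")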

        text_to_analyze = data["text"]
        print(f"Performing inference for: '{text_to_analyze}'")
        
        # Run inference using the pre-loaded pipeline
        result = self.sentiment_pipeline(text_to_analyze)
        
        # The result from the pipeline is a list of dictionaries
        return {"prediction": result[0]}

# To deploy this API, save the code as a Python file (e.g., deploy_model.py)
# and run the following command in your terminal:
# modal deploy deploy_model.py
#
# To test the deployed endpoint, replace the URL below with the one printed
# by `modal deploy` and use a tool like curl:
# curl -X POST -H "Content-Type: application/json" \
# -d '{"text": "Modal is an amazing tool for deploying models."}' \
# https://your-modal-url.modal.run