model-serving, llm-inference, self-hosting, openai-api, python, paged-attention, vllm, high-throughput

Deploy an OpenAI-Compatible LLM API with vLLM

Use vLLM to self-host large language models with a high-performance, OpenAI-compatible API. This guide shows how to launch a local server for a model like Mistral-7B and query it using the standard OpenAI Python client.

intermediate · 30 min · 4 steps

The play

Install vLLM
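Install vLLM with 'pip install vllm'. This guide assumes Python 3.8 or newer and a CUDA-enabled GPU (CUDA 11.8 or 12.1 is recommended); recent vLLM releases may require a newer Python or CUDA version, so check the installation notes for the release you install. The package includes the OpenAI-compatible server used in the next step.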
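Before launching anything, it can help to confirm that the interpreter and the package are in place. The sketch below is one way to do that with the standard library only; the helper name check_environment and the 3.8 floor (taken from this guide) are illustrative assumptions, and the check does not verify the GPU or CUDA setup.

```python
import importlib.util
import sys

# Assumption: 3.8 is the minimum quoted in this guide; the vLLM release
# you install may require a newer Python.
MIN_PYTHON = (3, 8)


def check_environment(min_python=MIN_PYTHON):
    """Return a list of problems; an empty list means the basic checks passed."""
    problems = []
    if sys.version_info[:2] < min_python:
        problems.append(
            f"Python {min_python[0]}.{min_python[1]}+ required, "
            f"found {sys.version_info.major}.{sys.version_info.minor}"
        )
    # find_spec locates the package without importing it, so the check is
    # fast and does not initialize CUDA.
    if importlib.util.find_spec("vllm") is None:
        problems.append("vllm is not installed; run: pip install vllm")
    return problems


for problem in check_environment():
    print("WARNING:", problem)
```

If nothing is printed, the Python version and the vLLM install look usable for the next step.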
Launch the API Server
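Start the vLLM server with a model from the Hugging Face Hub: 'python -m vllm.entrypoints.openai.api_server --model mistralai/Mistral-7B-Instruct-v0.1'. The command downloads the model on first run (if it is not already cached) and serves it through an OpenAI-compatible REST API at http://localhost:8000. This guide uses Mistral-7B Instruct.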
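Loading a 7B model can take a while, so requests sent right after launch may fail with a connection error. The following is a minimal sketch of building the launch command and polling the server until it answers; build_server_command and wait_for_server are hypothetical helper names, and the sketch assumes the server exposes the OpenAI-style GET /v1/models route and accepts a --port flag.

```python
import http.client
import sys
import time
import urllib.error
import urllib.request


def build_server_command(model, port=8000):
    """Build the argv list for the vLLM OpenAI-compatible server."""
    return [
        sys.executable, "-m", "vllm.entrypoints.openai.api_server",
        "--model", model,
        "--port", str(port),
    ]


def wait_for_server(base_url="http://localhost:8000/v1", timeout_s=300.0, poll_s=2.0):
    """Poll GET {base_url}/models until it returns HTTP 200 or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with urllib.request.urlopen(f"{base_url}/models", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, http.client.HTTPException, OSError):
            pass  # server is not accepting requests yet
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)


# Usage sketch (not executed here because it needs a CUDA GPU):
#   import subprocess
#   server = subprocess.Popen(build_server_command("mistralai/Mistral-7B-Instruct-v0.1"))
#   if not wait_for_server():
#       server.terminate()
```

Polling the model list rather than sleeping for a fixed time keeps the script independent of download and load times, which vary widely between machines.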
Test Server with cURL
In a new terminal, send a test request with cURL to the completions endpoint (http://localhost:8000/v1/completions). A JSON response confirms that the server is running and responsive. The 'model' field in the request must match the name of the model you are serving.
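For readers who prefer to stay in Python, the same smoke test can be sketched with the standard library. The helper names below are hypothetical, and the prompt, max_tokens, and temperature values are arbitrary choices for a quick check rather than recommended settings.

```python
import json
import urllib.request


def build_completion_request(prompt, model="mistralai/Mistral-7B-Instruct-v0.1",
                             base_url="http://localhost:8000/v1", max_tokens=32):
    """Build a POST request for the /completions endpoint without sending it."""
    payload = {
        "model": model,  # must match the model the server was started with
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0,
    }
    return urllib.request.Request(
        f"{base_url}/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_completion(request, timeout=60):
    """Send the request and return the generated text of the first choice."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]


# With the server running:
#   print(send_completion(build_completion_request("San Francisco is a")))
```

Separating request construction from sending makes it easy to inspect the exact URL and JSON body when a request is rejected.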
Query with the OpenAI Python Client
Use the official 'openai' library to talk to your local vLLM server. Install it with 'pip install openai', then run a script that points the client's 'base_url' at your local endpoint. The server does not check 'api_key' by default, but the client requires a value.

Starter code

# This is a complete, runnable Python script to query a local vLLM server.
# Prerequisite: Install the openai library with `pip install openai`

# Step 1: In a separate terminal, start the vLLM server (requires a CUDA GPU).
# This will download the model on first run.
#
#   python -m vllm.entrypoints.openai.api_server --model mistralai/Mistral-7B-Instruct-v0.1
#

# Step 2: Run this Python script to query the server.

from openai import OpenAI

# Modify the port if you've changed it in the server command
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed" # API key is not used but required by the client
)

print("--- Sending request to vLLM server ---")

try:
    completion = client.chat.completions.create(
        model="mistralai/Mistral-7B-Instruct-v0.1",  # must match the served model
        messages=[
            # Mistral-7B-Instruct's chat template may reject a "system" role,
            # so the instruction is folded into the user message.
            {"role": "user", "content": "You are a helpful AI assistant. What are the three largest cities in the world by population?"}
        ],
        temperature=0.7,
        max_tokens=150,
    )

    print("\nResponse from model:")
    print(completion.choices[0].message.content)

except Exception as e:
    print(f"\nAn error occurred: {e}")
    print("Please ensure the vLLM server is running in a separate terminal.")
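The script above reads only the reply text, but the response carries a few other fields worth checking, in particular whether the answer was cut off by max_tokens. The sketch below works on a plain dictionary in the OpenAI chat-completion shape; SAMPLE_RESPONSE is a hand-written illustration, not real model output, and summarize is a hypothetical helper name.

```python
import json

# Hand-written sample in the OpenAI chat-completion shape (not real model output).
SAMPLE_RESPONSE = json.loads("""
{
  "object": "chat.completion",
  "model": "mistralai/Mistral-7B-Instruct-v0.1",
  "choices": [
    {
      "index": 0,
      "message": {"role": "assistant", "content": "Example reply text."},
      "finish_reason": "length"
    }
  ],
  "usage": {"prompt_tokens": 30, "completion_tokens": 150, "total_tokens": 180}
}
""")


def summarize(response):
    """Return (reply_text, was_truncated, total_tokens) from a chat-completion dict."""
    choice = response["choices"][0]
    # finish_reason "length" means generation stopped at max_tokens;
    # "stop" means the model ended the reply on its own.
    truncated = choice["finish_reason"] == "length"
    return choice["message"]["content"], truncated, response["usage"]["total_tokens"]


text, truncated, total = summarize(SAMPLE_RESPONSE)
print(text, truncated, total)
```

With the openai client the same fields are available as attributes, for example completion.choices[0].finish_reason and completion.usage.total_tokens. If replies are often truncated, raising max_tokens in the request is the usual first adjustment.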