Tags: model-serving, llm-inference, self-hosting, openai-api, python, paged-attention, vllm, high-throughput
Deploy an OpenAI-Compatible LLM API with vLLM
Use vLLM to self-host large language models with a high-performance, OpenAI-compatible API. This guide shows how to launch a local server for a model like Mistral-7B and query it using the standard OpenAI Python client.
Intermediate, 30 min, 4 steps
The play
- Install vLLM: Install vLLM with 'pip install vllm'. vLLM needs a CUDA-enabled GPU (CUDA 11.8 or 12.1 is recommended) and Python 3.8 or newer; recent releases may require a newer Python, so check the installation docs for your version. The command installs the core library needed for serving.
- Launch the API Server: Start the vLLM server with a model from the Hugging Face Hub, here Mistral-7B Instruct, using 'python -m vllm.entrypoints.openai.api_server --model mistralai/Mistral-7B-Instruct-v0.1'. The command downloads the model if it is not already cached and serves it through an OpenAI-compatible REST API on localhost:8000.
- Test Server with cURL: In a new terminal, use cURL to POST a test request to the completions endpoint (http://localhost:8000/v1/completions). A JSON response confirms the server is running and responsive. The 'model' field in the request must match the name of the model you are serving.
- Query with the OpenAI Python Client: Use the official 'openai' library to interact with your local vLLM server. Install it with 'pip install openai', then run a script that points the client's 'base_url' at your local endpoint. By default vLLM does not check the 'api_key', but the client requires a non-empty value.
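The smoke test in step 3 can also be written in Python using only the standard library. The sketch below is a minimal illustration, not the guide's official test: the helper names (`build_completion_request`, `send_completion`), the prompt, and the `max_tokens` default are assumptions, while the address and model name are the ones used in this guide.

```python
import json
import urllib.request

# Assumed values: the default address and the model served in this guide.
BASE_URL = "http://localhost:8000/v1"
MODEL = "mistralai/Mistral-7B-Instruct-v0.1"


def build_completion_request(prompt, model=MODEL, base_url=BASE_URL, max_tokens=32):
    """Build (but do not send) a POST request for the completions endpoint."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url}/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_completion(prompt, timeout=60):
    """Send the request and return the generated text of the first choice."""
    request = build_completion_request(prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

With the server from step 2 running, `print(send_completion("San Francisco is a"))` should print a short continuation. A connection error usually means the server is not up yet or is listening on a different port.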
Starter code
# This is a complete, runnable Python script to query a local vLLM server.
# Prerequisite: Install the openai library with `pip install openai`
# Step 1: In a separate terminal, start the vLLM server (requires a CUDA GPU).
# This will download the model on first run.
#
# python -m vllm.entrypoints.openai.api_server --model mistralai/Mistral-7B-Instruct-v0.1
#
# Step 2: Run this Python script to query the server.
from openai import OpenAI

# Modify the port if you've changed it in the server command
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",  # Not checked by vLLM by default, but required by the client
)

print("--- Sending request to vLLM server ---")
try:
    # Note: some Mistral instruct chat templates reject a "system" role. If the
    # server returns an error about conversation roles, move the system text
    # into the user message instead.
    completion = client.chat.completions.create(
        model="mistralai/Mistral-7B-Instruct-v0.1",  # Must match the served model name
        messages=[
            {"role": "system", "content": "You are a helpful AI assistant."},
            {"role": "user", "content": "What are the three largest cities in the world by population?"},
        ],
        temperature=0.7,
        max_tokens=150,
    )
    print("\nResponse from model:")
    print(completion.choices[0].message.content)
except Exception as e:
    print(f"\nAn error occurred: {e}")
    print("Please ensure the vLLM server is running in a separate terminal.")
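If the script fails, it helps to check whether the server is reachable and which model name it expects before debugging further. The following is a minimal sketch that queries the OpenAI-style model listing endpoint (`/v1/models`) with the standard library. The helper names `extract_model_ids` and `served_model_ids` are hypothetical, and the default address is the one assumed throughout this guide.

```python
import json
import urllib.request


def extract_model_ids(models_payload):
    """Return the model ids from an OpenAI-style model listing response body."""
    return [entry["id"] for entry in models_payload.get("data", [])]


def served_model_ids(base_url="http://localhost:8000/v1", timeout=5):
    """Return the served model ids, or an empty list if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as response:
            return extract_model_ids(json.load(response))
    except OSError:
        # Covers connection refused, timeouts, and HTTP errors raised by urllib.
        return []
```

Calling `served_model_ids()` while the server is running should return a list containing the name passed to `--model`. An empty list suggests the server is not up or is listening on another port. The value of `model` in the client script has to match one of the returned ids.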