local-llm · inference · gguf · quantization · cpp · self-hosting · cpu-inference

Run LLMs Locally and Efficiently with llama.cpp

llama.cpp lets you run large language models on your own computer, even without a powerful GPU. This guide shows you how to compile the code, download a quantized model, and run inference via the command line and a local server.

intermediate · 30 min · 5 steps
The play
  1. Clone and Build llama.cpp
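    Get the source code from GitHub and compile it. The `make` command builds the necessary executables, including `main` for inference and `server` for an OpenAI-compatible API. This works out of the box on Linux and macOS. Recent llama.cpp releases have moved to a CMake build and renamed these binaries to `llama-cli` and `llama-server`, so adjust the commands in this guide if `make` or `./main` is not available in your checkout.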
  2. Download a GGUF Model
    llama.cpp uses the GGUF model format for efficient, quantized inference. Download a small, capable model from Hugging Face. We'll use Phi-2, a 2.7-billion-parameter model whose 4-bit quantized build (`Q4_K_M`) is small enough to run comfortably on a CPU while still producing useful output.
  3. Run Command-Line Inference
    Use the `main` executable to run a simple text generation task. Pass the model path with `-m`, your prompt with `-p`, and the number of tokens to predict with `-n`. This is the quickest way to test your setup.
  4. Start the HTTP Server
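    Run llama.cpp as a local server for other applications to call. The `server` executable provides an OpenAI-compatible API endpoint. `-c` sets the context size in tokens, and `--host 0.0.0.0` binds the server to all network interfaces so that other machines on your local network can reach it. Leave the host at its default of `127.0.0.1` if only the local machine should have access.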
  5. Query the Server API
    With the server running in another terminal, send a POST request to the server's native `/completion` endpoint to get a response. OpenAI-style clients use the `/v1/chat/completions` route instead. Either way, this mimics how you would integrate the local LLM into your own applications.
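Steps 4 and 5 have no commands in the starter script, so here is a minimal sketch of what they might look like. The helper names (`start_server`, `completion_payload`, `query_server`), the port 8080, and the 2048-token context are assumptions chosen for illustration; the model path is the one the starter script downloads. The `prompt` and `n_predict` fields are the basic request parameters of the `/completion` endpoint.

```shell
#!/bin/bash
# Sketch of steps 4 and 5. Assumed values (not from the guide): port 8080,
# a 2048-token context, and the model path used in the starter script.
MODEL="./models/phi-2.Q4_K_M.gguf"
PORT=8080

# Step 4: start the server in the foreground (run this in its own terminal).
start_server() {
  ./server -m "$MODEL" -c 2048 --host 0.0.0.0 --port "$PORT"
}

# Step 5: build the JSON body for /completion.
# Note: the prompt is inserted as-is, so it must not contain quotes or backslashes.
completion_payload() {
  printf '{"prompt": "%s", "n_predict": %d}' "$1" "$2"
}

# Step 5: POST the body to the running server and print the raw JSON reply.
query_server() {
  curl -s -X POST "http://localhost:$PORT/completion" \
    -H "Content-Type: application/json" \
    -d "$(completion_payload "$1" "$2")"
}
```

After sourcing the file, run `start_server` in one terminal and something like `query_server "Write a haiku about autumn." 64` in another. The generated text comes back in the `content` field of the JSON response.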
Starter code
#!/bin/bash
# This script will download and run a small LLM locally using llama.cpp

# 1. Clone and build llama.cpp
echo "Cloning and building llama.cpp..."
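# Skip the clone if the directory already exists, and stop if we cannot enter it
[ -d llama.cpp ] || git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp || exit 1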
make

# 2. Download a small model (Phi-2 Q4_K_M GGUF)
echo "Downloading a small GGUF model..."
mkdir -p models
wget -O ./models/phi-2.Q4_K_M.gguf https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q4_K_M.gguf

# 3. Run inference
echo "Running inference with a sample prompt..."
./main -m ./models/phi-2.Q4_K_M.gguf -n 256 -p "Write a short story about a robot who discovers music." --color

# printf interprets \n; plain echo in bash would print it literally
printf "\nDone! Explore the 'llama.cpp' directory to try more, like running './server' for an API.\n"
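The starter script does not check that the download produced a model file. If the URL is wrong or the connection drops, `wget` can leave an HTML error page or a truncated file behind. GGUF files begin with the four ASCII bytes `GGUF`, so a quick sanity check could look like the sketch below. The `is_gguf` helper is hypothetical and only inspects the file header. It does not validate the rest of the file.

```shell
#!/bin/bash
# Hypothetical helper: succeeds only if the file starts with the 4-byte GGUF magic.
is_gguf() {
  [ "$(head -c 4 "$1" 2>/dev/null)" = "GGUF" ]
}

# Model path taken from the starter script; the check is skipped if the file is absent.
MODEL="./models/phi-2.Q4_K_M.gguf"
if [ -f "$MODEL" ] && ! is_gguf "$MODEL"; then
  echo "Warning: $MODEL does not look like a GGUF file; the download may have failed." >&2
fi
```

Running this between the download and inference steps gives a clearer error than the one `./main` would report for an invalid model file.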