Article
local-llm, inference, gguf, quantization, cpp, self-hosting, cpu-inference
Run LLMs Locally and Efficiently with llama.cpp
llama.cpp lets you run large language models on your own computer, even without a powerful GPU. This guide shows you how to compile the code, download a quantized model, and run inference via the command line and a local server.
Intermediate | 30 min | 5 steps
The play
- Clone and Build llama.cpp: Get the source code from GitHub and compile it. The `make` command builds the executables, including `main` for inference and `server` for an HTTP API with OpenAI-compatible endpoints. This works out of the box on Linux and macOS once a C/C++ toolchain is installed. Newer releases of llama.cpp build with CMake and name these binaries `llama-cli` and `llama-server`, so adjust the commands if your checkout has no `main`.
- Download a GGUF Model: llama.cpp loads models in the GGUF format, which stores quantized weights for efficient inference. Download a small, capable model from Hugging Face. We'll use Phi-2 in its Q4_K_M quantization, which offers a good balance of output quality and file size for CPU inference.
- Run Command-Line Inference: Use the `main` executable to run a simple text generation task. Pass the model path with `-m`, your prompt with `-p`, and the number of tokens to predict with `-n`. This is the quickest way to test your setup.
- Start the HTTP Server: Run llama.cpp as a local server that other applications can call. The `server` executable exposes an OpenAI-compatible API. `-c` sets the context size in tokens, and `--host 0.0.0.0` binds to all network interfaces so that other machines on your local network can reach it.
- Query the Server API: With the server running in another terminal, send a POST request with a JSON body to the server's `/completion` endpoint to get a response. This mimics how you would integrate the local LLM into your own applications.
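Steps 4 and 5 can be sketched as a pair of small shell helpers. This is a minimal sketch, not the project's official client: the function names `build_payload` and `query_completion` and the `LLAMA_URL` variable are hypothetical, the port assumes the server's default of 8080, and the `-c 2048` value in the usage comment is only an example. The JSON body uses the `prompt` and `n_predict` fields of the `/completion` endpoint.

```shell
#!/usr/bin/env bash
# Sketch: helpers for querying a running llama.cpp server.
# Assumption: the server was started in another terminal, for example:
#   ./server -m ./models/phi-2.Q4_K_M.gguf -c 2048 --host 0.0.0.0

# Build the JSON body for POST /completion.
# Only backslashes and double quotes are escaped here; prompts containing
# newlines or control characters need a real JSON encoder.
build_payload() {
  local prompt="$1" n_predict="${2:-128}"
  prompt=${prompt//\\/\\\\}   # escape backslashes first
  prompt=${prompt//\"/\\\"}   # then escape double quotes
  printf '{"prompt": "%s", "n_predict": %d}\n' "$prompt" "$n_predict"
}

# Send the request. LLAMA_URL is a hypothetical override for the base URL;
# http://localhost:8080 assumes the server's default port.
query_completion() {
  local url="${LLAMA_URL:-http://localhost:8080}"
  curl -s "$url/completion" \
    -H "Content-Type: application/json" \
    -d "$(build_payload "$@")"
}

# Usage (with the server running):
#   query_completion "Write a haiku about local inference." 64
```

The generated text comes back in the `content` field of the JSON response. Keeping payload construction separate from the `curl` call makes it easy to inspect the request body before anything is sent.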
Starter code
#!/bin/bash
# Download and run a small LLM locally using llama.cpp.
set -e  # stop at the first failing step

# 1. Clone and build llama.cpp
# Note: newer checkouts build with CMake and name the binary llama-cli instead of main.
echo "Cloning and building llama.cpp..."
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
make

# 2. Download a small model (Phi-2 Q4_K_M GGUF)
echo "Downloading a small GGUF model..."
mkdir -p models
wget -O ./models/phi-2.Q4_K_M.gguf https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q4_K_M.gguf

# 3. Run inference
echo "Running inference with a sample prompt..."
./main -m ./models/phi-2.Q4_K_M.gguf -n 256 -p "Write a short story about a robot who discovers music." --color

# printf interprets \n; plain echo in bash would print it literally
printf "\nDone! Explore the 'llama.cpp' directory to try more, like running './server' for an API.\n"
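Two pre-flight checks can make the starter script more forgiving. The sketch below is illustrative only, and the helper names `is_gguf` and `find_llama_bin` are hypothetical. It relies on two facts: a GGUF file begins with the four ASCII bytes `GGUF`, and the inference binary is `main` in older make-based builds but `llama-cli` (typically under `build/bin/`) in newer CMake-based ones.

```shell
#!/usr/bin/env bash
# Sketch: sanity checks to run before inference.

# Succeed only if the file exists and starts with the GGUF magic bytes.
# This catches a failed download that saved an HTML error page instead.
is_gguf() {
  [ -f "$1" ] && [ "$(head -c 4 "$1")" = "GGUF" ]
}

# Print the first inference binary found under the given checkout directory.
# The candidate paths are assumptions covering CMake and make layouts.
find_llama_bin() {
  local dir="${1:-.}" candidate
  for candidate in "$dir/build/bin/llama-cli" "$dir/llama-cli" "$dir/main"; do
    if [ -x "$candidate" ]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1
}

# Usage, from inside the llama.cpp directory:
#   is_gguf ./models/phi-2.Q4_K_M.gguf || echo "Model file is missing or not GGUF"
#   bin=$(find_llama_bin .) && "$bin" -m ./models/phi-2.Q4_K_M.gguf -n 64 -p "Hello"
```

Checking the magic bytes is cheap and fails fast, whereas handing a bad file to the binary produces a less obvious load error.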