Benchmark LLMs with the Model Evaluation Harness

Use the Model Evaluation Harness script to benchmark models against standard datasets like MMLU and GSM8K. Generate detailed reports and comparison charts to identify the best model for your specific use case.

Intermediate | 30 min | 4 steps

The play

Install the Harness and Dependencies
Clone the script's repository and install the required Python packages from requirements.txt. This sets up your local environment to run evaluations from the command line.
Run a Single Benchmark
Evaluate a single model against a standard benchmark like GSM8K. Use the `--limit` flag to run on a small subset of the data for a quick test. Results are saved to a `results/` directory.
Compare Multiple Models
Evaluate several models in a single run by passing a comma-separated list of model names. The Model Evaluation Harness then produces a comparison report with charts and tables that contrast their performance.
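As a rough sketch of this step, the snippet below joins model names into a comma-separated list and prints the resulting command. Two things here are assumptions rather than documented behavior: that the list is passed to `--model` (the flag is not named above), and that `hf-causal/distilgpt2` is a valid identifier for this harness. The `join_by_comma` helper and the `RUN` variable are hypothetical names introduced for illustration.

```shell
# Sketch: compare several models in one run.
# Assumption: the comma-separated list is passed to --model.
# Assumption: hf-causal/distilgpt2 is a valid model identifier (hypothetical).

# Join all arguments into one comma-separated string.
join_by_comma() {
  local IFS=","
  echo "$*"
}

model_list="$(join_by_comma "hf-causal/gpt2" "hf-causal/distilgpt2")"

# Dry run by default: print the command. Set RUN=1 to execute it.
if [ "${RUN:-0}" = "1" ]; then
  python evaluate.py --model "$model_list" --tasks gsm8k --limit 5
else
  echo "Would run: python evaluate.py --model $model_list --tasks gsm8k --limit 5"
fi
```

Printing the command first makes it easy to confirm the model list before committing to a long evaluation. Keeping `--limit 5` keeps the first comparison quick.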
Evaluate on a Custom Dataset
Test a model on your own data. Create a JSONL file where each line is a JSON object with `prompt` and `answer` keys, then pass the file path to the `--custom_tasks` argument.
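A minimal sketch of preparing such a file might look like the following. The file name `custom_tasks.jsonl`, the two sample records, and the `validate_jsonl` helper are illustrative assumptions. The check is only a rough structural test for the two keys, not a full JSON parser. The final command is left commented out because combining `--custom_tasks` with `--model` in this way is also an assumption.

```shell
# Sketch: create a small custom dataset and sanity-check it before evaluating.
# Assumption: the file name and both sample records are illustrative only.

dataset="custom_tasks.jsonl"

# One JSON object per line, each with "prompt" and "answer" keys.
cat > "$dataset" <<'EOF'
{"prompt": "What is 12 * 7?", "answer": "84"}
{"prompt": "What is the capital of France?", "answer": "Paris"}
EOF

# Rough structural check (not a full JSON parser): report every line that
# lacks a "prompt" or an "answer" key, and return non-zero if any line does.
validate_jsonl() {
  awk '
    !/"prompt"[ \t]*:/ || !/"answer"[ \t]*:/ {
      printf "line %d: missing \"prompt\" or \"answer\"\n", NR
      bad = 1
    }
    END { exit bad }
  ' "$1" >&2
}

if validate_jsonl "$dataset"; then
  echo "$dataset looks well-formed"
  # Assumption: --custom_tasks is combined with --model as in the starter code.
  # python evaluate.py --model hf-causal/gpt2 --custom_tasks "$dataset"
fi
```

Checking the file first helps catch a malformed line before an evaluation run starts.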

Starter code

#!/bin/bash
# This script sets up the Model Evaluation Harness and runs a quick evaluation.
# Exit on the first failed command so a failed clone or install stops the run.
set -euo pipefail

# 1. Clone the repository and navigate into it
if [ ! -d "model-evaluation-harness" ]; then
  git clone https://github.com/aaas/model-evaluation-harness.git
fi
cd model-evaluation-harness

# 2. Install dependencies
pip install -r requirements.txt

# 3. Run a quick evaluation on the GSM8K benchmark
# This uses a small, well-known model and limits the test to 5 examples for speed.
echo "Running evaluation on gpt2 with GSM8K (limit 5)..."
python evaluate.py \
  --model hf-causal/gpt2 \
  --tasks gsm8k \
  --limit 5

echo
echo "Evaluation complete. Check the 'results/' directory for the output report."