Tags: evaluation, llm-evaluation, ai-benchmarking, scientific-reasoning, deep-learning-research, openai-api

GPQA Diamond Benchmark

Evaluate your large language model's scientific reasoning with the GPQA Diamond benchmark. This Action Pack guides you through setting up an evaluation environment, loading PhD-level multiple-choice science questions, and running your LLM against them to measure how often it picks the correct answer.

Intermediate · 30 min · 5 steps

The play

Set Up Your Environment
Create a Python virtual environment and install the required libraries, `openai` and `pandas`. For demonstration, the starter code defines a mock `gpqa_diamond` object in place of a real dataset loader.
Load the GPQA Diamond Benchmark Dataset
Use the mock `gpqa_diamond` object to simulate loading a sample of questions. Its two placeholder questions are far easier than real graduate-level items, but they let you test the evaluation pipeline end to end.
Integrate Your LLM
Set your OpenAI API key and choose a model (e.g., `gpt-3.5-turbo` or `gpt-4`). This step prepares the client used to query the model.
Run Evaluation Loop
Iterate through each question in the dataset, format a prompt for your LLM, and record its response. Store the LLM's answers for later analysis.
Analyze and Report Results
Calculate your LLM's overall accuracy on the benchmark and display the per-question results. With the mock data this only confirms that the pipeline works; on the full benchmark it gives a quantitative measure of scientific reasoning performance.
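
The prompt-formatting and answer-recording parts of the evaluation loop can be sketched as two small helpers. This is a minimal sketch, not part of any GPQA tooling: the names `format_prompt` and `extract_choice` are hypothetical, and the parsing rule (a leading letter, otherwise the first standalone uppercase A-D) is an assumption that will misread replies beginning with the article "A".

```python
import re


def format_prompt(question, options):
    """Build a multiple-choice prompt from a question and pre-lettered options."""
    option_block = "\n".join(options)
    return (
        "Answer the following graduate-level science question by choosing "
        "the best option (A, B, C, or D).\n\n"
        f"Question: {question}\n"
        f"Options:\n{option_block}\n\n"
        "Your answer (just the letter):"
    )


def extract_choice(reply):
    """Return the chosen letter (A-D) from a model reply, or None if absent."""
    text = (reply or "").strip()
    # Case 1: the reply starts with the letter, e.g. "c", "C." or "(b) Honey".
    leading = re.match(r"\(?([A-Da-d])\b", text)
    if leading:
        return leading.group(1).upper()
    # Case 2: an uppercase letter stands alone later, e.g. "The answer is D".
    later = re.search(r"\b([A-D])\b", text)
    return later.group(1) if later else None
```

Returning `None` for an unparseable reply keeps malformed answers distinguishable from wrong ones, which helps when inspecting the results table.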

Starter code

import os
import openai
import pandas as pd

# Mock gpqa_diamond library for demonstration purposes
class MockGPQADiamond:
    def load_dataset(self, split='test'):
        return [
            {
                'question': "What is the primary function of mitochondria in eukaryotic cells?",
                'options': ['A) Photosynthesis', 'B) Protein synthesis', 'C) ATP production', 'D) Waste removal'],
                'answer_id': 'C',
                'answer_text': 'ATP production'
            },
            {
                'question': "Which of the following is an example of a non-Newtonian fluid?",
                'options': ['A) Water', 'B) Honey', 'C) Air', 'D) Oobleck (cornstarch and water mixture)'],
                'answer_id': 'D',
                'answer_text': 'Oobleck (cornstarch and water mixture)'
            }
        ]

gpqa_diamond = MockGPQADiamond()

# --- Copy-Paste Starter Code ---

# 1. Set your OpenAI API key. Prefer exporting OPENAI_API_KEY in your shell;
#    the placeholder below is used only if the variable is not already set.
os.environ.setdefault("OPENAI_API_KEY", "YOUR_OPENAI_API_KEY")
client = openai.OpenAI()
llm_model = "gpt-3.5-turbo"

# 2. Load the mock GPQA Diamond dataset
gpqa_dataset = gpqa_diamond.load_dataset(split='test')
print(f"Loaded {len(gpqa_dataset)} questions from GPQA Diamond benchmark (mock data).")

# 3. Run evaluation loop
results = []
for i, question_data in enumerate(gpqa_dataset):
    question = question_data['question']
    options = '\n'.join(question_data['options'])
    correct_answer_id = question_data['answer_id']

    prompt = f"""Answer the following graduate-level science question by choosing the best option (A, B, C, or D).

Question: {question}
Options:
{options}

Your answer (just the letter):"""

    try:
        response = client.chat.completions.create(
            model=llm_model,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=5,
            temperature=0.0
        )
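        # Normalize the reply so "c", "C." and "(C) ATP production" all yield "C".
        raw_answer = (response.choices[0].message.content or "").strip().upper()
        llm_answer = raw_answer.lstrip("([")[:1]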
        is_correct = (llm_answer == correct_answer_id)

        results.append({
            'question_id': i + 1,
            'llm_answer': llm_answer,
            'correct_answer_id': correct_answer_id,
            'is_correct': is_correct
        })
        print(f"Q{i+1}: LLM answered {llm_answer}, Correct: {correct_answer_id} -> {is_correct}")

    except Exception as e:
        print(f"Error processing Q{i+1}: {e}")
        results.append({'question_id': i + 1, 'llm_answer': 'ERROR', 'correct_answer_id': correct_answer_id, 'is_correct': False})

# 4. Analyze results
df_results = pd.DataFrame(results)
if not df_results.empty:
    accuracy = df_results['is_correct'].mean() * 100
    print("\n--- GPQA Diamond Benchmark Results ---")
    print(f"Accuracy: {accuracy:.2f}%")
    print(df_results[['question_id', 'llm_answer', 'correct_answer_id', 'is_correct']].to_string())
else:
    print("No results to display.")
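
The accuracy printed by the starter code is a single point estimate, and on a small question set it can swing widely. One way to show that uncertainty is a Wilson score interval around the accuracy. The sketch below is an optional addition, not part of the original pipeline; the function name `wilson_interval` and the default `z=1.96` (roughly 95% coverage) are assumptions chosen for illustration.

```python
import math


def wilson_interval(num_correct, num_total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if num_total <= 0:
        raise ValueError("num_total must be positive")
    p_hat = num_correct / num_total
    denom = 1 + z**2 / num_total
    centre = (p_hat + z**2 / (2 * num_total)) / denom
    half_width = (z / denom) * math.sqrt(
        p_hat * (1 - p_hat) / num_total + z**2 / (4 * num_total**2)
    )
    # Clamp to [0, 1] to absorb floating-point drift at the extremes.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# Example: one of the two mock questions answered correctly.
low, high = wilson_interval(num_correct=1, num_total=2)
print(f"Accuracy 50.00%, approximate 95% interval: {low:.1%} to {high:.1%}")
```

With only two questions the interval spans most of the 0-100% range, which is why results from the mock data should be read as a pipeline check rather than a benchmark score.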