Article
evaluation, llm-evaluation, ai-benchmarking, scientific-reasoning, deep-learning-research, openai-api
GPQA Diamond Benchmark
Evaluate your large language model's scientific reasoning with the GPQA Diamond benchmark. This Action Pack guides you through setting up an evaluation environment, loading PhD-level science questions, and running your LLM against them to measure how accurately it handles multi-step problems.
intermediate | 30 min | 5 steps
The play
- Set Up Your Environment: Create a Python virtual environment and install the required libraries, including `openai` and `pandas`. For demonstration, we'll use a mock `gpqa_diamond` library in place of the real benchmark data.
- Load the GPQA Diamond Benchmark Dataset: Use the mock `gpqa_diamond` library to simulate loading a sample of graduate-level questions. The mock lets you test the evaluation pipeline end to end.
- Integrate Your LLM: Set up your OpenAI API key and choose a model (e.g., `gpt-3.5-turbo` or `gpt-4`). This step prepares your model for querying.
- Run Evaluation Loop: Iterate through each question in the dataset, format a prompt for your LLM, and record its response. Store the LLM's answers for later analysis.
- Analyze and Report Results: Calculate the overall accuracy of your LLM on the benchmark and display the results. On real benchmark data this gives a quantitative measure of scientific reasoning; on the mock data it only confirms that the pipeline works.
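The evaluation and analysis steps above hinge on two small pieces: turning a question record into a prompt, and reducing the model's free-text reply to a single option letter before comparing it with the answer key. Here is a minimal sketch, assuming records shaped like this pack's mock data (`question`, `options`, `answer_id`). The helper names `build_prompt` and `extract_choice` are illustrative and not part of any library.

```python
import re


def build_prompt(record: dict) -> str:
    """Format one question record as a multiple-choice prompt."""
    options = "\n".join(record["options"])
    return (
        "Answer the following graduate-level science question by choosing "
        "the best option (A, B, C, or D).\n"
        f"Question: {record['question']}\n"
        f"Options:\n{options}\n"
        "Your answer (just the letter):"
    )


def extract_choice(reply: str) -> str:
    """Return the first standalone A-D letter in the reply, or '' if none.

    Caveat: a reply that opens with the English article "A" would be read
    as option A, so keep the prompt's "just the letter" instruction.
    """
    match = re.search(r"\b([ABCD])\b", reply.upper())
    return match.group(1) if match else ""


# Hypothetical record mirroring the mock data format.
sample = {
    "question": "What is the primary function of mitochondria in eukaryotic cells?",
    "options": ["A) Photosynthesis", "B) Protein synthesis",
                "C) ATP production", "D) Waste removal"],
    "answer_id": "C",
}

print(build_prompt(sample))
print(extract_choice("C) ATP production") == sample["answer_id"])
```

Parsing the reply this way means answers such as `C)` or `(c).` still score as correct, which a strict string comparison would miss.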
Starter code
import os

import openai
import pandas as pd


# Mock gpqa_diamond library for demonstration purposes
class MockGPQADiamond:
    def load_dataset(self, split='test'):
        # Returns a fixed two-question sample; `split` is accepted but ignored.
        return [
            {
                'question': "What is the primary function of mitochondria in eukaryotic cells?",
                'options': ['A) Photosynthesis', 'B) Protein synthesis', 'C) ATP production', 'D) Waste removal'],
                'answer_id': 'C',
                'answer_text': 'ATP production'
            },
            {
                'question': "Which of the following is an example of a non-Newtonian fluid?",
                'options': ['A) Water', 'B) Honey', 'C) Air', 'D) Oobleck (cornstarch and water mixture)'],
                'answer_id': 'D',
                'answer_text': 'Oobleck (cornstarch and water mixture)'
            }
        ]


gpqa_diamond = MockGPQADiamond()

# --- Copy-Paste Starter Code ---
# 1. Provide your OpenAI API key. Prefer exporting OPENAI_API_KEY in your shell;
#    setdefault() only fills in the placeholder when the variable is unset,
#    so a real key already in the environment is never overwritten.
os.environ.setdefault("OPENAI_API_KEY", "YOUR_OPENAI_API_KEY")
client = openai.OpenAI()  # reads OPENAI_API_KEY from the environment
llm_model = "gpt-3.5-turbo"

# 2. Load the mock GPQA Diamond dataset
gpqa_dataset = gpqa_diamond.load_dataset(split='test')
print(f"Loaded {len(gpqa_dataset)} questions from GPQA Diamond benchmark (mock data).")

# 3. Run evaluation loop
results = []
for i, question_data in enumerate(gpqa_dataset):
    question = question_data['question']
    options = '\n'.join(question_data['options'])
    correct_answer_id = question_data['answer_id']
    prompt = (
        "Answer the following graduate-level science question by choosing "
        "the best option (A, B, C, or D).\n"
        f"Question: {question}\n"
        f"Options:\n{options}\n"
        "Your answer (just the letter):"
    )
    try:
        response = client.chat.completions.create(
            model=llm_model,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=5,
            temperature=0.0
        )
        # message.content can be None, so fall back to an empty string.
        raw_answer = (response.choices[0].message.content or "").strip().upper()
        # Keep only the option letter so replies like "C)" or "(C)." still match.
        tokens = [token.strip("().:,") for token in raw_answer.split()]
        llm_answer = next((t for t in tokens if t in ("A", "B", "C", "D")), raw_answer)
        is_correct = (llm_answer == correct_answer_id)
        results.append({
            'question_id': i + 1,
            'llm_answer': llm_answer,
            'correct_answer_id': correct_answer_id,
            'is_correct': is_correct
        })
        print(f"Q{i+1}: LLM answered {llm_answer}, Correct: {correct_answer_id} -> {is_correct}")
    except Exception as e:
        # Failed requests are recorded as incorrect so they still count in the total.
        print(f"Error processing Q{i+1}: {e}")
        results.append({'question_id': i + 1, 'llm_answer': 'ERROR', 'correct_answer_id': correct_answer_id, 'is_correct': False})

# 4. Analyze results
df_results = pd.DataFrame(results)
if not df_results.empty:
    # The mean of a boolean column is the fraction of correct answers.
    accuracy = df_results['is_correct'].mean() * 100
    print("\n--- GPQA Diamond Benchmark Results ---")
    print(f"Accuracy: {accuracy:.2f}%")
    print(df_results[['question_id', 'llm_answer', 'correct_answer_id', 'is_correct']].to_string())
else:
    print("No results to display.")
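
Before spending API calls, it can help to exercise the loop and the accuracy calculation offline. The sketch below swaps the OpenAI call for a stub and uses only the standard library. `run_eval`, `accuracy`, `always_c`, and the two-record `dataset` are hypothetical names introduced here for illustration; the records mirror the mock data above.

```python
from typing import Callable


def run_eval(dataset: list[dict], ask: Callable[[str], str]) -> list[dict]:
    """Call `ask` once per question and record whether it matches the key."""
    results = []
    for i, record in enumerate(dataset, start=1):
        prompt = record["question"] + "\n" + "\n".join(record["options"])
        answer = ask(prompt).strip().upper()
        results.append({
            "question_id": i,
            "llm_answer": answer,
            "correct_answer_id": record["answer_id"],
            "is_correct": answer == record["answer_id"],
        })
    return results


def accuracy(results: list[dict]) -> float:
    """Percentage of correct answers; 0.0 for an empty result list."""
    if not results:
        return 0.0
    return 100.0 * sum(r["is_correct"] for r in results) / len(results)


def always_c(prompt: str) -> str:
    """Stub standing in for a model call: always answers 'C'."""
    return "C"


dataset = [
    {"question": "What is the primary function of mitochondria in eukaryotic cells?",
     "options": ["A) Photosynthesis", "B) Protein synthesis",
                 "C) ATP production", "D) Waste removal"],
     "answer_id": "C"},
    {"question": "Which of the following is an example of a non-Newtonian fluid?",
     "options": ["A) Water", "B) Honey", "C) Air",
                 "D) Oobleck (cornstarch and water mixture)"],
     "answer_id": "D"},
]

results = run_eval(dataset, always_c)
print(f"Accuracy: {accuracy(results):.2f}%")  # one of two correct -> 50.00%
```

Because `run_eval` takes the model as a plain function, the same loop can later be pointed at a wrapper around `client.chat.completions.create` without changing the scoring code.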