Paper·arxiv.org
Tags: llm, evaluation, research, machine-learning, embeddings, bert-as-a-judge
BERT-as-a-Judge: A Robust Alternative to Lexical Methods for Efficient Reference-Based LLM Evaluation
Implement BERT-as-a-Judge to evaluate LLM outputs by measuring their semantic similarity to reference answers. Unlike rigid lexical matching, this approach can credit a correct answer that is worded differently from the reference, giving a more reliable signal for model selection and deployment.
Intermediate · 15 min · 6 steps
The play
- Recognize Lexical Evaluation Gaps: Understand why lexical-overlap metrics (e.g., exact match, BLEU, ROUGE) fall short for generative LLM outputs, which can express the same correct answer in many different wordings.
- Gather LLM Output and Reference: For each prompt, collect the LLM's generated response and a corresponding human-written or gold-standard reference answer.
- Encode with BERT Embeddings: Use a pre-trained BERT model (e.g., from Hugging Face Transformers) to convert both the LLM output and the reference answer into dense vector representations (embeddings) that capture the semantic meaning of the text.
- Compute Semantic Similarity: Calculate a similarity score, such as cosine similarity, between the embeddings of the LLM output and the reference answer. A higher score indicates greater semantic alignment.
- Assess LLM Quality: Treat the similarity score as BERT's "judgment" of the response's quality, relevance, and accuracy, rather than relying on surface-level lexical comparison.
- Integrate into Evaluation Pipeline: Add this BERT-based similarity check to your LLM development and deployment workflows to support model selection, fine-tuning, and performance monitoring.
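To make the first step concrete, the minimal sketch below runs two common lexical checks on a paraphrase pair and on a wrong answer. The helper names `normalize`, `exact_match`, and `token_f1` are illustrative assumptions, not taken from the paper or from any library; the token-level F1 follows the familiar SQuAD-style formulation.

```python
import re
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, replace punctuation with spaces, and split on whitespace."""
    return re.sub(r"[^a-z0-9\s]", " ", text.lower()).split()


def exact_match(candidate: str, reference: str) -> bool:
    """True only if both texts have identical token sequences after normalization."""
    return normalize(candidate) == normalize(reference)


def token_f1(candidate: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall (bag-of-words overlap)."""
    cand, ref = normalize(candidate), normalize(reference)
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


reference = "Paris is the capital of France."
paraphrase = "The capital city of France is Paris."  # correct, reworded
wrong = "The capital of France is Berlin."           # incorrect, similar wording

print(exact_match(paraphrase, reference))         # False
print(round(token_f1(paraphrase, reference), 3))  # 0.923
print(round(token_f1(wrong, reference), 3))       # 0.833
```

Exact match rejects the correct paraphrase outright, and token overlap separates the correct answer from the wrong one by only a narrow margin. This is the gap a semantic judge is meant to close.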
Starter code
from transformers import AutoModel, AutoTokenizer
from sklearn.metrics.pairwise import cosine_similarity
import torch
# 1. Load pre-trained BERT model and tokenizer
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
# Helper function to get BERT embedding for a text
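def get_bert_embedding(text):
    """Return a mean-pooled BERT embedding for one text as a 1-D NumPy array."""
    inputs = tokenizer(text, return_tensors='pt', truncation=True, padding=True, max_length=512)
    with torch.no_grad():  # inference only, so skip gradient tracking
        outputs = model(**inputs)
    # Average the last-layer token vectors into a single sentence vector
    return outputs.last_hidden_state.mean(dim=1).squeeze().numpy()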
# 2. Example LLM output and reference answer
llm_output = "The capital city of France is Paris."
reference_answer = "Paris is the capital of France."
# 3. Get embeddings for both texts
llm_embedding = get_bert_embedding(llm_output)
reference_embedding = get_bert_embedding(reference_answer)
# 4. Calculate cosine similarity between embeddings
similarity_score = cosine_similarity(llm_embedding.reshape(1, -1), reference_embedding.reshape(1, -1))[0][0]
# 5. Print results
print(f"LLM Output: '{llm_output}'")
print(f"Reference Answer: '{reference_answer}'")
print(f"BERT Semantic Similarity Score: {similarity_score:.4f}")
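The last two steps of the play, turning scores into judgments and running them over an evaluation set, might look like the sketch below. It is an illustration under stated assumptions: `judge_batch`, `embed_fn`, `toy_embed`, and the `0.8` threshold are hypothetical names and values, not something the paper prescribes, and a real threshold would need to be calibrated on labeled examples. Any callable that returns a vector can serve as `embed_fn`, for instance `get_bert_embedding` from the starter code. A toy letter-frequency embedder stands in here only so the snippet runs without downloading a model; it carries no semantic information.

```python
import math
from typing import Callable, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def judge_batch(
    pairs: Sequence[tuple[str, str]],
    embed_fn: Callable[[str], Sequence[float]],
    threshold: float = 0.8,  # assumed value; calibrate on labeled data
) -> list[dict]:
    """Score each (output, reference) pair and flag it as passed or failed."""
    results = []
    for output, reference in pairs:
        score = cosine(embed_fn(output), embed_fn(reference))
        results.append({"output": output, "score": score, "passed": score >= threshold})
    return results


def toy_embed(text: str) -> list[float]:
    """Stand-in embedder (letter counts a-z). Replace with a real BERT embedder."""
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    return counts


pairs = [
    ("The capital city of France is Paris.", "Paris is the capital of France."),
    ("I do not know.", "Paris is the capital of France."),
]
results = judge_batch(pairs, toy_embed)
pass_rate = sum(r["passed"] for r in results) / len(results)

for r in results:
    print(f"{r['score']:.3f}  passed={r['passed']}  {r['output']}")
print(f"Pass rate: {pass_rate:.2f}")
```

Keeping the embedder as a parameter means the same loop can be reused when the encoder changes, and the aggregate pass rate gives one number to track across model versions.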