Paper·arxiv.org
llm · evaluation · research · machine-learning · embeddings · bert-as-a-judge

BERT-as-a-Judge: A Robust Alternative to Lexical Methods for Efficient Reference-Based LLM Evaluation

Evaluate LLM outputs by scoring their semantic similarity to reference answers with BERT embeddings. Lexical methods penalize correct answers that are worded differently from the reference; an embedding-based score can credit such paraphrases, giving a more useful signal for model selection and deployment.

intermediate · 15 min · 6 steps
The play
  1. Recognize Lexical Evaluation Gaps
    Understand why traditional lexical metrics (e.g., BLEU, ROUGE), which score n-gram overlap with the reference, fall short for generative LLM outputs: a correct answer can be phrased in many ways that share few tokens with the reference, and an incorrect one can share most of them.
  2. Gather LLM Output and Reference
    For a given prompt, obtain the Large Language Model's generated response and a corresponding human-written or gold-standard reference answer.
  3. Encode with BERT Embeddings
    Use a pre-trained BERT model (e.g., from Hugging Face Transformers) to convert both the LLM output and the reference answer into dense vector representations (embeddings). BERT produces one vector per token, so pool them (for example, by averaging) into a single vector per text that summarizes its meaning.
  4. Compute Semantic Similarity
    Calculate a similarity score, such as cosine similarity, between the BERT embeddings of the LLM output and the reference answer. A higher score indicates greater semantic alignment.
  5. Assess LLM Quality
    Treat the semantic similarity score as the 'judgment' from BERT: a proxy for how closely the LLM's response matches the reference in meaning rather than in surface wording. To turn scores into accept/reject decisions, pick a threshold by checking scores against examples whose quality you have already labeled.
  6. Integrate into Evaluation Pipeline
    Incorporate this BERT-based semantic similarity evaluation into your continuous LLM development and deployment workflows for more reliable model selection, fine-tuning, and performance monitoring.
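To make step 1 concrete, here is a minimal sketch of how overlap-based scoring can mislead. The `unigram_f1` helper is a simplified stand-in written for this illustration (it is not BLEU or ROUGE), and the three sentences are made-up examples. It only shows the lexical failure mode; it makes no claim about how any embedding model scores the same pairs.

```python
from collections import Counter
import re

def tokens(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def exact_match(candidate, reference):
    """True only if both texts yield the same token sequence."""
    return tokens(candidate) == tokens(reference)

def unigram_f1(candidate, reference):
    """Harmonic mean of unigram precision and recall (illustrative overlap metric)."""
    cand, ref = Counter(tokens(candidate)), Counter(tokens(reference))
    overlap = sum((cand & ref).values())  # shared tokens, counted with multiplicity
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

reference = "The meeting was postponed until next week."
paraphrase = "They delayed the meeting by seven days."         # same meaning, new words
wrong = "The meeting was not postponed until next week."       # opposite meaning, same words

paraphrase_f1 = unigram_f1(paraphrase, reference)
wrong_f1 = unigram_f1(wrong, reference)
print(f"paraphrase: exact={exact_match(paraphrase, reference)} f1={paraphrase_f1:.2f}")
print(f"wrong:      exact={exact_match(wrong, reference)} f1={wrong_f1:.2f}")
```

The paraphrase shares only two tokens with the reference and scores about 0.29, while the contradictory sentence scores about 0.93, so the overlap metric ranks the wrong answer above the right one.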
Starter code
from transformers import AutoModel, AutoTokenizer
from sklearn.metrics.pairwise import cosine_similarity
import torch

# 1. Load pre-trained BERT model and tokenizer
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

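# Helper: mean-pool BERT's token vectors into a single embedding for a text
def get_bert_embedding(text):
    inputs = tokenizer(text, return_tensors='pt', truncation=True, padding=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    # Weight by the attention mask so padding tokens (if any) do not skew the mean
    mask = inputs['attention_mask'].unsqueeze(-1).float()
    pooled = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
    return pooled.squeeze(0).numpy()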

# 2. Example LLM output and reference answer
llm_output = "The capital city of France is Paris."
reference_answer = "Paris is the capital of France."

# 3. Get embeddings for both texts
llm_embedding = get_bert_embedding(llm_output)
reference_embedding = get_bert_embedding(reference_answer)

# 4. Calculate cosine similarity between embeddings
similarity_score = cosine_similarity(llm_embedding.reshape(1, -1), reference_embedding.reshape(1, -1))[0][0]

# 5. Print results
print(f"LLM Output: '{llm_output}'")
print(f"Reference Answer: '{reference_answer}'")
print(f"BERT Semantic Similarity Score: {similarity_score:.4f}")
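For step 6, one possible shape for a batch evaluation loop is sketched below. Everything here is an assumption made for illustration: the names `judge_batch`, `pass_rate`, and `toy_embed` are hypothetical, the 0.8 threshold is arbitrary and would need calibrating on labeled examples, and `toy_embed` is a word-count stand-in so the sketch runs without downloading a model. In practice a real embedder such as `get_bert_embedding` from the starter code could be passed as `embed`.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]

def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity: dot(u, v) / (|u| * |v|); 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def judge_batch(pairs, embed: Callable[[str], Vector], threshold: float):
    """Score each (output, reference) pair and flag those at or above the threshold."""
    results = []
    for output, reference in pairs:
        score = cosine(embed(output), embed(reference))
        results.append({"output": output, "reference": reference,
                        "score": score, "passed": score >= threshold})
    return results

def pass_rate(results) -> float:
    """Fraction of pairs that met the threshold (0.0 for an empty batch)."""
    return sum(r["passed"] for r in results) / len(results) if results else 0.0

# Hypothetical stand-in embedder: counts a few hand-picked words. Not a real model.
def toy_embed(text: str) -> Vector:
    vocab = ["paris", "france", "capital", "berlin", "germany"]
    words = text.lower().replace(".", "").split()
    return [float(words.count(w)) for w in vocab]

pairs = [
    ("The capital city of France is Paris.", "Paris is the capital of France."),
    ("Berlin is in Germany.", "Paris is the capital of France."),
]
results = judge_batch(pairs, toy_embed, threshold=0.8)  # threshold is an arbitrary example
for r in results:
    print(f"{r['score']:.2f} passed={r['passed']} :: {r['output']}")
print(f"pass rate: {pass_rate(results):.2f}")
```

Keeping the embedder as a parameter means the same loop can be reused when the underlying model changes, and the per-pair records can be logged for monitoring alongside the aggregate pass rate.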
Source
BERT-as-a-Judge: A Robust Alternative to Lexical Methods for Efficient Reference-Based LLM Evaluation — Action Pack