Paper·arxiv.org
llm · evaluation · research · automation · machine-learning · ai-agents · fine-tuning · context-engineering

Measuring What Matters -- or What's Convenient?: Robustness of LLM-Based Scoring Systems to Construct-Irrelevant Factors

LLM-based scoring systems, despite high performance, are vulnerable to "construct-irrelevant factors"—elements unrelated to the actual skill being measured. This vulnerability compromises validity and fairness in automated assessment, highlighting the need for robust evaluation beyond superficial metrics.

intermediate · 2 hours · 6 steps
The play
  1. Acknowledge Construct-Irrelevant Factor Vulnerability
    Understand that your LLM-based scoring system, regardless of its performance metrics, is susceptible to construct-irrelevant factors (CIFs) that can bias scores without reflecting the true underlying construct.
  2. Identify Domain-Specific CIFs
    Brainstorm and document the CIFs relevant to your application. In educational essay scoring, for instance, these could include politeness, stylistic choices, or response length, whenever they do not contribute to the skill being assessed.
  3. Design & Execute Sensitivity Tests
    Create test cases that systematically vary identified CIFs while keeping the core, intended construct constant. For example, provide identical content with different stylistic wrappers (e.g., polite vs. rude tone, verbose vs. concise irrelevant introductions).
  4. Implement Interpretability Tools
    Utilize AI interpretability tools (e.g., LIME, SHAP, attention visualization) to analyze which parts of the input an LLM prioritizes when making scoring decisions. This helps reveal if the model is over-relying on CIFs.
  5. Conduct Adversarial Testing
    Develop or employ adversarial techniques to deliberately probe for CIF vulnerabilities. Generate or modify inputs to trick the LLM into assigning incorrect scores based on irrelevant cues rather than actual merit.
  6. Report Limitations & Iterate
    Document identified CIFs, their impact on scoring, and implemented mitigation strategies. Establish a continuous monitoring and refinement process to ensure ongoing robustness, fairness, and validity of your LLM scoring system.
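Step 4 can be approximated without model internals. The sketch below uses leave-one-out occlusion, a simple perturbation method swapped in here in place of LIME or SHAP: it removes one sentence at a time and records how much the score drops. The scorer `toy_score`, its planted cues, and the example response are all hypothetical stand-ins, not the paper's system or any real model.

```python
from typing import Callable, List, Tuple


def toy_score(text: str) -> float:
    """Hypothetical deterministic stand-in for an LLM scorer (assumption, not a real model)."""
    lowered = text.lower()
    score = 0.5
    if "chlorophyll" in lowered:      # construct-relevant cue
        score += 0.2
    if "excellent work" in lowered:   # construct-irrelevant cue (flattery)
        score += 0.2
    return round(min(1.0, score), 2)


def occlusion_attribution(
    sentences: List[str], score_fn: Callable[[str], float]
) -> List[Tuple[str, float]]:
    """Return (sentence, score drop when that sentence is removed) pairs."""
    full_score = score_fn(" ".join(sentences))
    results = []
    for i, sentence in enumerate(sentences):
        remaining = sentences[:i] + sentences[i + 1:]
        drop = full_score - score_fn(" ".join(remaining))
        results.append((sentence, round(drop, 2)))
    return results


response = [
    "Excellent work, I must say!",
    "Photosynthesis lets plants make food from sunlight.",
    "It depends on chlorophyll.",
]
attributions = occlusion_attribution(response, toy_score)
for sentence, drop in attributions:
    print(f"{drop:+.2f}  {sentence}")
```

If a sentence that carries no subject-matter content accounts for as large a drop as one that does, that is a signal worth investigating. With a real scorer, swap `toy_score` for a call to your own system and average over repeated calls if its output is not deterministic.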
Starter code
import random

random.seed(0)  # fixed seed so repeated runs print the same scores

def mock_llm_score(text: str) -> float:
    """A mock LLM scoring function to demonstrate sensitivity testing."""
    lowered = text.lower()

    # Simulate a base score driven by content length (a common, sometimes irrelevant, factor).
    # The length term saturates at 100 characters, so the base score lies in [0.3, 0.8].
    base_score = min(1.0, len(text) / 100.0) * 0.5 + 0.3

    # Simulate the influence of an irrelevant factor: flattery or rudeness
    if "excellent work" in lowered or "truly insightful" in lowered:
        base_score += 0.2
    elif "terrible" in lowered or "poorly done" in lowered:
        base_score -= 0.2

    # Add small random noise, then clamp the score to the 0-1 range
    return max(0.0, min(1.0, base_score + random.uniform(-0.05, 0.05)))

# Core content (e.g., a student's answer to a question)
core_content = "Photosynthesis is the process by which green plants and some other organisms use sunlight to synthesize foods with the help of chlorophyll."

# Test Case 1: Neutral framing
neutral_input = core_content
score_neutral = mock_llm_score(neutral_input)
print(f"Neutral input score: {score_neutral:.2f}")

# Test Case 2: Flattering, construct-irrelevant framing
flattery_input = "Excellent work, I must say! This is a truly insightful answer: " + core_content + " I am very impressed."
score_flattery = mock_llm_score(flattery_input)
print(f"Flattery input score: {score_flattery:.2f}")

# Test Case 3: Rude, construct-irrelevant framing
rude_input = "This is a terrible answer, but here it is: " + core_content + " You can do better."
score_rude = mock_llm_score(rude_input)
print(f"Rude input score: {score_rude:.2f}")

print("\n--- Analysis ---")
if abs(score_flattery - score_neutral) > 0.1 or abs(score_rude - score_neutral) > 0.1:
    print("Warning: The mock LLM appears sensitive to construct-irrelevant factors (flattery/rudeness).")
    print("This indicates a potential bias that needs further investigation in a real system.")
else:
    print("The mock LLM appears robust to these specific construct-irrelevant factors.")
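The adversarial testing step can be sketched as a greedy search: starting from a fixed answer, repeatedly append whichever construct-irrelevant snippet raises the score most, and stop when nothing helps. This is one simple probing strategy among many, not the paper's procedure. The scorer `toy_score`, its planted biases (a flattery phrase and a length threshold), and the candidate snippets are all hypothetical.

```python
from typing import Callable, List, Tuple


def toy_score(text: str) -> float:
    """Hypothetical deterministic scorer with planted biases (assumption, not a real model)."""
    lowered = text.lower()
    score = 0.4
    if "chlorophyll" in lowered:        # construct-relevant cue
        score += 0.3
    if "truly insightful" in lowered:   # construct-irrelevant cue (flattery)
        score += 0.2
    if len(text) > 200:                 # construct-irrelevant cue (length)
        score += 0.1
    return round(min(1.0, score), 2)


def greedy_cif_attack(
    core: str,
    candidates: List[str],
    score_fn: Callable[[str], float],
    max_additions: int = 2,
) -> Tuple[str, List[str], float]:
    """Greedily append the irrelevant snippet that raises the score most."""
    text, used = core, []
    current = score_fn(text)
    for _ in range(max_additions):
        best_gain, best_snippet = 0.0, None
        for snippet in candidates:
            if snippet in used:
                continue
            gain = score_fn(text + " " + snippet) - current
            if gain > best_gain + 1e-9:
                best_gain, best_snippet = gain, snippet
        if best_snippet is None:  # no remaining snippet improves the score
            break
        text = text + " " + best_snippet
        used.append(best_snippet)
        current = score_fn(text)
    return text, used, current


core = "Plants use chlorophyll to capture sunlight and make food."
flattery = "This is a truly insightful answer."
padding = ("As I was saying, " * 10).strip()
candidates = [flattery, "Thank you for reading.", padding]

baseline = toy_score(core)
attacked_text, used, final_score = greedy_cif_attack(core, candidates, toy_score)
print(f"Baseline score: {baseline:.2f}")
print(f"Score after {len(used)} irrelevant additions: {final_score:.2f}")
```

Because the core answer is never altered, any gain over the baseline is attributable to the appended material alone. The snippets the search selects, and the order in which it selects them, give a rough ranking of which irrelevant cues the scorer is most sensitive to.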
Source
Measuring What Matters -- or What's Convenient?: Robustness of LLM-Based Scoring Systems to Construct-Irrelevant Factors — Action Pack