Measuring What Matters -- or What's Convenient?: Robustness of LLM-Based Scoring Systems to Construct-Irrelevant Factors

LLM-based scoring systems can be swayed by construct-irrelevant factors, that is, features of a response unrelated to the skill being measured, which makes their scores unreliable and unfair. This Action Pack shows you how to identify these factors, create adversarial test cases, and evaluate your LLM's robustness so you can check whether its scoring is accurate and equitable.

intermediate | 15 min | 3 steps

The play

Identify Irrelevant Factors
Analyze your domain, task, and rubric to pinpoint factors that *should not* influence scores but might (e.g., writing style, length, politeness, specific keywords). Hypothesize specific elements unrelated to the core skill being measured.
Craft Adversarial Test Cases
Generate text examples where construct-irrelevant factors are systematically manipulated, while the core construct's quality remains constant. Create pairs or sets of texts differing only in the irrelevant factor.
Evaluate LLM Robustness
Run your LLM-based scoring system on both the original texts and the perturbed texts created in Step 2. Because the core construct's quality is held constant within each pair, any difference in scores quantifies the LLM's sensitivity to the construct-irrelevant factor.
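
One way to turn the comparison in Step 3 into numbers is sketched below. It assumes you already have paired scores, one for each original text and one for its perturbed twin. The function name `sensitivity_summary`, the `tolerance` parameter, and the sample scores are hypothetical and chosen for illustration; they do not come from any particular scoring system.

```python
from statistics import mean


def sensitivity_summary(original_scores, perturbed_scores, tolerance=0.0):
    """Summarize how far paired scores move when only an irrelevant factor changes."""
    if len(original_scores) != len(perturbed_scores):
        raise ValueError("score lists must be paired one-to-one")
    if not original_scores:
        raise ValueError("need at least one score pair")

    # Per-pair score change: perturbed minus original.
    deltas = [p - o for o, p in zip(original_scores, perturbed_scores)]
    return {
        # Signed average: shows whether the factor pushes scores up or down.
        "mean_shift": mean(deltas),
        # Unsigned average: shows how large the movement is, whatever its direction.
        "mean_abs_shift": mean(abs(d) for d in deltas),
        # Share of pairs whose score moved by more than the tolerance.
        "changed_rate": sum(abs(d) > tolerance for d in deltas) / len(deltas),
    }


# Hypothetical rubric scores on a 1-5 scale, for illustration only.
original = [3, 4, 2, 5]
perturbed = [4, 4, 3, 5]
summary = sensitivity_summary(original, perturbed)
print(summary)
```

A robust scorer should keep all three values close to zero. A consistently positive or negative `mean_shift` suggests a systematic bias toward or against the manipulated factor.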

Starter code

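import random

random.seed(0)  # Fixed seed so the perturbed examples are reproducible.

def perturb_text_style(text: str, factor: str = "verbosity") -> str:
    """Applies a construct-irrelevant surface change (verbosity or politeness) to text.

    The claim the text makes is left intact. Unknown factors return the text unchanged.
    """
    if factor == "verbosity":
        fillers = [
            "Additionally, it is important to consider that ",
            "Furthermore, it is worth noting that ",
            "In conclusion, the following point can be made: "
        ]
        if not text:
            return text
        # Prepend a content-free lead-in so the sentence stays grammatical.
        # Lowercasing the first character is a naive simplification; it will
        # mangle texts that start with a proper noun or "I".
        return random.choice(fillers) + text[0].lower() + text[1:]
    elif factor == "politeness":
        # Illustrative blunt-to-polite phrase pairs; extend these for your domain.
        softeners = {
            "This is wrong.": "It appears there might be an alternative perspective.",
            "is inefficient": "may not be as efficient as it could be",
            "lacks scalability": "could be more scalable",
        }
        for blunt, polite in softeners.items():
            text = text.replace(blunt, polite)
        return text
    return text

original_text = "The proposed solution is inefficient and lacks scalability."
perturbed_text_verbosity = perturb_text_style(original_text, factor="verbosity")
perturbed_text_politeness = perturb_text_style(original_text, factor="politeness")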

print(f"Original: {original_text}")
print(f"Perturbed (Verbosity): {perturbed_text_verbosity}")
print(f"Perturbed (Politeness): {perturbed_text_politeness}")

# Integrate your LLM scoring here to compare scores for original vs. perturbed texts.
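
The integration step might look like the following minimal sketch. It is self-contained, so it defines its own stand-ins: `toy_length_biased_scorer` is a hypothetical placeholder for your real LLM scoring call (it deliberately rewards length so the effect is visible), `add_filler` is a hypothetical perturbation, and `compare_scores` is an illustrative helper name. Swap in your own scorer and perturbation functions.

```python
from typing import Callable, Dict, List


def toy_length_biased_scorer(text: str) -> float:
    """Hypothetical stand-in for an LLM scorer: rewards longer answers, capped at 5."""
    return min(5.0, 1.0 + len(text.split()) / 5)


def add_filler(text: str) -> str:
    """Hypothetical perturbation: prepend a content-free lead-in."""
    return "Furthermore, it is worth noting that " + text


def compare_scores(
    texts: List[str],
    perturb: Callable[[str], str],
    score: Callable[[str], float],
) -> List[Dict]:
    """Score each text and its perturbed twin, recording the difference."""
    rows = []
    for text in texts:
        perturbed = perturb(text)
        original_score = score(text)
        perturbed_score = score(perturbed)
        rows.append({
            "original": text,
            "perturbed": perturbed,
            "original_score": original_score,
            "perturbed_score": perturbed_score,
            "delta": perturbed_score - original_score,
        })
    return rows


texts = ["The proposed solution is inefficient and lacks scalability."]
rows = compare_scores(texts, add_filler, toy_length_biased_scorer)
for row in rows:
    print(
        f"{row['original_score']:.1f} -> {row['perturbed_score']:.1f} "
        f"(delta {row['delta']:+.1f})"
    )
```

Passing the scorer in as a callable keeps the harness independent of any particular model or API, so the same loop can be reused for each construct-irrelevant factor you test.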