From Feelings to Metrics: Understanding and Formalizing How Users Vibe-Test LLMs

Formalize informal LLM 'vibe-testing' by defining workflow-specific tasks and quantifiable User Experience (UX) metrics. Scoring real tasks against a shared rubric turns qualitative impressions into data that can be compared across evaluators, making LLM evaluation more reliable and practical.

intermediate · 1 hour · 3 steps

The play

Define Workflow-Specific Tasks
Interview stakeholders to identify real-world LLM use cases and pain points. Create detailed test scenarios with specific prompts and desired output characteristics that mirror actual user workflows. For example: 'Summarize this 10-page report for a busy executive, highlighting key decisions needed.'
Establish Quantifiable UX Metrics
Translate subjective 'vibe' into measurable User Experience (UX) metrics. Define a scoring rubric (e.g., a 1-5 scale) for dimensions like Relevance, Coherence, Tone, Completeness, Conciseness, and Helpfulness. State what each score means on each dimension (for Conciseness, say, 1 = verbose and 5 = to the point) so that different evaluators rate consistently.
Conduct User Evaluation & Analyze Results
Have target users evaluate LLM outputs against the defined tasks and UX metrics. Collect and aggregate scores from multiple evaluators. Analyze the data to quantify user satisfaction, identify performance gaps, and turn subjective feedback into actionable insights for LLM improvement.
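
The aggregation in step 3 can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the `ratings` data is made up, the `aggregate` helper and its `gap_threshold` of 3.5 are assumptions chosen for the example, and the metric names are borrowed from the starter scenario.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical ratings: one dict per evaluator, mapping metric -> score (1-5).
ratings = [
    {"relevance": 5, "conciseness": 3, "actionability": 4, "tone": 5},
    {"relevance": 4, "conciseness": 2, "actionability": 4, "tone": 5},
    {"relevance": 5, "conciseness": 2, "actionability": 3, "tone": 4},
]


def aggregate(ratings, gap_threshold=3.5):
    """Return per-metric mean, spread, and a flag for means below the threshold."""
    by_metric = defaultdict(list)
    for rating in ratings:
        for metric, score in rating.items():
            if not 1 <= score <= 5:
                raise ValueError(f"{metric} score {score} is outside the 1-5 scale")
            by_metric[metric].append(score)

    summary = {}
    for metric, scores in by_metric.items():
        avg = mean(scores)
        summary[metric] = {
            "mean": round(avg, 2),
            # stdev needs at least two scores; a wide spread can hint at an unclear rubric.
            "stdev": round(stdev(scores), 2) if len(scores) > 1 else 0.0,
            # The threshold is an illustrative choice, not a standard cutoff.
            "gap": avg < gap_threshold,
        }
    return summary


summary = aggregate(ratings)
for metric, stats in summary.items():
    print(metric, stats)
```

With this sample data only conciseness falls below the threshold, which is the kind of signal that turns "the summaries feel too long" into a specific, trackable gap. Reporting the spread alongside the mean helps separate a weak output from a rubric that evaluators interpret differently.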

Starter code

{
  "evaluation_scenario": {
    "id": "scenario_001",
    "name": "Executive Report Summary",
    "prompt": "Summarize this 10-page financial report for a busy executive, focusing on key decisions and risks.",
    "expected_output_characteristics": "Concise, actionable, highlights critical financial figures, identifies potential risks.",
    "metrics_to_score": {
      "relevance": "Scale 1-5 (1=not relevant, 5=highly relevant)",
      "conciseness": "Scale 1-5 (1=verbose, 5=to the point)",
      "actionability": "Scale 1-5 (1=not actionable, 5=highly actionable)",
      "tone": "Scale 1-5 (1=inappropriate, 5=professional)"
    }
  }
}
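
One way to put the starter scenario to work is to check each evaluator's submission against it before aggregating. The sketch below assumes the scenario is available as a JSON string (abbreviated here to the fields it uses) and that scores are whole numbers from 1 to 5; `validate_rating` is a hypothetical helper, not part of any library.

```python
import json

# The starter scenario, abbreviated to the fields this sketch reads.
SCENARIO_JSON = """
{
  "evaluation_scenario": {
    "id": "scenario_001",
    "name": "Executive Report Summary",
    "metrics_to_score": {
      "relevance": "Scale 1-5 (1=not relevant, 5=highly relevant)",
      "conciseness": "Scale 1-5 (1=verbose, 5=to the point)",
      "actionability": "Scale 1-5 (1=not actionable, 5=highly actionable)",
      "tone": "Scale 1-5 (1=inappropriate, 5=professional)"
    }
  }
}
"""


def validate_rating(scenario, rating):
    """Return a list of problems with one evaluator's rating; empty means usable."""
    expected = set(scenario["evaluation_scenario"]["metrics_to_score"])
    given = set(rating)
    problems = []
    for metric in sorted(expected - given):
        problems.append(f"missing score for {metric}")
    for metric in sorted(given - expected):
        problems.append(f"unknown metric {metric}")
    for metric in sorted(expected & given):
        score = rating[metric]
        # Assumption: scores are integers on the 1-5 scale named in the scenario.
        if type(score) is not int or not 1 <= score <= 5:
            problems.append(f"{metric} must be an integer from 1 to 5, got {score!r}")
    return problems


scenario = json.loads(SCENARIO_JSON)
good = {"relevance": 5, "conciseness": 3, "actionability": 4, "tone": 5}
bad = {"relevance": 6, "tone": 4, "humor": 2}

print(validate_rating(scenario, good))
print(validate_rating(scenario, bad))
```

Rejecting incomplete or out-of-range ratings at collection time keeps the aggregated numbers comparable across evaluators.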