From Feelings to Metrics: Understanding and Formalizing How Users Vibe-Test LLMs

Formalize how users "vibe-test" LLMs by structuring real-world task evaluation. Bridge the gap between informal user perception and rigorous, quantifiable metrics to develop more practically useful models.

intermediate · 1 hour · 5 steps

The play

1. Identify Key User Workflows & Tasks
Pinpoint specific, real-world tasks where your LLM will be used. These are the scenarios users will informally 'vibe-test' the model against in their daily work.
2. Define Qualitative Evaluation Criteria
Brainstorm and list the subjective qualities (e.g., 'helpful tone', 'accurate facts', 'easy to understand', 'relevant output', 'efficiency gain') that contribute to a positive user 'vibe'. These will form the basis of your feedback structure.
3. Design a Structured Feedback Mechanism
Create a simple form or survey for users to provide feedback on the LLM's performance for each defined task, using the qualitative criteria from Step 2. Include a Likert scale or similar for quantifying subjective experience.
4. Conduct Task-Specific User Trials
Recruit target users to perform the identified tasks using your LLM. Collect their feedback using your structured mechanism. Encourage open-ended comments for richer insights.
5. Integrate Vibe-Test Data with Benchmarks
Analyze the collected qualitative and quantitative 'vibe-test' data. Combine these insights with traditional performance benchmarks to create a comprehensive evaluation framework. Use this holistic view to prioritize LLM improvements.
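
One way to picture the last step is a small script that puts Likert ratings and benchmark scores on the same 0-1 scale and reports the gap between them per task. The sketch below is illustrative only: the response rows, the benchmark values, the helper names, and the 50/50 weighting are all assumptions made for the example, not something the paper prescribes.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical trial responses: (task, criterion, rating on a 1-5 Likert scale).
responses = [
    ("Drafting a marketing email", "helpful_tone", 4),
    ("Drafting a marketing email", "helpful_tone", 2),
    ("Drafting a marketing email", "relevant_output", 5),
    ("Summarizing a report", "helpful_tone", 5),
    ("Summarizing a report", "relevant_output", 3),
]

# Hypothetical benchmark scores per task, assumed to be scaled to 0-1 already.
benchmark_scores = {
    "Drafting a marketing email": 0.80,
    "Summarizing a report": 0.60,
}


def normalize_likert(rating, low=1, high=5):
    """Map a Likert rating onto 0-1 so it is comparable to a benchmark score."""
    return (rating - low) / (high - low)


def vibe_scores(rows):
    """Average the normalized ratings per task across all criteria."""
    by_task = defaultdict(list)
    for task, _criterion, rating in rows:
        by_task[task].append(normalize_likert(rating))
    return {task: mean(values) for task, values in by_task.items()}


def combined_report(rows, benchmarks, vibe_weight=0.5):
    """Blend vibe and benchmark scores; the default weighting is an arbitrary assumption."""
    report = {}
    for task, vibe in vibe_scores(rows).items():
        bench = benchmarks[task]
        report[task] = {
            "vibe": round(vibe, 3),
            "benchmark": bench,
            "combined": round(vibe_weight * vibe + (1 - vibe_weight) * bench, 3),
            # Positive gap: the benchmark looks better than users feel.
            "gap": round(bench - vibe, 3),
        }
    return report


report = combined_report(responses, benchmark_scores)

# List tasks with the largest benchmark-over-vibe gap first.
for task, row in sorted(report.items(), key=lambda kv: kv[1]["gap"], reverse=True):
    print(task, row)
```

Sorting by the gap puts tasks where the benchmark overstates user experience at the top, which is one possible way to prioritize improvements. A real analysis would likely keep per-criterion averages and response counts as well, since a mean over a handful of ratings is noisy.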

Starter code

```json
{
  "task_name": "Drafting a marketing email",
  "llm_version": "v3.1",
  "user_id": "user_abc",
  "feedback": {
    "output_relevance": {
      "rating": 4,
      "comment": "Output was mostly relevant, but needed minor tweaks for tone."
    },
    "ease_of_use": {
      "rating": 5,
      "comment": "Prompting was straightforward."
    },
    "overall_satisfaction": {
      "rating": 4,
      "comment": "Good starting point, saved me time."
    },
    "suggested_improvements": [
      "Make tone more persuasive by default."
    ]
  }
}
```
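
Before analyzing records like the one above, it can help to check that each one is complete and that its ratings fall on the expected scale. The following is a minimal sketch of such a check. The function name `validate_record`, the list of rated fields, and the 1-5 scale are assumptions inferred from the sample record, not a schema defined by the source.

```python
import json

# Rated fields taken from the sample record; adjust to your own criteria.
RATED_FIELDS = ("output_relevance", "ease_of_use", "overall_satisfaction")


def validate_record(record, scale=(1, 5)):
    """Return a list of problems found in one feedback record (empty if none)."""
    problems = []
    for key in ("task_name", "llm_version", "user_id", "feedback"):
        if key not in record:
            problems.append(f"missing field: {key}")

    feedback = record.get("feedback", {})
    low, high = scale  # Assumed 1-5 Likert scale.
    for field in RATED_FIELDS:
        entry = feedback.get(field)
        if entry is None:
            problems.append(f"missing rating block: {field}")
            continue
        rating = entry.get("rating")
        # bool is a subclass of int in Python, so reject it explicitly.
        is_int = isinstance(rating, int) and not isinstance(rating, bool)
        if not is_int or not low <= rating <= high:
            problems.append(f"{field}.rating must be an integer from {low} to {high}")
    return problems


# Hypothetical record with one out-of-range rating and one missing block.
raw = """
{
  "task_name": "Drafting a marketing email",
  "llm_version": "v3.1",
  "user_id": "user_abc",
  "feedback": {
    "output_relevance": {"rating": 4, "comment": ""},
    "ease_of_use": {"rating": 7, "comment": ""}
  }
}
"""

problems = validate_record(json.loads(raw))
for problem in problems:
    print(problem)
```

Returning a list of problems rather than raising on the first one lets a trial coordinator see everything wrong with a submission at once.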

Source

Paper: arxiv.org