Paper·arxiv.org
llm · evaluation · research · ai-agents · prompt-engineering

From Feelings to Metrics: Understanding and Formalizing How Users Vibe-Test LLMs

Formalize how users "vibe-test" LLMs by structuring real-world task evaluation. Bridge the gap between informal user perception and rigorous, quantifiable metrics to develop more practically useful models.

intermediate · 1 hour · 5 steps
The play
  1. Identify Key User Workflows & Tasks
    Pinpoint specific, real-world tasks where your LLM will be used. These are the scenarios users will informally 'vibe-test' the model against in their daily work.
  2. Define Qualitative Evaluation Criteria
    Brainstorm and list the qualities users judge subjectively (e.g., 'helpful tone', 'perceived accuracy', 'easy to understand', 'relevant output', 'efficiency gain') that contribute to a positive user 'vibe'. These criteria become the fields of your feedback form.
  3. Design a Structured Feedback Mechanism
    Create a simple form or survey in which users rate the LLM's performance on each defined task against the qualitative criteria from Step 2. Include a Likert scale or similar so that subjective impressions can be quantified and compared across users.
  4. Conduct Task-Specific User Trials
    Recruit target users to perform the identified tasks using your LLM. Collect their feedback using your structured mechanism. Encourage open-ended comments for richer insights.
  5. Integrate Vibe-Test Data with Benchmarks
    Analyze the collected qualitative and quantitative 'vibe-test' data. Combine these insights with traditional performance benchmarks to create a comprehensive evaluation framework. Use this holistic view to prioritize LLM improvements.
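
A minimal sketch of how steps 3 to 5 might fit together, assuming a 1-5 Likert scale and a benchmark score already normalized to the range 0 to 1. The task names, criterion names, helper functions (`summarize_ratings`, `combine_with_benchmark`), and the equal weighting are all hypothetical choices for illustration, not something the paper prescribes.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical trial responses: one dict per user per task (step 4).
# Ratings are assumed to be on a 1-5 Likert scale (step 3).
responses = [
    {"task": "email_draft", "ratings": {"helpful_tone": 4, "relevance": 5}},
    {"task": "email_draft", "ratings": {"helpful_tone": 2, "relevance": 4}},
    {"task": "code_review", "ratings": {"helpful_tone": 3, "relevance": 3}},
]


def summarize_ratings(responses):
    """Return {task: {criterion: mean rating}} across all responses."""
    buckets = defaultdict(lambda: defaultdict(list))
    for response in responses:
        for criterion, rating in response["ratings"].items():
            buckets[response["task"]][criterion].append(rating)
    return {
        task: {criterion: mean(values) for criterion, values in criteria.items()}
        for task, criteria in buckets.items()
    }


def combine_with_benchmark(vibe_mean, benchmark_score, vibe_weight=0.5, scale_max=5):
    """Blend a mean Likert rating with a benchmark score in [0, 1].

    The Likert mean is mapped onto [0, 1] first; the weighting is an
    arbitrary illustrative choice (step 5).
    """
    vibe_normalized = (vibe_mean - 1) / (scale_max - 1)
    return vibe_weight * vibe_normalized + (1 - vibe_weight) * benchmark_score


summary = summarize_ratings(responses)
for task, criteria in summary.items():
    print(task, criteria)
```

Keeping the per-criterion means separate, rather than collapsing them into one number straight away, makes it easier to see which quality is pulling a task's 'vibe' down before any blending with benchmark scores.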
Starter code
```json
{
  "task_name": "Drafting a marketing email",
  "llm_version": "v3.1",
  "user_id": "user_abc",
  "feedback": {
    "output_relevance": {
      "rating": 4,
      "comment": "Output was mostly relevant, but needed minor tweaks for tone."
    },
    "ease_of_use": {
      "rating": 5,
      "comment": "Prompting was straightforward."
    },
    "overall_satisfaction": {
      "rating": 4,
      "comment": "Good starting point, saved me time."
    },
    "suggested_improvements": [
      "Make tone more persuasive by default."
    ]
  }
}
```
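
Records in this shape can be checked before they are analyzed. The sketch below assumes ratings are integers on a 1-5 scale (the starter record does not state its scale) and that the three rated fields shown above are required; `extract_ratings` and `RATED_FIELDS` are hypothetical names introduced for illustration.

```python
import json

# Assumption: these three fields carry a rating in every record.
RATED_FIELDS = ("output_relevance", "ease_of_use", "overall_satisfaction")


def extract_ratings(record_json, scale=(1, 5)):
    """Parse one feedback record and return {criterion: rating}.

    Raises KeyError if a rated field is missing and ValueError if a
    rating is not an integer within the assumed scale.
    """
    record = json.loads(record_json)
    feedback = record["feedback"]
    low, high = scale
    ratings = {}
    for field in RATED_FIELDS:
        rating = feedback[field]["rating"]
        if not isinstance(rating, int) or not low <= rating <= high:
            raise ValueError(f"{field}: rating {rating!r} is outside {low}-{high}")
        ratings[field] = rating
    return ratings


# Usage with a trimmed-down record in the starter format.
sample = json.dumps({
    "task_name": "Drafting a marketing email",
    "feedback": {
        "output_relevance": {"rating": 4, "comment": ""},
        "ease_of_use": {"rating": 5, "comment": ""},
        "overall_satisfaction": {"rating": 4, "comment": ""},
    },
})
print(extract_ratings(sample))
```

Rejecting out-of-range ratings at load time keeps a single malformed submission from skewing the per-task averages later on.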
Source
From Feelings to Metrics: Understanding and Formalizing How Users Vibe-Test LLMs — Action Pack