Paper · arxiv.org
llm · evaluation · research · ai-agents · prompt-engineering
From Feelings to Metrics: Understanding and Formalizing How Users Vibe-Test LLMs
Formalize how users "vibe-test" LLMs by structuring real-world task evaluation. Bridge the gap between informal user perception and rigorous, quantifiable metrics to develop more practically useful models.
intermediate · 1 hour · 5 steps
The play
- Identify Key User Workflows & Tasks: Pinpoint specific, real-world tasks where your LLM will be used. These are the scenarios users will informally "vibe-test" the model against in their daily work.
- Define Qualitative Evaluation Criteria: Brainstorm and list the subjective qualities (e.g., "helpful tone", "accurate facts", "easy to understand", "relevant output", "efficiency gain") that contribute to a positive user "vibe". These will form the basis of your feedback structure.
- Design a Structured Feedback Mechanism: Create a simple form or survey for users to provide feedback on the LLM's performance for each defined task, using the qualitative criteria from Step 2. Include a Likert scale or similar to quantify the subjective experience.
- Conduct Task-Specific User Trials: Recruit target users to perform the identified tasks using your LLM. Collect their feedback using your structured mechanism. Encourage open-ended comments for richer insights.
- Integrate Vibe-Test Data with Benchmarks: Analyze the collected qualitative and quantitative "vibe-test" data. Combine these insights with traditional performance benchmarks to create a comprehensive evaluation framework. Use this holistic view to prioritize LLM improvements.
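The last step can be sketched in a few lines of Python. This is only one possible way to place vibe-test ratings next to benchmark scores, not a method taken from the paper. The task names, ratings, benchmark values, the 1-5 scale, and the 0.2 gap threshold are all hypothetical placeholders. Averaging Likert ratings also treats ordinal data as numeric, which is a simplification.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical trial responses: (task, criterion, rating on an assumed 1-5 scale).
responses = [
    ("Drafting a marketing email", "output_relevance", 4),
    ("Drafting a marketing email", "output_relevance", 3),
    ("Drafting a marketing email", "ease_of_use", 5),
    ("Summarizing a report", "output_relevance", 2),
    ("Summarizing a report", "ease_of_use", 4),
]

# Hypothetical benchmark scores per task, assumed to be scaled to 0-1 already.
benchmark = {"Drafting a marketing email": 0.70, "Summarizing a report": 0.90}


def vibe_scores(rows):
    """Average all ratings per task and rescale them from 1-5 to 0-1."""
    by_task = defaultdict(list)
    for task, _criterion, rating in rows:
        by_task[task].append(rating)
    return {task: (mean(vals) - 1) / 4 for task, vals in by_task.items()}


def flag_gaps(vibe, bench, threshold=0.2):
    """Return tasks whose benchmark score exceeds the vibe score by more than threshold."""
    return sorted(t for t in vibe if t in bench and bench[t] - vibe[t] > threshold)


vibe = vibe_scores(responses)
gaps = flag_gaps(vibe, benchmark)
print(vibe)
print("Benchmark looks better than users feel:", gaps)
```

Tasks that end up in `gaps` are ones where the benchmark paints a rosier picture than users report, which makes them natural candidates when prioritizing improvements.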
Starter code
```json
{
  "task_name": "Drafting a marketing email",
  "llm_version": "v3.1",
  "user_id": "user_abc",
  "feedback": {
    "output_relevance": {
      "rating": 4,
      "comment": "Output was mostly relevant, but needed minor tweaks for tone."
    },
    "ease_of_use": {
      "rating": 5,
      "comment": "Prompting was straightforward."
    },
    "overall_satisfaction": {
      "rating": 4,
      "comment": "Good starting point, saved me time."
    },
    "suggested_improvements": [
      "Make tone more persuasive by default."
    ]
  }
}
```
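A record shaped like the one above could be checked and summarized with a short Python helper. This is a minimal sketch under stated assumptions: the function name `summarize_record` is hypothetical, and the 1-5 rating range is assumed, since the play only asks for "a Likert scale or similar".

```python
import json
from statistics import mean

# Assumption: ratings use a 1-5 Likert scale.
LIKERT_MIN, LIKERT_MAX = 1, 5


def summarize_record(raw: str) -> dict:
    """Parse one feedback record and return its per-criterion ratings and their mean."""
    record = json.loads(raw)
    ratings = {}
    for criterion, entry in record["feedback"].items():
        # Skip unrated fields such as the "suggested_improvements" list.
        if not isinstance(entry, dict) or "rating" not in entry:
            continue
        rating = entry["rating"]
        if not LIKERT_MIN <= rating <= LIKERT_MAX:
            raise ValueError(f"{criterion}: rating {rating} is outside {LIKERT_MIN}-{LIKERT_MAX}")
        ratings[criterion] = rating
    return {
        "task_name": record["task_name"],
        "llm_version": record["llm_version"],
        "ratings": ratings,
        "mean_rating": mean(ratings.values()),
    }


# The starter record from above, abbreviated to the fields the helper reads.
example = """
{
  "task_name": "Drafting a marketing email",
  "llm_version": "v3.1",
  "user_id": "user_abc",
  "feedback": {
    "output_relevance": {"rating": 4, "comment": "Output was mostly relevant, but needed minor tweaks for tone."},
    "ease_of_use": {"rating": 5, "comment": "Prompting was straightforward."},
    "overall_satisfaction": {"rating": 4, "comment": "Good starting point, saved me time."},
    "suggested_improvements": ["Make tone more persuasive by default."]
  }
}
"""

summary = summarize_record(example)
print(summary)
```

Rejecting out-of-range ratings at parse time keeps malformed survey submissions from skewing later averages.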