Repo·github.com
evaluation, machine-learning, llm, ai-agents, python, humaneval
HumanEval+
Use HumanEval+ to evaluate AI code generation models more rigorously. This extension of the HumanEval benchmark adds many more test cases per problem (roughly 80x, according to the EvalPlus project), so it catches solutions that pass the original tests but fail on edge cases.
intermediate · 15 min · 4 steps
The play
- Install EvalPlus: Install the EvalPlus library using pip. This package provides the extended HumanEval benchmark and its evaluation tools.
- Prepare Model Code Samples: Format your LLM's generated code outputs as a JSONL file. Each line should be one JSON object with a 'task_id' (e.g., 'HumanEval/0') and a 'completion' (the generated code string).
- Run Evaluation: Run EvalPlus against the prepared samples file, selecting the HumanEval dataset.
- Analyze Results: Review the evaluation report generated by EvalPlus. It includes pass@k metrics and identifies specific test case failures, which shows how well your model's solutions hold up beyond the original tests.
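The sample format described in the second step can be sketched in Python as follows. This is a minimal illustration, not EvalPlus's own tooling: the helper name `write_samples` and the placeholder completions are assumptions made for the example, and only the 'task_id' and 'completion' fields come from the steps above.

```python
import json
import tempfile
from pathlib import Path


def write_samples(completions, path):
    """Write (task_id, completion) pairs as JSONL, one JSON object per line."""
    path = Path(path)
    with path.open("w", encoding="utf-8") as fh:
        for task_id, completion in completions:
            record = {"task_id": task_id, "completion": completion}
            # json.dumps escapes embedded newlines, so each record stays on one line.
            fh.write(json.dumps(record) + "\n")
    return path


# Hypothetical model outputs, for illustration only (not correct solutions).
demo_completions = [
    ("HumanEval/0", "    return False\n"),
    ("HumanEval/1", "    return []\n"),
]

# Write to a temporary directory so the demo leaves the working directory untouched.
demo_path = write_samples(demo_completions, Path(tempfile.mkdtemp()) / "samples.jsonl")

# Read the file back to confirm every line parses as a standalone JSON object.
demo_records = [
    json.loads(line)
    for line in demo_path.read_text(encoding="utf-8").splitlines()
]
print(demo_records[0]["task_id"])
```

In a real run, the list of pairs would come from your model's generations, and the output path would be the samples file you pass to the evaluation command.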
Starter code
# The sample below is a placeholder, not a correct HumanEval/0 solution.
pip install evalplus && \
echo '{"task_id": "HumanEval/0", "completion": "def add(a, b):\n return a + b"}' > samples.jsonl && \
evalplus.evaluate --dataset humaneval --samples samples.jsonl
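To help read the pass@k numbers in the report, here is a small sketch of the standard unbiased pass@k estimator introduced with the original HumanEval benchmark. It assumes that EvalPlus reports this same quantity; the function name `pass_at_k` is chosen for this example and is not an EvalPlus API. For one task, n is the number of generated samples, c is the number that passed all tests, and k is the sampling budget.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate the chance that at least one of k samples passes.

    n: total samples generated for the task
    c: samples that passed every test
    k: number of samples drawn (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples exist, so any draw of k includes a pass.
        return 1.0
    # 1 minus the probability that all k drawn samples are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 10 samples for a task, 3 of them correct.
print(round(pass_at_k(10, 3, 1), 4))
print(round(pass_at_k(10, 3, 5), 4))
```

The benchmark-level score is the mean of this per-task value over all tasks. Comparing the score on the original tests with the score on the extended tests shows how many solutions only passed because the original tests were sparse.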