Repo·github.com
evaluation · machine-learning · llm · ai-agents · python · humaneval

HumanEval+

Use HumanEval+ to evaluate AI code generation models more rigorously than the original HumanEval allows. The extended benchmark keeps the same programming tasks but adds many more test cases per task, so solutions that pass only the original, sparser tests are caught.

intermediate · 15 min · 4 steps
The play
  1. Install EvalPlus
    Install the EvalPlus library using pip. This package provides the extended HumanEval benchmark and its evaluation tools.
  2. Prepare Model Code Samples
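    Format your LLM's generated code outputs into a JSONL file. Each line should be a JSON object with a 'task_id' (e.g., 'HumanEval/0') and a 'completion' (the generated code that continues the task's prompt, as a string). EvalPlus also accepts a 'solution' field holding a self-contained solution in place of 'completion'.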
  3. Run Evaluation
    Execute EvalPlus against your prepared samples file, specifying HumanEval as the dataset to evaluate on.
  4. Analyze Results
    Review the evaluation report generated by EvalPlus. It gives pass@k on both the original HumanEval tests and the extended HumanEval+ tests. A gap between the two scores points to solutions that pass the original tests but fail the added ones, which is a direct signal of how robust your model's code is.
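To make the pass@k numbers in step 4 easier to interpret, here is a minimal sketch of the commonly used unbiased pass@k estimator: for a task with n generated samples of which c pass, it is one minus the probability that k samples drawn without replacement are all failures. EvalPlus computes its own scores, so this is only for intuition. The function names `pass_at_k` and `mean_pass_at_k` and the `{task_id: (n, c)}` input shape are assumptions made for this example.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one task from n samples, c of which passed."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # If fewer than k samples failed, every draw of k samples contains a pass.
    if n - c < k:
        return 1.0
    # 1 - P(all k drawn samples are failures)
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: dict, k: int) -> float:
    """Average the per-task estimate over a {task_id: (n, c)} mapping (hypothetical shape)."""
    return sum(pass_at_k(n, c, k) for n, c in results.values()) / len(results)
```

With a single greedy sample per task (n = 1, k = 1), this reduces to the fraction of tasks whose sample passes all tests.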
Starter code
pip install evalplus && \
printf '%s\n' '{"task_id": "HumanEval/0", "completion": "    return any(abs(a - b) < threshold for i, a in enumerate(numbers) for b in numbers[i + 1:])"}' > samples.jsonl && \
evalplus.evaluate --dataset humaneval --samples samples.jsonl
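The starter command writes a single sample by hand, which is fine for a smoke test. A real run usually needs generations for every task in the benchmark (HumanEval has 164), so a small script is more practical. The sketch below uses only the standard library and the 'task_id' and 'completion' keys described in step 2. The names `write_samples`, `read_samples`, and `EXAMPLE_GENERATIONS` are hypothetical, and the example completion stands in for whatever your model actually produces.

```python
import json
from pathlib import Path


def write_samples(generations, path="samples.jsonl"):
    """Write (task_id, completion) pairs as one JSON object per line."""
    out = Path(path)
    with out.open("w", encoding="utf-8") as f:
        for task_id, completion in generations:
            record = {"task_id": task_id, "completion": completion}
            # json.dumps escapes newlines and quotes inside the code string.
            f.write(json.dumps(record) + "\n")
    return out


def read_samples(path):
    """Load the file back, mainly to sanity-check it before evaluation."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical model output: a function body completing the HumanEval/0 prompt.
EXAMPLE_GENERATIONS = [
    (
        "HumanEval/0",
        "    for i, a in enumerate(numbers):\n"
        "        for b in numbers[i + 1:]:\n"
        "            if abs(a - b) < threshold:\n"
        "                return True\n"
        "    return False\n",
    ),
]
```

Calling `write_samples(EXAMPLE_GENERATIONS)` produces a `samples.jsonl` in the current directory. Letting `json.dumps` handle escaping avoids the quoting problems that come with building JSON strings in the shell.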
Source
HumanEval+ — Action Pack