HumanEval+

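Use HumanEval+ to evaluate AI code generation models more rigorously than the original HumanEval allows. This extended benchmark, maintained by the EvalPlus project, keeps the same 164 programming tasks but adds many more test cases per task, so it catches solutions that pass the original tests while still mishandling edge cases.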

Intermediate · 15 min · 4 steps

The play

Install EvalPlus
Install the EvalPlus library using pip. This package provides the extended HumanEval benchmark and its evaluation tools.
Prepare Model Code Samples
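Write your LLM's generated code to a JSONL file, one JSON object per line. Each object needs a 'task_id' (e.g., 'HumanEval/0') and either a 'completion' (the generated code that continues the task prompt, without repeating the prompt) or a 'solution' (the complete, self-contained code, prompt included).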
Run Evaluation
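Run evalplus.evaluate on the samples file from the previous step: pass its path with --samples and select the benchmark with --dataset humaneval. EvalPlus then executes every sample against both the original HumanEval tests and the extended HumanEval+ tests.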
Analyze Results
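Review the results EvalPlus prints when the run finishes. It reports pass@k twice: once for the original HumanEval tests (base) and once for the base tests plus the extra HumanEval+ tests. A drop between the two scores points to solutions that pass the original tests but fail on the added cases, which is the clearest signal of how robust your model's code is. Per-task results are also saved to a JSON file so you can see which tasks failed.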
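
To make the pass@k metric from the last step concrete, here is a minimal sketch of the standard unbiased estimator, 1 - C(n - c, k) / C(n, k), where n is the number of samples generated for a task and c is the number that pass. The function name and the counts below are illustrative assumptions, not EvalPlus's own code or real results; the benchmark-level score is the mean of this value over all tasks.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task: 1 - C(n - c, k) / C(n, k)."""
    if not 0 < k <= n:
        raise ValueError("k must satisfy 0 < k <= n")
    if n - c < k:
        # Fewer than k failing samples: every draw of k contains a passing one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical counts for a single task: 10 samples, of which
# 3 pass the base tests and only 1 also passes the extra tests.
base_score = pass_at_k(n=10, c=3, k=1)
plus_score = pass_at_k(n=10, c=1, k=1)
print(f"pass@1 base={base_score:.2f} base+extra={plus_score:.2f}")
```

With these made-up counts the score falls once the extra tests are included, which is the kind of gap to look for in the report.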

Starter code

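# Single-sample file to show the format; a real run should include a sample for every HumanEval task.
pip install evalplus && \
printf '%s\n' '{"task_id": "HumanEval/0", "solution": "from typing import List\n\ndef has_close_elements(numbers: List[float], threshold: float) -> bool:\n    return any(abs(a - b) < threshold for i, a in enumerate(numbers) for b in numbers[i + 1:])\n"}' > samples.jsonl && \
evalplus.evaluate --dataset humaneval --samples samples.jsonl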
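
The starter code writes a single sample by hand. For a full run it is usually easier to build the file in Python. The sketch below uses only the standard library and assumes a hypothetical generate_solution function standing in for your model call. It writes the 'solution' key, which EvalPlus accepts for complete, self-contained code as an alternative to 'completion'. The prompt shown is illustrative; in practice the task ids and prompts would come from the benchmark itself (EvalPlus provides a loader, evalplus.data.get_human_eval_plus, for this).

```python
import json
from pathlib import Path


def generate_solution(prompt: str) -> str:
    """Hypothetical stand-in for a model call; returns complete code for the task."""
    return prompt + "    pass\n"


def write_samples(prompts: dict[str, str], path: str) -> int:
    """Write one JSON object per line with 'task_id' and 'solution' keys."""
    with Path(path).open("w", encoding="utf-8") as f:
        for task_id, prompt in prompts.items():
            record = {"task_id": task_id, "solution": generate_solution(prompt)}
            # json.dumps escapes newlines, so each record stays on one line.
            f.write(json.dumps(record) + "\n")
    return len(prompts)


# Illustrative prompt only; real prompts come from the benchmark dataset.
demo_prompts = {"HumanEval/0": "def has_close_elements(numbers, threshold):\n"}
count = write_samples(demo_prompts, "samples.jsonl")
print(f"wrote {count} sample(s) to samples.jsonl")
```

Letting json.dumps handle the escaping avoids the quoting problems that come with embedding multi-line code in a shell string.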

Source

Repo: github.com