Repo·github.com
code · evaluation · testing · code-generation · llm-evaluation · benchmark · python

HumanEval+

Evaluate your code generation models using HumanEval+, an extended version of OpenAI's HumanEval benchmark. This Action Pack guides you through setting up the benchmark, generating code solutions, and running the enhanced evaluation with additional test cases.

intermediate · 30 min · 5 steps
The play
  1. Clone the HumanEval+ Repository
    Obtain the HumanEval+ benchmark by cloning its official GitHub repository to your local machine.
  2. Set Up Your Environment
    Navigate into the cloned directory and install the required Python dependencies to prepare your evaluation environment.
  3. Generate Code Completions
    Run your code-generation LLM on each problem prompt defined in HumanEval+ and save the resulting completions in the format the evaluation script expects (e.g., JSONL).
  4. Run the Evaluation Script
    Execute the HumanEval+ evaluation script against your generated code completions. The script runs each completion against both the original HumanEval test cases and the extended ones.
  5. Analyze Evaluation Results
    Review the output from the evaluation script, focusing on pass@k metrics and detailed results for both original and additional test cases to understand your model's performance.
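When reading the pass@k numbers in step 5, it helps to know how the metric is typically computed. The sketch below uses the unbiased estimator from the original HumanEval paper: with n samples per problem, of which c pass, pass@k = 1 - C(n-c, k) / C(n, k). Whether the HumanEval+ script applies exactly this formula is an assumption, and the function names and example counts here are illustrative only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples with c passing.

    This is the unbiased estimator from the original HumanEval paper;
    that the HumanEval+ script uses the same one is an assumption.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every draw of k contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(counts, k: int) -> float:
    """Average pass@k over problems; counts is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in counts) / len(counts)


# Made-up counts for two problems, scored against the original tests
# and against the extended tests (the latter can only pass fewer samples).
base_counts = [(10, 6), (10, 10)]
plus_counts = [(10, 3), (10, 10)]

base_score = mean_pass_at_k(base_counts, k=1)
plus_score = mean_pass_at_k(plus_counts, k=1)
```

A gap between the two scores points to completions that pass the original tests but fail the additional ones, which is the signal the extended benchmark is designed to expose.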
Starter code
git clone https://github.com/HumanEvalPlus/HumanEvalPlus.git
cd HumanEvalPlus
pip install -e .
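Once the install step has finished, step 3 needs a file of completions. The following is a minimal sketch of writing and reading such a file as JSONL (one JSON object per line). The field names `task_id` and `completion` follow the convention of the original HumanEval tooling and are an assumption here, so check the repository's documentation for the exact schema. `generate_completion` is a hypothetical placeholder for your own model call, and the single toy problem stands in for the real benchmark prompts.

```python
import json
import tempfile
from pathlib import Path


def generate_completion(prompt: str) -> str:
    """Hypothetical stand-in for a model call; replace with a real LLM request."""
    return prompt + "    pass\n"


def write_jsonl(records, path):
    """Write one JSON object per line."""
    with Path(path).open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def read_jsonl(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Toy problem set; the real prompts come from the benchmark itself.
problems = {"HumanEval/0": "def add(a, b):\n"}

# Field names are assumed, not confirmed against the evaluation script.
samples = [
    {"task_id": task_id, "completion": generate_completion(prompt)}
    for task_id, prompt in problems.items()
]

# Round-trip through a temporary file to check the output is valid JSONL.
with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "samples.jsonl"
    write_jsonl(samples, out_path)
    loaded = read_jsonl(out_path)
```

Reading the file back before running the evaluation is a cheap way to catch malformed lines early.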
Source
HumanEval+ — Action Pack