code · evaluation · testing · code-generation · llm-evaluation · benchmark · python
HumanEval+
Evaluate your code generation models using HumanEval+, an extended version of OpenAI's HumanEval benchmark. This Action Pack guides you through setting up the benchmark, generating code solutions, and running the enhanced evaluation with additional test cases.
intermediate · 30 min · 5 steps
The play
- Clone the HumanEval+ Repository: Obtain the HumanEval+ benchmark by cloning its official GitHub repository to your local machine.
- Set Up Your Environment: Navigate into the cloned directory and install the required Python dependencies to prepare your evaluation environment.
- Generate Code Completions: Integrate your code generation LLM to produce solutions for the problems defined in HumanEval+. Save these completions in the expected format (e.g., JSONL) for evaluation.
- Run the Evaluation Script: Execute the HumanEval+ evaluation script against your generated code completions. This script runs both the original and the extended test cases.
- Analyze Evaluation Results: Review the output from the evaluation script, focusing on pass@k metrics and the detailed results for both original and additional test cases to understand your model's performance.
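The pass@k metric in the last step is commonly computed with the unbiased estimator from the original HumanEval paper: if n samples are drawn for a problem and c of them pass, then pass@k = 1 - C(n-c, k) / C(n, k). The sketch below is a standalone illustration of that formula, not the repository's own implementation; the function names `pass_at_k` and `mean_pass_at_k` are made up for this example.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples drawn, c of them pass."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # With fewer than k failing samples, every k-subset contains a passing one.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: dict[str, tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; results maps a task id to (n, c)."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results.values()) / len(results)
```

To compare the original and extended test suites, one option is to call `mean_pass_at_k` twice, once with the pass counts from each suite. A sample that passes the original tests but fails the additional ones lowers only the second number.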
Starter code
git clone https://github.com/HumanEvalPlus/HumanEvalPlus.git
cd HumanEvalPlus
pip install -e .
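Once the package is installed, the "Generate Code Completions" step amounts to writing one JSON record per problem to a JSONL file. The following is a minimal sketch under stated assumptions: the field names `task_id` and `solution`, the helper `write_samples`, and the stub `generate_solution` are illustrative and not confirmed by the source, so check the repository's README for the exact schema the evaluation script expects.

```python
import json
from pathlib import Path


def generate_solution(prompt: str) -> str:
    """Hypothetical stand-in for a call to your code generation model."""
    return prompt + "    pass\n"


def write_samples(problems: dict[str, str], out_path: str) -> int:
    """Write one JSON object per line (JSONL); return the number of records.

    problems maps a task id to its prompt. The record keys used here
    ("task_id", "solution") are assumptions, not a confirmed schema.
    """
    count = 0
    with Path(out_path).open("w", encoding="utf-8") as f:
        for task_id, prompt in problems.items():
            record = {"task_id": task_id, "solution": generate_solution(prompt)}
            f.write(json.dumps(record) + "\n")
            count += 1
    return count
```

For example, `write_samples({"Example/0": "def add(a, b):\n"}, "samples.jsonl")` writes a single-line file. In practice the stub would be replaced with a real model call and the prompts would be loaded from the benchmark itself.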