HumanEval

<aside> 💡

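How to use the benchmarks on the list
- Whether the benchmark is downloaded and used directly, or run through specific code
- If it is downloaded, where to get it and how to run it </aside>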

HumanEval benchmark overview

HumanEval measures functional correctness when synthesizing programs from docstrings, and is used to evaluate LLMs trained on code.

<aside> 🔎

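This work introduces Codex, a GPT language model fine-tuned on publicly available code from GitHub. The model can write Python code, and a distinct production version of it powers GitHub Copilot.

To measure the ability to synthesize functionally correct programs from docstrings, the authors developed a new evaluation set called HumanEval. Codex solves 28.8% of its problems, compared with 0% for GPT-3 and 11.4% for GPT-J.

They also found that repeatedly sampling from the model is a surprisingly effective way to produce working solutions to difficult prompts. With 100 samples per problem, this approach solves 70.2% of the problems.

Close examination of the model revealed limitations, including difficulty with docstrings that describe long chains of operations and with binding operations to variables.

Finally, the paper discusses the potential impact of deploying powerful code generation technology in terms of safety, security, and economics.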

We introduce Codex, a GPT language model finetuned on publicly available code from GitHub, and study its Python code-writing capabilities. A distinct production version of Codex powers GitHub Copilot. On HumanEval, a new evaluation set we release to measure functional correctness for synthesizing programs from docstrings, our model solves 28.8% of the problems, while GPT-3 solves 0% and GPT-J solves 11.4%. Furthermore, we find that repeated sampling from the model is a surprisingly effective strategy for producing working solutions to difficult prompts. Using this method, we solve 70.2% of our problems with 100 samples per problem. Careful investigation of our model reveals its limitations, including difficulty with docstrings describing long chains of operations and with binding operations to variables. Finally, we discuss the potential broader impacts of deploying powerful code generation technologies, covering safety, security, and economics.

</aside>

Paper: Evaluating Large Language Models Trained on Code

How to use HumanEval

Running specific code

https://github.com/openai/human-eval

$ git clone https://github.com/openai/human-eval
$ pip install -e human-eval

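# How to evaluate: generate completions and save them as JSON Lines
# generate_one_completion is not provided by human_eval; supply your own model call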
from human_eval.data import write_jsonl, read_problems

problems = read_problems()

num_samples_per_task = 200
samples = [
    dict(task_id=task_id, completion=generate_one_completion(problems[task_id]["prompt"]))
    for task_id in problems
    for _ in range(num_samples_per_task)
]
write_jsonl("samples.jsonl", samples)
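
The snippet above calls `generate_one_completion`, which stands for whatever model call you use. The sketch below shows the shape of the records that end up in `samples.jsonl`. It uses only the standard library and assumes a hypothetical placeholder generator and a toy one-problem dictionary in place of `read_problems()`; neither comes from the library.

```python
import json


def generate_one_completion(prompt: str) -> str:
    """Hypothetical stand-in for a model call; returns a fixed function body."""
    return "    return 1\n"


# Toy stand-in for read_problems(): maps task_id -> problem fields.
problems = {"test/0": {"prompt": "def return1():\n"}}

num_samples_per_task = 2
samples = [
    {
        "task_id": task_id,
        "completion": generate_one_completion(problems[task_id]["prompt"]),
    }
    for task_id in problems
    for _ in range(num_samples_per_task)
]

# JSON Lines: one JSON object per line.
jsonl_text = "\n".join(json.dumps(sample) for sample in samples)
print(jsonl_text)
```

Each record pairs a `task_id` with one `completion`; the completion is the text that continues the prompt, not the whole function.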

# Evaluate the samples
$ evaluate_functional_correctness samples.jsonl
Reading samples...
32800it [00:01, 23787.50it/s]
Running test suites...
100%|...| 32800/32800 [16:11<00:00, 33.76it/s]
Writing results to samples.jsonl_results.jsonl...
100%|...| 32800/32800 [00:00<00:00, 42876.84it/s]
{'pass@1': ..., 'pass@10': ..., 'pass@100': ...}
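
The `pass@k` values in the output above follow the unbiased estimator from the Codex paper: with n samples for a task, c of which pass the tests, pass@k = 1 - C(n-c, k) / C(n, k), averaged over tasks. Below is a minimal per-task sketch using only the standard library. The function name `pass_at_k` and the example counts are illustrative, not the harness's own code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one task from n samples, c of which passed."""
    if k > n:
        raise ValueError("k must not exceed the number of samples n")
    # Probability that a random size-k subset of the n samples has no passing sample.
    all_fail = comb(n - c, k) / comb(n, k)
    return 1.0 - all_fail


# Illustrative counts: 200 samples for one task, 10 of them correct.
for k in (1, 10, 100):
    print(f"pass@{k} = {pass_at_k(200, 10, k):.4f}")
```

This is why `num_samples_per_task` has to be at least as large as the largest k you want reported.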

Download

https://huggingface.co/datasets/openai/openai_humaneval

from datasets import load_dataset

ds = load_dataset("openai/openai_humaneval")

Example dataset instance

{
    "task_id": "test/0",
    "prompt": "def return1():\n",
    "canonical_solution": "    return 1",
    "test": "def check(candidate):\n    assert candidate() == 1",
    "entry_point": "return1"
}
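
To see how these fields fit together: the harness builds one program from `prompt`, the model's completion, `test`, and a final `check(entry_point)` call, then executes it. Below is a simplified sketch of that idea using the toy instance above. The helper name `passes` is hypothetical, and the real harness also isolates execution and enforces a timeout, which this sketch does not.

```python
def passes(problem: dict, completion: str) -> bool:
    """Return True if prompt + completion satisfies the problem's test code."""
    program = (
        problem["prompt"] + completion + "\n"
        + problem["test"] + "\n"
        + f"check({problem['entry_point']})\n"
    )
    try:
        # WARNING: exec runs arbitrary code; use only on trusted input or in a sandbox.
        exec(program, {})
        return True
    except Exception:
        return False


problem = {
    "task_id": "test/0",
    "prompt": "def return1():\n",
    "canonical_solution": "    return 1",
    "test": "def check(candidate):\n    assert candidate() == 1",
    "entry_point": "return1",
}

print(passes(problem, problem["canonical_solution"]))  # canonical solution
print(passes(problem, "    return 2"))                 # deliberately wrong completion
```

The canonical solution passes the check and the deliberately wrong completion fails it. This pass or fail outcome per sample is what "functional correctness" means in this benchmark.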
