Describe the bug
Scoring is much slower than with the livecodebench (LCB) repo.
To Reproduce
Compared to the LCB repo, scoring takes significantly longer and appears to be single-threaded.
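To narrow down where the time goes, a single `codegen_metrics` call can be timed in isolation. The sketch below is a minimal standalone reproduction; the import path and the toy sample values are assumptions, while the sample/generation layout mirrors the `codegen_metric` excerpt under "Expected behavior" below.

```python
# Standalone timing sketch; the import path is an assumption about where
# lighteval vendors the LiveCodeBench evaluation code.
import json
import time

from lighteval.tasks.extended.lcb.codegen_metrics import codegen_metrics

# One toy stdin/stdout problem that just echoes its input (values are made up;
# the dict layout mirrors the codegen_metric excerpt below).
sample = {
    "inputs": ["hello\n"] * 8,
    "outputs": ["hello\n"] * 8,
    "fn_name": None,  # stdin/stdout task, no function entry point
}
evaluation_samples = [{"input_output": json.dumps(sample)}]
generations = [["import sys\nprint(sys.stdin.readline().strip())"]]

start = time.perf_counter()
metrics, _ = codegen_metrics(
    evaluation_samples,
    generations,
    k_list=[1],
    num_process_evaluate=64,
)
print(f"pass@1={metrics['pass@1']}  wall_time={time.perf_counter() - start:.2f}s")
```

Running this while watching CPU usage (e.g. in `htop`) should show whether more than one worker process ever becomes busy.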
Expected behavior
Scoring time should match the LCB repo when the same number of processes is used. I have tried setting num_process_evaluate in codegen_metric:
```python
# codegen_metric from lighteval's livecodebench task, with num_process_evaluate set to 64
def codegen_metric(predictions: list[str], formatted_doc: Doc, **kwargs) -> float:
    """Estimates the Pass@1 metric for the code generation task.
    Extracts the code from each prediction, runs it for each sample and generation,
    and computes Pass@1 over the outputs.
    """
    # Extract generated code snippets
    generated_code_snippets = [[extract_code(pred) for pred in predictions]]  # noqa: F841
    evaluation_sample = {  # noqa: F841
        "inputs": formatted_doc.specific["inputs"],
        "outputs": formatted_doc.specific["outputs"],
        "fn_name": formatted_doc.specific["fn_name"],
    }
    # Wrap in a single-element batch: codegen_metrics expects a list of samples
    evaluation_sample = [{"input_output": json.dumps(evaluation_sample)}]
    metrics, _ = codegen_metrics(
        evaluation_sample,
        generated_code_snippets,
        k_list=[1],  # Only run for Pass@1
        num_process_evaluate=64,
    )
    return metrics["pass@1"]
```
but saw no improvement in scoring time.
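One possible explanation (an assumption about the call pattern, not verified here): codegen_metric is invoked once per document, so evaluation_sample always contains a single entry, and if codegen_metrics parallelizes across samples, num_process_evaluate=64 still leaves at most one worker busy per call. Below is a hedged sketch of what batching all documents into one call could look like; batched_codegen_pass_at_1 is a hypothetical helper, not an existing lighteval API.

```python
# Hypothetical batched scoring sketch -- not lighteval's current API. If
# codegen_metrics parallelizes across samples, passing every document in one
# call should keep num_process_evaluate workers busy.
import json

from lighteval.tasks.extended.lcb.codegen_metrics import codegen_metrics  # path assumed, as above


def batched_codegen_pass_at_1(docs, code_per_doc, num_process_evaluate=64):
    """docs: list of formatted_doc.specific dicts; code_per_doc: one list of
    extracted code snippets per document, aligned with docs."""
    evaluation_samples = [
        {
            "input_output": json.dumps(
                {
                    "inputs": doc["inputs"],
                    "outputs": doc["outputs"],
                    "fn_name": doc["fn_name"],
                }
            )
        }
        for doc in docs
    ]
    metrics, _ = codegen_metrics(  # same call signature as in the excerpt above
        evaluation_samples,
        code_per_doc,
        k_list=[1],
        num_process_evaluate=num_process_evaluate,
    )
    return metrics["pass@1"]
```

This is roughly how the LCB repo drives its evaluation in one batched call, which may be why the same num_process_evaluate value scales there.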
Version info
Please provide your operating system, lighteval version or commit if you installed from main, and pip/conda environment if your problem concerns dependencies.
latest master
```bash
conda create -n eval python=3.11
pip install vllm==0.7.2
pip install git+https://github.com/huggingface/lighteval.git#egg=lighteval[extended_tasks] math-verify==0.5.2
```