Official implementation of ScoreFlow. Affiliations: University of Chicago, Princeton University, University of Oxford.
ScoreFlow is an automated multi-agent workflow generation and optimization method, which leverages efficient gradient-based optimization in a continuous space. ScoreFlow incorporates Score-DPO, a novel variant of the direct preference optimization method that accounts for quantitative feedback. Across six benchmarks spanning question answering, coding, and mathematical reasoning, ScoreFlow achieves an 8.2% improvement over existing baselines. Moreover, it empowers smaller models to outperform larger ones with lower inference costs.
- You need approximately 80-90GB VRAM. You can modify your GPU settings in ./config/config1.yaml.
- Set up your API key in ./config/config2.yaml. If your rate per minute is less than 1000, we suggest lowering the
- Download the dataset and unzip it in this directory (./).
- To set up the environment, follow these steps:
conda create -n scoreflow python=3.10
source activate scoreflow
# Install MetaGPT locally to resolve any conflicts
unzip metagpt_local.zip
cd metagpt_local
pip install .
cd ..
pip install -r requirements.txt
You can ignore the dependency error of metagpt 1.0.0.
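Before starting a run, it can help to confirm that the machine is in the range of the 80-90GB VRAM estimate above. The following is a minimal sketch, not part of the repository: it assumes the NVIDIA driver's nvidia-smi tool is on the PATH, and it assumes that summing memory across all visible GPUs is the relevant figure, which depends on the GPU settings in ./config/config1.yaml. The helper names are hypothetical.

```python
import shutil
import subprocess


def parse_total_vram_mib(nvidia_smi_output: str) -> int:
    """Sum per-GPU memory in MiB from nvidia-smi CSV output (one integer per line)."""
    return sum(
        int(line.strip())
        for line in nvidia_smi_output.splitlines()
        if line.strip()
    )


def total_vram_gib() -> float:
    """Query nvidia-smi for the total memory of all visible GPUs, in GiB."""
    if shutil.which("nvidia-smi") is None:
        raise RuntimeError("nvidia-smi not found; is the NVIDIA driver installed?")
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        check=True,
        capture_output=True,
        text=True,
    )
    # With "nounits", nvidia-smi reports memory.total as a bare number in MiB.
    return parse_total_vram_mib(result.stdout) / 1024


def has_enough_vram(required_gib: float = 80.0) -> bool:
    """Compare available VRAM with an assumed threshold (lower end of the estimate)."""
    return total_vram_gib() >= required_gib
```

Calling has_enough_vram() returns False on a machine below the assumed threshold. Treat the result as a rough guide only, since the estimate above is approximate and GB and GiB differ slightly.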
To optimize the model, follow these steps iteratively, starting from i = 0, then i = 1, and so on:
python generate.py --dataset=HumanEval --task=optimize --epoch=i
python evaluate.py --dataset=HumanEval --task=optimize --epoch=i
accelerate launch --num_processes=1 optimize.py --epoch=i
You can change the number of iterations by incrementing i. The available datasets are HumanEval, MBPP, MATH, GSM8K, DROP, and HotpotQA.
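The iterative procedure above can be scripted instead of typed by hand. The sketch below is a hypothetical driver, not a file shipped with the repository: it only assembles the three commands listed above for each value of i and runs them in order, stopping at the first failure. The function names and the num_iterations parameter are assumptions made for illustration.

```python
import subprocess

# Dataset names as listed in this README.
DATASETS = ("HumanEval", "MBPP", "MATH", "GSM8K", "DROP", "HotpotQA")


def optimization_commands(dataset: str, epoch: int) -> list[list[str]]:
    """Return the three commands for one optimization iteration, in order."""
    if dataset not in DATASETS:
        raise ValueError(f"unknown dataset: {dataset!r}")
    return [
        ["python", "generate.py", f"--dataset={dataset}", "--task=optimize", f"--epoch={epoch}"],
        ["python", "evaluate.py", f"--dataset={dataset}", "--task=optimize", f"--epoch={epoch}"],
        ["accelerate", "launch", "--num_processes=1", "optimize.py", f"--epoch={epoch}"],
    ]


def run_optimization(dataset: str, num_iterations: int) -> None:
    """Run iterations i = 0 .. num_iterations - 1 sequentially."""
    for epoch in range(num_iterations):
        for cmd in optimization_commands(dataset, epoch):
            print("Running:", " ".join(cmd))
            # check=True raises CalledProcessError if a step exits with a non-zero code,
            # so later steps never run on top of a failed one.
            subprocess.run(cmd, check=True)
```

For example, run_optimization("HumanEval", num_iterations=3) would run the three steps for i = 0, 1, and 2. The iteration count here is an arbitrary choice for the example.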
To run inference on the test set, execute:
python generate.py --dataset=HumanEval --task=inference --epoch=2
python evaluate.py --dataset=HumanEval --task=inference --epoch=2
- Adjust the number of epochs as needed.
- Ensure all dependencies are installed from requirements.txt.
- The optimization and inference scripts should be run sequentially.
- If evaluate.py gets stuck, try rerunning it.
- We have added the majority-optimal workflow structure for different tasks at this link: examples.
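The advice to rerun evaluate.py when it gets stuck can be automated with a timeout. This is a minimal sketch under stated assumptions: the run_with_retries helper is hypothetical, the timeout value is something you would have to choose for your own hardware and API rate limit, and it assumes that restarting evaluate.py from scratch is safe.

```python
import subprocess


def run_with_retries(cmd: list[str], timeout_s: float, max_attempts: int = 3) -> int:
    """Run cmd, stopping and retrying it whenever it exceeds timeout_s seconds.

    Returns the 1-based number of the attempt that completed successfully.
    Raises RuntimeError if every attempt times out.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            # subprocess.run kills the child process when the timeout expires,
            # then raises TimeoutExpired.
            subprocess.run(cmd, check=True, timeout=timeout_s)
            return attempt
        except subprocess.TimeoutExpired:
            print(f"Attempt {attempt}/{max_attempts} timed out after {timeout_s}s")
    raise RuntimeError(f"command did not finish in {max_attempts} attempts: {cmd}")
```

A call might look like run_with_retries(["python", "evaluate.py", "--dataset=HumanEval", "--task=inference", "--epoch=2"], timeout_s=3600), where the one-hour limit is only a placeholder. A non-zero exit code is not retried and still raises CalledProcessError.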
@article{wang2025scoreflow,
title={ScoreFlow: Mastering LLM Agent Workflows via Score-based Preference Optimization},
author={Wang, Yinjie and Yang, Ling and Li, Guohao and Wang, Mengdi and Aragam, Bryon},
journal={arXiv preprint arXiv:2502.04306},
year={2025}
}