Gen-Verse/ScoreFlow

ScoreFlow: Mastering LLM Agent Workflows via Score-based Preference Optimization

Official implementation of ScoreFlow. Affiliations: University of Chicago, Princeton University, University of Oxford.

Introduction

ScoreFlow is an automated multi-agent workflow generation and optimization method, which leverages efficient gradient-based optimization in a continuous space. ScoreFlow incorporates Score-DPO, a novel variant of the direct preference optimization method that accounts for quantitative feedback. Across six benchmarks spanning question answering, coding, and mathematical reasoning, ScoreFlow achieves an 8.2% improvement over existing baselines. Moreover, it empowers smaller models to outperform larger ones with lower inference costs.
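To make the idea of Score-DPO concrete, here is a minimal, illustrative sketch of a score-weighted DPO objective for a single preference pair. This is an assumption about the general shape of such a loss, not the paper's exact formulation; the function name `score_dpo_loss` and the linear score weighting are hypothetical.

```python
import math

def score_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                   score_w, score_l, beta=0.1):
    """Illustrative score-weighted DPO loss for one preference pair.

    logp_w / logp_l: policy log-probs of the preferred / dispreferred workflow.
    ref_logp_w / ref_logp_l: reference-model log-probs of the same workflows.
    score_w / score_l: quantitative evaluation scores of the two workflows.
    """
    # Standard DPO margin: implicit reward gap between preferred and dispreferred.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Hypothetical score weighting: pairs with a larger evaluation-score gap
    # contribute more to the loss, incorporating quantitative feedback.
    weight = score_w - score_l
    # Negative log-sigmoid of the margin, scaled by the score gap.
    return -weight * math.log(1.0 / (1.0 + math.exp(-margin)))
```

In a real training loop this would be computed over batches of sampled workflow pairs, with scores coming from benchmark evaluation.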

Pipeline of ScoreFlow

Setup

  1. You need approximately 80-90 GB of VRAM. You can adjust the GPU settings in ./config/config1.yaml.
  2. Set your API key in ./config/config2.yaml. If your API rate limit is below 1000 requests per minute, we suggest lowering the request concurrency accordingly.
  3. Download the dataset and unzip it into the repository root (./).
  4. Set up the environment:
conda create -n scoreflow python=3.10
source activate scoreflow
# Install MetaGPT locally to resolve any conflicts
unzip metagpt_local.zip
cd metagpt_local
pip install .
cd ..
pip install -r requirements.txt

You can ignore pip's dependency-resolver error for metagpt 1.0.0.
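As an optional sanity check after setup (not part of the repository), a short script can confirm the expected files are in place. The `check_setup` helper and the default path list are illustrative assumptions based on the steps above.

```python
from pathlib import Path

def check_setup(paths=("config/config1.yaml",
                       "config/config2.yaml",
                       "requirements.txt")):
    """Return the expected files that are missing from the working directory."""
    return [p for p in paths if not Path(p).exists()]

if __name__ == "__main__":
    missing = check_setup()
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("Setup looks complete.")
```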

Optimization Process

To optimize the model, run the following commands iteratively for i = 0, 1, 2, and so on:

python generate.py --dataset=HumanEval --task=optimize --epoch=i
python evaluate.py --dataset=HumanEval --task=optimize --epoch=i
accelerate launch --num_processes=1 optimize.py --epoch=i

Repeat with increasing i for as many iterations as you need. Supported datasets: HumanEval, MBPP, MATH, GSM8K, DROP, and HotpotQA.
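The per-epoch command sequence above can be automated with a small driver script. This is a convenience sketch, not part of the repository; the helper names `epoch_commands` and `run_epochs` are hypothetical, but the commands they build are exactly those listed above.

```python
import subprocess

def epoch_commands(dataset, epoch):
    """Build the three commands for one optimization epoch, as in the README."""
    return [
        ["python", "generate.py", f"--dataset={dataset}", "--task=optimize", f"--epoch={epoch}"],
        ["python", "evaluate.py", f"--dataset={dataset}", "--task=optimize", f"--epoch={epoch}"],
        ["accelerate", "launch", "--num_processes=1", "optimize.py", f"--epoch={epoch}"],
    ]

def run_epochs(dataset="HumanEval", num_epochs=2):
    """Run generate -> evaluate -> optimize sequentially for each epoch."""
    for epoch in range(num_epochs):
        for cmd in epoch_commands(dataset, epoch):
            subprocess.run(cmd, check=True)  # abort on the first failing step
```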

Inference

To run inference on the test set, execute:

python generate.py --dataset=HumanEval --task=inference --epoch=2
python evaluate.py --dataset=HumanEval --task=inference --epoch=2

Notes

  • Adjust the number of epochs as needed.
  • Ensure all dependencies are installed from requirements.txt.
  • The optimization and inference scripts should be run sequentially.
  • If evaluate.py gets stuck, try rerunning it.
  • We provide the majority-optimal workflow structures for different tasks in examples.

Citation

@article{wang2025scoreflow,
  title={ScoreFlow: Mastering LLM Agent Workflows via Score-based Preference Optimization},
  author={Wang, Yinjie and Yang, Ling and Li, Guohao and Wang, Mengdi and Aragam, Bryon},
  journal={arXiv preprint arXiv:2502.04306},
  year={2025}
}
