Gen-Verse/ScoreFlow

ScoreFlow: Mastering LLM Agent Workflows via Score-based Preference Optimization

Official implementation of ScoreFlow. Affiliations: University of Chicago, Princeton University, University of Oxford.

Introduction

ScoreFlow is an automated multi-agent workflow generation and optimization method, which leverages efficient gradient-based optimization in a continuous space. ScoreFlow incorporates Score-DPO, a novel variant of the direct preference optimization method that accounts for quantitative feedback. Across six benchmarks spanning question answering, coding, and mathematical reasoning, ScoreFlow achieves an 8.2% improvement over existing baselines. Moreover, it empowers smaller models to outperform larger ones with lower inference costs.
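To make the idea of Score-DPO concrete, here is a minimal, illustrative sketch of a score-weighted DPO objective for a single preference pair. This is an assumption about the general shape of such a loss, not the paper's exact formulation; the function name `score_dpo_loss` and the linear score weighting are hypothetical.

```python
import math

def score_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                   score_w, score_l, beta=0.1):
    """Illustrative score-weighted DPO loss for one preference pair.

    logp_w / logp_l: policy log-probs of the preferred / dispreferred workflow.
    ref_logp_w / ref_logp_l: reference-model log-probs of the same workflows.
    score_w / score_l: quantitative evaluation scores of the two workflows.
    """
    # Standard DPO margin: implicit reward gap between preferred and dispreferred.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Hypothetical score weighting: pairs with a larger evaluation-score gap
    # contribute more to the loss, incorporating quantitative feedback.
    weight = score_w - score_l
    # Negative log-sigmoid of the margin, scaled by the score gap.
    return -weight * math.log(1.0 / (1.0 + math.exp(-margin)))
```

In a real training loop this would be computed over batches of sampled workflow pairs, with scores coming from benchmark evaluation.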

Pipeline of ScoreFlow

Setup

  1. You need approximately 80-90 GB of VRAM. You can adjust the GPU settings in ./config/config1.yaml.
  2. Set your API key in ./config/config2.yaml. If your API rate limit is below 1000 requests per minute, we suggest lowering the request concurrency accordingly.
  3. Download the dataset and unzip it into the repository root (./).
  4. Set up the environment:
conda create -n scoreflow python=3.10
source activate scoreflow
# Install MetaGPT locally to resolve any conflicts
unzip metagpt_local.zip
cd metagpt_local
pip install .
cd ..
pip install -r requirements.txt

You can ignore pip's dependency-resolver error for metagpt 1.0.0.
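As an optional sanity check after setup (not part of the repository), a short script can confirm the expected files are in place. The `check_setup` helper and the default path list are illustrative assumptions based on the steps above.

```python
from pathlib import Path

def check_setup(paths=("config/config1.yaml",
                       "config/config2.yaml",
                       "requirements.txt")):
    """Return the expected files that are missing from the working directory."""
    return [p for p in paths if not Path(p).exists()]

if __name__ == "__main__":
    missing = check_setup()
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("Setup looks complete.")
```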

Optimization Process

To optimize the model, run the following commands iteratively for i = 0, 1, 2, and so on:

python generate.py --dataset=HumanEval --task=optimize --epoch=i
python evaluate.py --dataset=HumanEval --task=optimize --epoch=i
accelerate launch --num_processes=1 optimize.py --epoch=i

Repeat with increasing i for as many iterations as you need. Supported datasets: HumanEval, MBPP, MATH, GSM8K, DROP, and HotpotQA.
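The per-epoch command sequence above can be automated with a small driver script. This is a convenience sketch, not part of the repository; the helper names `epoch_commands` and `run_epochs` are hypothetical, but the commands they build are exactly those listed above.

```python
import subprocess

def epoch_commands(dataset, epoch):
    """Build the three commands for one optimization epoch, as in the README."""
    return [
        ["python", "generate.py", f"--dataset={dataset}", "--task=optimize", f"--epoch={epoch}"],
        ["python", "evaluate.py", f"--dataset={dataset}", "--task=optimize", f"--epoch={epoch}"],
        ["accelerate", "launch", "--num_processes=1", "optimize.py", f"--epoch={epoch}"],
    ]

def run_epochs(dataset="HumanEval", num_epochs=2):
    """Run generate -> evaluate -> optimize sequentially for each epoch."""
    for epoch in range(num_epochs):
        for cmd in epoch_commands(dataset, epoch):
            subprocess.run(cmd, check=True)  # abort on the first failing step
```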

Inference

To run inference on the test set, execute:

python generate.py --dataset=HumanEval --task=inference --epoch=2
python evaluate.py --dataset=HumanEval --task=inference --epoch=2

Notes

  • Adjust the number of epochs as needed.
  • Ensure all dependencies are installed from requirements.txt.
  • The optimization and inference scripts should be run sequentially.
  • If evaluate.py gets stuck, try rerunning it.
  • We provide the majority-optimal workflow structures for different tasks in examples.

Citation

@article{wang2025scoreflow,
  title={ScoreFlow: Mastering LLM Agent Workflows via Score-based Preference Optimization},
  author={Wang, Yinjie and Yang, Ling and Li, Guohao and Wang, Mengdi and Aragam, Bryon},
  journal={arXiv preprint arXiv:2502.04306},
  year={2025}
}
