Using Multi-task Gaussian Process Regression to maximize average benchmark score across a sequence of LLM training checkpoints
BOMAX is a Python package for Bayesian Optimization with Multi-task Gaussian Process Regression, designed for efficient optimization of expensive-to-evaluate functions across multiple related tasks. It was built with a specific problem in mind, choosing the optimal LLM checkpoint, but the code is application-agnostic.
The problem that inspired BOMAX was optimizing LLM learning curves. A learning curve visualizes the performance of a model during training, showing how performance changes over some time unit (steps/iterations/epochs). In classic ML, performance is usually measured on a small validation set, sometimes called a hold-out set (i.e. 10-20% of the training data is held out). The goal is to capture the model parameters at the peak of this curve. However, if we are training/fine-tuning an LLM for multiple uses, our validation set might actually be a combination of many different benchmark tasks. After all, most LLM leaderboards rank LLMs by their average benchmark score.
Let's say we have a set of 100 model checkpoints that were saved at regular intervals during training/fine-tuning. Suppose that running a single model on a single benchmark takes 1 minute, and we have 100 benchmark tasks. Running all models on all tasks would take 10,000 minutes, or about a week. How could we efficiently estimate the model checkpoint with the highest average benchmark performance without running all model checkpoints through all benchmark tasks?
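To make the cost concrete, here is a sketch of that exhaustive baseline (the `run_benchmark` function is a hypothetical stand-in for whatever evaluation harness you actually use):

```python
import numpy as np

def run_benchmark(checkpoint, task):
    # Hypothetical stand-in for a real evaluation harness (~1 minute per call).
    return np.random.rand()

# Exhaustive baseline: 100 x 100 = 10,000 evaluations, i.e. ~1 week of compute.
scores = np.array([[run_benchmark(i, j) for j in range(100)]
                   for i in range(100)])
best = scores.mean(axis=1).argmax()  # checkpoint with the highest average score
```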
The key is to use Bayesian Optimization with Multi-task Gaussian Process Regression. Bayesian Optimization (BO for short) is a common approach in such scenarios, where the function we are trying to optimize is expensive to compute. It is sometimes called "black-box" optimization, since we treat the function as a black box and only have access to it by querying its value at specific points. The basic approach is to repeatedly query (i.e. evaluate) the function so as to acquire more sample points with which to estimate a regression model, and then use that regression model to optimize the function. Naturally this process is a loop, the main steps being:
1. use the current regression model to decide which point $x_i$ to query next
2. update the regression model using the newly observed function value $y_i = f(x_i)$
Gaussian process regression models are a popular choice for Bayesian optimization, since they provide explicit uncertainty estimates that can be used to guide step 1. The standard method is through an acquisition function, which provides a formula for turning the GP uncertainty estimates into the next query point.
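As a toy illustration of this loop (a single 1-D function, a scikit-learn GP for brevity rather than BoTorch, and an upper-confidence-bound acquisition instead of Expected Improvement; this is a minimal sketch, not BOMAX code):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

f = lambda x: -(x - 0.3) ** 2            # toy "expensive" black-box function
grid = np.linspace(0, 1, 201)[:, None]   # candidate query points

X, y = [[0.0], [1.0]], [f(0.0), f(1.0)]  # two seed observations
for _ in range(10):
    # step 2: update the regression model (alpha adds jitter for stability)
    gp = GaussianProcessRegressor(alpha=1e-6).fit(X, y)
    mu, sd = gp.predict(grid, return_std=True)
    # step 1: acquisition function (UCB) picks the next query point
    x_next = grid[np.argmax(mu + 2.0 * sd)]
    X.append(list(x_next))
    y.append(f(x_next[0]))               # query the black box

gp = GaussianProcessRegressor(alpha=1e-6).fit(X, y)  # final model fit
print("estimated maximizer:", grid[gp.predict(grid).argmax(), 0])
```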
At the core of this software package is a novel formulation of Expected Improvement for optimizing the average of multiple (possibly correlated) functions:
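In outline (the exact expression used by the package is given in the derivation linked below): a multi-task GP yields a jointly Gaussian posterior over the per-task values $f_1(x), \dots, f_T(x)$ at a checkpoint $x$, with means $\mu_t(x)$ and covariance matrix $\Sigma(x)$. Any linear combination of jointly Gaussian variables is itself Gaussian, so the task average has a closed-form posterior

$$\bar{f}(x) \sim \mathcal{N}\big(\bar{\mu}(x),\, \bar{\sigma}^2(x)\big), \qquad \bar{\mu}(x) = \frac{1}{T}\sum_{t=1}^{T} \mu_t(x), \qquad \bar{\sigma}^2(x) = \frac{1}{T^2}\,\mathbf{1}^\top \Sigma(x)\,\mathbf{1},$$

and the classic closed-form Expected Improvement can be applied to it directly:

$$\mathrm{EI}(x) = \bar{\sigma}(x)\,\big[z\,\Phi(z) + \phi(z)\big], \qquad z = \frac{\bar{\mu}(x) - \bar{f}^{*}}{\bar{\sigma}(x)},$$

where $\bar{f}^{*}$ is the best average value found so far and $\Phi$, $\phi$ are the standard normal CDF and PDF.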
See here for the full derivation.
- Novel formulation of Expected Improvement for optimizing an average of multiple "black-box" functions
- Uses the SOTA BoTorch package for GPU-accelerated Bayesian Optimization in PyTorch
- Visualization tools for monitoring optimization progress
- Support for both CPU and GPU execution
You can install BOMAX in development mode:
```bash
# Clone the repository
git clone https://github.com/davidsvaughn/bomax.git
cd bomax

# Install inside a virtual environment (in editable mode)
virtualenv -p python3.10 venv && source venv/bin/activate
pip install -e .
```
Here's a simple example of how to use BOMAX:
```python
import numpy as np
from scipy import ndimage

from bomax.sampler import MultiTaskSampler
from bomax.utils import generate_learning_curves

# Generate synthetic data: 50 checkpoints x 25 tasks
X_feats, Y_curves = generate_learning_curves(50, 25)
num_inputs, num_tasks = Y_curves.shape

# Smooth each learning curve (for comparison to GP regression)
Y_smoothed_curves = np.array([ndimage.gaussian_filter1d(col, sigma=num_inputs / 10)
                              for col in Y_curves.T]).T

# Initialize the sampler
sampler = MultiTaskSampler(*Y_curves.shape,                  # number of checkpoints and tasks
                           func=lambda i, j: Y_curves[i, j]) # black-box function callback

# Seed with initial observations (at least 2 obs/task for numerical stability)
sampler.initialize()

# Run Bayesian optimization loop
for _ in range(50):
    # Fit the GP model to the current observations
    sampler.update()
    # Compare with gold-standard data and generate plot
    sampler.compare(Y_smoothed_curves)
    # Determine next sample coordinates and query black-box function
    sampler.sample()
```
For a more detailed example, see the `examples/demo.py` file.
The package is organized as follows:
- `src/bomax/`: Main package directory
  - `__init__.py`: Package initialization
  - `sampler.py`: MultiTaskSampler class for Bayesian optimization
  - `initialize.py`: Functions for initializing samples
  - `normalize.py`: Data normalization utilities
  - `degree.py`: Degree metric for curve complexity
  - `stopping.py`: Early stopping conditions
  - `utils.py`: Utility functions
- `examples/`: Example scripts
  - `data/`: Example datasets
This project is licensed under the MIT License - see the LICENSE file for details.