Releases: allenai/OLMo-core
v2.1.0
What's new
Added
- Added the 50B Dolmino 11/24 mix.
- Added support for auxiliary-loss-free MoE load balancing, similar to DeepSeek-v3. You can activate this by setting `bias_gamma` to a non-zero float in your `MoERouter` config (see the sketch after this list).
- Added support for a sequence-level MoE load-balancing loss.
- Compatibility with B200s.
- Added support for `warmup_fraction` as an alternative to `warmup_steps` in all schedulers, allowing warmup to be specified as a fraction of total training steps.
- A better config for the 1B model, ported from the old OLMo trainer.
- Added an `auto_resume` option to `CometCallback` for resuming an existing run.
- (BETA) Added methods `load_hf_model` and `save_hf_model` for loading and saving supported OLMo Core models in HF transformers format. Also added lower-level methods for converting state between the two formats.
- Added the ability to run the evaluator callback on `.pre_train()` by setting `eval_on_startup=True`, and to cancel the run after the first round of evals by setting `cancel_after_first_eval=True`.
- Added support for label mask files with numpy FSL datasets.
- Added a `git` configuration to `BeakerLaunchConfig`.
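The two most hands-on additions above are `bias_gamma` and `warmup_fraction`. A minimal sketch of how they might be wired up; the import paths and every field not named in the notes above (e.g. `top_k`) are assumptions, not the exact OLMo-core API:

```python
# Minimal sketch, not the exact OLMo-core API: import paths and fields other
# than `bias_gamma` and `warmup_fraction` are assumptions.
from olmo_core.nn.moe import MoERouterConfig          # assumed import path
from olmo_core.optim.scheduler import CosWithWarmup   # assumed import path

# Auxiliary-loss-free load balancing (DeepSeek-v3 style): a non-zero
# `bias_gamma` on the router config turns on the bias-update mechanism.
router_config = MoERouterConfig(
    top_k=2,           # hypothetical value
    bias_gamma=1e-3,   # non-zero float enables aux-loss-free balancing
)

# Warmup expressed as a fraction of total training steps rather than a step count.
lr_scheduler = CosWithWarmup(warmup_fraction=0.01)
```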
Changed
- `TransformerTrainModuleConfig` can now be used to build a `TransformerPipelineTrainModule` by adding a `pp_config` spec (see the sketch after this list). This makes `TransformerPipelineTrainModuleConfig` redundant, but it will be kept around for backwards compatibility until the next major release.
- Several state dict methods in `TrainModule` now take an `optim` option, which can disable the use of optimizer state.
- Updated `Float8Config` for the latest version of `torchao`.
- Undid a fix applied to `olmo_core.data.numpy_dataset.NumpyFSLDatasetMixture` that was generating a mismatch between the shape of instances in the dataset and the shape of instances in the data loader.
- Made the 1B and 7B scripts more similar to each other.
- Changed the underlying logic and top-level arguments of `convert_checkpoint_from_hf.py` and `convert_checkpoint_to_hf.py`.
- Beaker experiments launched with the `BeakerLaunchConfig` will now log with ANSI colors enabled.
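A rough sketch of the merged config path: `TransformerTrainModuleConfig` and the `pp_config` field come from the note above, while the pipeline-parallel config class, its fields, and the `build()` signature are assumptions:

```python
# Sketch only. A TransformerTrainModuleConfig without `pp_config` builds a
# regular TransformerTrainModule; adding a pipeline-parallel spec makes it
# build a TransformerPipelineTrainModule instead. Names marked "assumed"
# below are not verified against the actual API.
from olmo_core.train.train_module import (  # assumed import path
    TransformerPipelineParallelConfig,      # assumed class name
    TransformerTrainModuleConfig,
)

config = TransformerTrainModuleConfig(
    rank_microbatch_size=2 * 4096,                           # hypothetical value
    pp_config=TransformerPipelineParallelConfig(degree=2),   # assumed fields
)
train_module = config.build(model, device=device)            # signature assumed
```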
Fixed
- Fixed the calculation of total steps based on epochs at the end of a training job.
- Fixed a bug where the trainer might try to save a duplicate final checkpoint when a run that had already completed was restarted.
- When submitting a Beaker job from a branch that's tracking a GitHub fork, OLMo-core now instructs Beaker to pull from the fork instead of from the main repo.
- Made Beaker image resolution more robust.
- Having `t_max` overrides in the default model configs was confusing and error prone, so we removed them.
- The Beaker launcher will now clone only a single branch at runtime when possible, which can be much faster.
Commits
b8070fb (chore) prepare for release v2.1.0
7bc8aa2 remove erroneous license in test file
db91b7f Add a git config to BeakerLaunchConfig (#251)
36b791a [HF Converter] Expect model and optim state in model_and_optim subdirectory (#253)
1f2f6f9 Log with ANSI colors in Beaker (#252)
d0ab790 No more `t_max` (#247)
5653c92 rename `* (unscaled)` metrics to `* unscaled`
60a19c3 clone single branch when possible (#250)
e9a34e8 More MoE updates (#246)
c149b73 Update images for torch 2.7.0 (#249)
6d2bb0a Added 50B Dolmino-1124 mix (#248)
53e67ce Add option to cancel run after first evals (#244)
b493d50 fix in-loop normalization with v2 (#243)
a07ef78 Add a self-contained template train script (#242)
746408e Port the 1B from old OLMo (#234)
4ec0866 Add support for label masks with numpy datasets (#241)
ecb14e0 only resume if name matches (#240)
fc84edc Add option to auto resume Comet experiments (#239)
f5d85a9 OLMo Core to HF conversion refactor (#226)
23c6cb1 clean up logging output from source mixture tests
d502b7e Mapping new ladder to old ladder (#146)
a135883 fix calculation of max steps based on epoch at the end (#236)
2f66fd9 Added warmup_fraction to all schedulers (#235)
be06aa0 B200 compatibility (#232)
0973d4d make beaker image resolution more robust (#233)
78be552 Pick the correct remote (#230)
590138d Temp disables custom read_chunk_from_array in SourceMixture (#231)
082e0b1 Fix bug when restarting a completed run (#229)
6c626f2 Update float8 API for latest torchao (#228)
8919dff Some MoE changes/additions to support auxiliary-loss-free load-balancing (#227)
26e9476 Allow train modules to not load/save optimizer state (#225)
8c20a64 run cuda gc at the end of training
a907892 Merge transformer train module configs (#224)
b47e01c Added 32B stage2 checkpoints .csv (#220)
v2.0.1
What's new
Added
- Added information about the official 32B training run.
- Added automatic support for LL128 when running on Augusta.
Fixed
- The official config for the 32B had unrealistic batch size settings.
- Ignore `group_overrides` for frozen parameters instead of throwing an error.
Removed
- Removed the "fused" cross-entropy loss variant. It had a bug and consistently under-performed the native PyTorch version when compiled. See Post Incident Report: bug with fused CE loss for more information.
Commits
27b1ae8 (chore) prepare for release v2.0.1
79ebc7f Add hybrid MoE transformer architecture (#223)
bce2b5b authenticate with Docker Hub to avoid rate limits
b1e0bbd Remove fused CE loss, reorganize MoE kernels/ops (#221)
56e06ee Ignore `group_overrides` for frozen params (#219)
9d80e8d Update logo for README header. (#218)
974e555 fix some typos, consistent naming
45fe007 Updated documentation (#217)
51aedcf More working config (#216)
47b2ad5 add release PR comments back in
v2.0.0
What's new
This major release introduces a few breaking changes. We've provided more information here: OLMo-core v2 design and upgrade guide.
Added
- Added a `TrainModule` abstraction with a `TransformerTrainModule` implementation, which encapsulates both a model and an optimizer.
- Added a `namespace` argument to `Trainer.record_metric()`.
- Added support for context parallelism.
- Added support for expert parallelism with MoE models.
- Added in-loop evals for Minerva, GSM, HumanEval, and MBPP (`ai2-olmo-eval==0.7.0`).
- Added the `CosWithWarmupAndLinearDecay` learning rate scheduler.
- Added the `WSD` learning rate scheduler.
Changed
- The `Trainer` now takes a `TrainModule` instead of a model and optimizer, and several configuration options have been moved to `TransformerTrainModule`, including `rank_microbatch_size`, `fused_loss`, `compile_loss`, `z_loss_multiplier`, and `autocast_precision` (see the sketch after this list).
- Several `TransformerModelConfig` options have been moved to `TransformerTrainModule` / `TransformerTrainModuleConfig`, including `dp_config`, `tp_config`, `float8_config`, and `compile`.
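In rough terms the v2 wiring looks like the sketch below; the class names are taken from the notes above, but the constructor arguments and `build()` signatures shown are assumptions:

```python
# Sketch of the v2 split between TrainModule and Trainer. Only the overall
# shape follows the release notes; argument names and values are assumptions.
from olmo_core.train import TrainerConfig                              # assumed path
from olmo_core.train.train_module import TransformerTrainModuleConfig  # assumed path

train_module_config = TransformerTrainModuleConfig(
    rank_microbatch_size=2 * 4096,  # moved here from the Trainer in v2
    compile_loss=True,              # moved here from the Trainer in v2
    z_loss_multiplier=1e-5,         # hypothetical value
)
train_module = train_module_config.build(model)  # signature assumed

# The Trainer now receives the train module instead of a (model, optimizer) pair.
trainer = TrainerConfig(save_folder="/tmp/olmo-run").build(train_module, data_loader)  # assumed
trainer.fit()
```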
Removed
- Removed the following callbacks: `MoEHandlerCallback`, `SchedulerCallback`, `MatrixNormalizerCallback`, `GradClipperCallback`, and `Float8HandlerCallback`. The functionality from all of those callbacks has been moved to the `TransformerTrainModule` class.
- Removed the callback methods `.pre_eval_batch()` and `.post_eval_batch()`.
Fixed
- Fixed the model ladder code when training on an MPS or CPU device.
Commits
dfa8f2b (chore) prepare for release v2.0.0
95fb084 add work-around for pytorch/ao#1871 (#205)
3ce0c58 32B Documentation (#210)
41f8ddc Add a public "official" version of our 32B train script (#214)
7e58d12 Update data paths in example to public URLs (#213)
4327bb9 upload data to r2 and updated their paths (#208)
0e6ea23 Assorted improvements (#207)
9ceb1e4 Add CUDA 12.6 images (#209)
eda3afb guard against wrapping MoE modules for AC (#206)
6e5b16f Bump ai2-olmo-eval==0.7.0 (in-loop Minerva, GSM, HumanEval, MBPP) (#204)
eccdc00 Make it easier for external users to run train scripts (#203)
da33f5b fix entrypoint steps
947a293 clean up changelog
725adf3 V2 (#202)
v1.9.0
What's new
Added
- Added an `instance_filter_config` field to `NumpyDatasetConfig`.
- Added a conversion script for OLMo 2 checkpoints to Hugging Face format.
- Added `BeakerCallback`.
- Added logging for in-loop eval throughput.
Fixed
- Ensure certain optimizer param group fields are not overridden by the values in a checkpoint.
- Fixed an issue where non-zero ranks would report partially-reduced values for training metrics.
Commits
41a7dbd (chore) prepare for release v1.9.0
d7301e6 32B scripts (#201)
d55562c Log in-loop eval throughput (#200)
260dafd Add support for BF16 optim state in `SkipStepAdamW` (#148)
e522437 fix inferring sequence length
0bef5aa allow dynamic batch sizes (#170)
fa11a40 Port over instance filtering from old codebase (#157)
8ef038a update formatting of bucket distribution
c9ca78a Add a `BeakerCallback` (#177)
e1cd8f6 use effective sequence length
32cb0fa Conversion script for OLMo 2 models trained with OLMo core to HuggingFace (#158)
feb57eb all-reduce train metrics (#166)
2b43d59 reset initial LR to configured value after loading (#163)
2902a9c Improve `Config.from_dict` (#156)
b4cee6d ignore class name field when config from dict
c1d1a53 update DTensor imports to use public module (#153)
4594231 activate virtual env before running script
v1.8.0
What's new
Added
- Added support for tensor parallelism. See the `TransformerConfig` class for usage.
- Added more downstream tasks from the model ladder.
- Added an `io.copy_dir()` function.
- Added new LR schedulers: `LinearWithWarmup`, `InvSqrtWithWarmup`, `ConstantWithWarmup`, and `SequentialScheduler` (see the sketch after this list).
- Added an option to pre-download checkpoint files from remote storage before trying to load a checkpoint.
- Added a callback for sending Slack notifications.
- Made the MPS device work on Apple Silicon.
- Added the `SkipStepAdamW` optimizer.
- The trainer can now load model-only checkpoints.
- Added an option to throttle checkpoint uploads to one rank from each node at a time.
- Added support for logging rich `Table` objects as text in source mixture datasets.
- Added an `unshard_strategy` parameter to the `unshard_checkpoint()` function in `olmo_core.distributed.checkpoint`.
- Added the function `load_keys()` to `olmo_core.distributed.checkpoint`.
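For example, the new schedulers might be composed like this; the class names are from the list above, but the constructor fields and the way `SequentialScheduler` chains its children are assumptions:

```python
# Sketch only: scheduler class names come from the release notes; the
# constructor arguments (warmup_steps, alpha_f, schedulers_max) are assumptions.
from olmo_core.optim import (  # assumed import path
    ConstantWithWarmup,
    LinearWithWarmup,
    SequentialScheduler,
)

# e.g. hold the LR constant after warmup, then switch to a linear decay.
lr_scheduler = SequentialScheduler(
    schedulers=[
        ConstantWithWarmup(warmup_steps=2000),
        LinearWithWarmup(warmup_steps=0, alpha_f=0.0),  # hypothetical fields
    ],
    schedulers_max=[50_000],  # hypothetical: steps spent in each scheduler before the last
)
```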
Changed
- Changed storage of shared shard state in sharded checkpoints from smallest shard to lowest rank (normally 0).
- Changed how the trainer handles loading a checkpoint when `load_path` is provided. Now `load_path` is only used if no checkpoint is found in the `save_folder`.
Fixed
- Added the missing `weights_only=False` argument to fix loading train checkpoints with newer versions of PyTorch.
- Fixed a bug where GCS uploads did not retry on transient failures.
- Fixed a bug where source mixture datasets were truncating source files instead of randomly sampling.
- Fixed a bug in source mixture datasets where sampling from small .npy files raised an mmap exception due to 0 instances in the sampled index.
Commits
7899e7c (chore) prepare for release v1.8.0
907b9c5 Send Slack notification on releases (#151)
1ef7851 fix `get_mock_batch()` when training on MPS again
29a468d Fix mixture dataset class (#147)
98ccb67 remove ganymede cluster
205fe90 remove deleted cluster
7ec9114 always make mock batch on CPU
7122b1d save max steps to trainer state (#143)
9a78829 Log elapsed time per eval (#149)
075a36a Make training on the MPS device work (#131)
b4a195b Add more options to the `unshard_checkpoint` function to help scale (#145)
16885ab fix merge list with prefix
7b755c9 minor logging improvement
212108f Add option to throttle checkpoint uploads to one rank from each node at a time (#142)
7633461 pull fixes from 32B branch (#139)
48abe8c checkpoint hot fix (#140)
0c096e2 Handle model-only checkpoints with the trainer
9818232 move release scripts to subfolder (#137)
05ab673 update cluster list (#136)
7ccf726 add pr comments on release
0ff19d7 update citation
7519e0a Change the way `load_path` is handled (#132)
03a597a limit the number of exception lines posted to Slack
c634066 include link to Beaker job with Slack noties
3505660 Make context manager set original state correctly (#126)
9e0992b Add a callback for sending Slack notifications (#125)
6d60464 fix
ee27348 Sync eval changes in OLMo/ladder-1xC to here (#122)
0789479 Add option to pre-download checkpoint to load (#123)
1380f0e add `copy_dir()` io function
5cc704f Add learning rate schedulers (#119)
de5be27 don't check for beaker-py upgrades
b0103f0 Fix loading train state for newer versions of torch
5de774f updates
8474ee8 update docker image tags
d3f6f01 Update PyTorch and other deps in Docker images, change naming scheme of images (#120)
10c4978 Publish Docker images to GHCR (#118)
d6981b3 Add support for tensor parallelism and add OLMo2-26B model config / train script (#117)
aa4d188 Update table formatting
v1.7.0
What's new
Added
- Added a `key_mapping` argument to `olmo_core.distributed.checkpoint.load_model_and_optim_state()` for loading checkpoints with different key names.
- Added a `load_key_mapping` field to the trainer, same idea as the new `key_mapping` argument above.
- Added an implementation of nGPT called `NormalizedTransformer`.
- Added an example showing how to convert a HuggingFace Llama 3.2 checkpoint into the right format for OLMo-core.
- Added an API for scaling RoPE embeddings.
- Added a `ModelLadder` API.
Changed
- The `w_out` and `norm` top-level children of the `Transformer` model are now wrapped together in an `lm_head` module. Training scripts will have backwards compatibility with older checkpoints due to the `load_key_mapping` explained above (see the sketch after this list).
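For standalone checkpoint loading, the same rename can be handled with the new `key_mapping` argument; the function name comes from the notes above, while the exact key strings and the optimizer argument name are assumptions:

```python
# Sketch only: load_model_and_optim_state() and key_mapping are from the
# release notes; the specific old/new key names below are assumptions based
# on the w_out/norm -> lm_head re-parenting described above.
from olmo_core.distributed.checkpoint import load_model_and_optim_state

key_mapping = {
    "w_out.weight": "lm_head.w_out.weight",  # assumed key names
    "norm.weight": "lm_head.norm.weight",    # assumed key names
}

load_model_and_optim_state(
    "/path/to/v1.6-checkpoint",  # hypothetical path
    model,
    optim=optimizer,             # argument name assumed
    key_mapping=key_mapping,
)
```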
Fixed
- (Optimization) Mark model input sizes as dynamic for `torch.compile()` to avoid recompiles during evals or variable-sequence / batch-size training. This doesn't seem to hurt throughput.
- Made HTTPS and GCS IO functions more robust.
- Fixed a bug where we were always getting dolma2-tokenized validation data when generating a config with `DataMix.v3_small_ppl_validation`.
Commits
62d2c9e (chore) prepare for release v1.7.0
cb77039 mark model ladder as a beta feature
08c8073 Adapt conversion script to work with OLMo2 models (#116)
8e716b5 Add model ladder building blocks (#114)
1647f78 Add some more tests for nGPT (#113)
37e0e88 improve docs
d68d47a Make nn configs more flexible (#112)
0bcc840 RoPE scaling, document how to convert HuggingFace checkpoints (#111)
7655a3b Add template variable to ppl validation file manifest (#110)
ca44cf4 Implement nGPT (#108)
c47df7c make IO functions more robust (#109)
4f2c8ef Update README.md
57b38ad Mark model input as dynamically sized (#105)
776e235 remove duplicate script
v1.6.3
What's new
Added
- Added an `olmo_core.distributed.checkpoint.get_checkpoint_metadata()` function.
- (BETA) Added a flag to compile the optimizer step. So far only tested with AdamW; it may not work with other optimizers.
Fixed
- Old ephemeral checkpoints won't be removed until after the latest ephemeral checkpoint is saved successfully.
- Made GCS uploads more robust.
- Fixed single-node training on the Google Augusta cluster.
- `numpy.random.dirichlet()` does not always sum to 1.0, so allow for a small tolerance when validating domain weights.
Commits
9c52bea (chore) prepare for release v1.6.3
ad5e9e5 Upgrade flash-attn to v2.7.0 (#104)
b9e9193 [beta] Enable compiling optimizer step (tested with AdamW) (#103)
fdbb76e Use allclose for comparing sum of small numbers (#102)
3284742 make GCS uploads more robust (#101)
63b3f43 Update isort requirement from <5.13,>=5.12 to >=5.12,<5.14 (#93)
dcbd988 update docs and theme version
6615ba9 Bump actions/download-artifact from 3 to 4 (#100)
2e2b35b Add function to get checkpoint metadata
c0e47cc clean up Dockerfile (#99)
6300bc7 replace printing table with logging table (#98)
e522886 Don't prematurely delete old ephemeral checkpoints (#97)
dea10fd Bump actions/upload-artifact from 3 to 4 (#90)
c2fe2db skip another test when creds missing
3ea9fa2 Bump softprops/action-gh-release from 1 to 2 (#87)
5a5c17f Bump actions/checkout from 3 to 4 (#91)
9c99b9c skip some tests when missing relevant credentials (#96)
53efa8c Bump actions/setup-python from 4 to 5 (#88)
d548d3b Bump actions/cache from 3 to 4 (#86)
ab80395 add depandabot config
v1.6.2
What's new
Added
- Added an option to disable `GarbageCollectorCallback`; not that you'd usually want to do this, but I needed to run an experiment to show how important that callback is.
Fixed
- Fixed a bug where some default callbacks could be added twice if given a different name by the user.
- Fixed a bug where some `Trainer` bookkeeping tasks might not complete before `.fit()` returns.
Commits
2384472 (chore) prepare for release v1.6.2
f721fa1 Ensure all bookkeeping tasks complete (#85)
26a2c63 Some callback improvements (#84)
v1.6.1
What's new
Added
- Added a `retries` field to `BeakerLaunchConfig`.
- Allow running on the Augusta cluster with existing train scripts.
- Added an `olmo_core.utils.logging_configured()` function to check if logging has been configured.
Fixed
- Fixed a potential distributed deadlock bug when training without a separate CPU-only bookkeeping backend.
- Removed some unnecessary host-device syncs in `olmo_core.distributed.utils`.
- Added a `Trainer(Config).async_bookkeeping` field to toggle async bookkeeping.
Commits
cae88f5 (chore) prepare for release v1.6.1
83db5f7 Some fixes/improvements around synchronous bookkeeping operations (#83)
c435c94 increase timeout for CI checks
4a56200 update cluster list (#82)
e27ba74 Update throughput numbers, add `logging_configured()` util function (#81)
bec0a3c Allow running on Augusta cluster (#80)
c7c3a5a Set env vars for Augusta cluster
b9351e2 Add `retries` field to `BeakerLaunchConfig` (#79)
v1.6.0
What's new
Added
- Added an option to compile the trainer's loss function (`Trainer.compile_loss`).
- Added `SourceMixtureDataset` for composing a training mixture based on ratios of source datasets (see the sketch after this list).
- Added `NumpyFSLDatasetMixture` for constructing a `NumpyDatasetBase` from a `SourceMixtureDataset`. Note that this is only supported for FSL datasets.
- Added tests for `SourceMixture*` and `NumpyFSLDatasetMixture`.
- Added a `DownstreamEvaluatorCallbackConfig` class for running in-loop downstream eval via OLMo-in-loop-evals.
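The source mixture pieces might be wired together roughly like this; the general shape (per-source paths plus a target ratio) follows the description above, but the config class and field names other than `SourceMixtureDataset` itself are assumptions for illustration:

```python
# Sketch only: SourceMixtureConfig / SourceMixtureDatasetConfig and their
# fields are hypothetical names chosen to illustrate ratio-based mixing.
from olmo_core.data.source_mixture import (  # assumed import path
    SourceMixtureConfig,
    SourceMixtureDatasetConfig,
)

mixture_config = SourceMixtureDatasetConfig(
    max_tokens=50_000_000_000,  # hypothetical total token budget
    source_configs=[
        SourceMixtureConfig(source_name="web", paths=["s3://bucket/web/*.npy"], target_ratio=0.7),
        SourceMixtureConfig(source_name="code", paths=["s3://bucket/code/*.npy"], target_ratio=0.3),
    ],
    sequence_length=4096,
    seed=42,
)
source_mixture_dataset = mixture_config.build()  # yields a SourceMixtureDataset (assumed)
```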
Changed
- Moved some types into `olmo_core.data.types` to avoid some circular dependencies.
Fixed
- Made GCS client more robust by automatically retrying timeout errors for most operations.
Commits
29e1276 (chore) prepare for release v1.6.0
da39e97 Add note about optional dependencies
81b1249 Missed _bust_index_cache in one spot (#78)
00d34f6 Add option to compile loss function, move logits FP32 casting into loss function (#77)
4928f82 Adds mixing loader for FSL datasets (#70)
ecb0686 Allow stopping the experiment on keyboard int
41400c4 Add Llama 8B config (#76)
282c120 Update Docker build (#75)
55d261e Make GCS client more robust (#74)
3fe59b6 Add a callback for downstream evals, update Docker builds (#73)
ecd523e include release chore commit in release notes