shaochenze/ScoringRulesLM

Language Generation with Strictly Proper Scoring Rules

The official repository for the ICML 2024 paper "Language Generation with Strictly Proper Scoring Rules". Building on the open-source toolkit fairseq-0.12.2, we implement losses based on scoring rules in scoring_rule_loss.py. These losses do not perform as well as the cross-entropy loss when training models from scratch, but they yield substantial improvements when used to fine-tune models pre-trained with cross-entropy.
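As a rough illustration of what "strictly proper" means, the sketch below writes out the textbook forms of the logarithmic, Brier, and spherical scores and checks that the expected score is higher for the true distribution than for an overconfident one. This is plain Python for intuition only: the function names and the toy distributions are assumptions made for this example, and the code is not taken from scoring_rule_loss.py, which may differ in scaling, smoothing, and batching.

```python
import math

def logarithmic_score(probs, target):
    """log p_target; its negation is the usual cross-entropy loss."""
    return math.log(probs[target])

def brier_score(probs, target):
    """2 * p_target - sum_i p_i^2 (one common form of the Brier score)."""
    return 2.0 * probs[target] - sum(p * p for p in probs)

def spherical_score(probs, target):
    """p_target divided by the L2 norm of the predicted distribution."""
    return probs[target] / math.sqrt(sum(p * p for p in probs))

def expected_score(score_fn, predicted, true_dist):
    """Expected score of `predicted` when the target is drawn from `true_dist`."""
    return sum(q * score_fn(predicted, y) for y, q in enumerate(true_dist))

# Toy distributions over a 3-token vocabulary (hypothetical values).
true_dist = [0.7, 0.2, 0.1]
overconfident = [0.9, 0.05, 0.05]

for fn in (logarithmic_score, brier_score, spherical_score):
    honest = expected_score(fn, true_dist, true_dist)
    skewed = expected_score(fn, overconfident, true_dist)
    print(fn.__name__, "honest > overconfident:", honest > skewed)
```

A training loss would be the negative of the chosen score, averaged over target tokens.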

Requirements

  • Python version >= 3.8
  • PyTorch version >= 1.10.0
  • Build fairseq with python setup.py build_ext --inplace

Replicate the TED results

Follow these instructions to replicate the results on the TED dataset. For other datasets, adjust the hyper-parameters according to the guidelines in the paper.

Pre-processing

We use the tokenized TED dataset released by VOLT, which can be downloaded here and pre-processed into subword units by prepare-ted-bilingual.sh.

For convenience, we include the pre-processed TED Fr-En dataset in this repository. Convert it into the fairseq format by running:

TEXT=./data
python preprocess.py --source-lang fr --target-lang en \
        --trainpref $TEXT/fr-en.train \
        --validpref $TEXT/fr-en.valid \
        --testpref $TEXT/fr-en.test \
        --destdir data-bin/ted_fren \
        --joined-dictionary  --workers 16

Training

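Pre-train the Transformer model for 13k steps using the cross-entropy loss (--scoring-rule logarithmic), and then fine-tune with the Brier score for an additional 5k steps. Both commands use the same --save-dir, so the second run resumes from the last pre-training checkpoint, and --max-update 18000 corresponds to 13k + 5k steps. To fine-tune with the Spherical score instead, set --scoring-rule to spherical.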

data_dir=data-bin/ted_fren
save_dir=output/fren_brier

# Pre-train with the logarithmic score for 13k steps
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python train.py $data_dir \
    --scoring-rule logarithmic --dropout 0.3 --fp16  --save-dir $save_dir \
    --arch transformer_wmt_en_de  --share-all-embeddings \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 \
    --lr 0.0007 \
    --weight-decay 0.0 --criterion scoring_rule_loss --score-smoothing 0.1 --max-tokens 4096 --update-freq 1 \
    --no-progress-bar --log-format json --log-interval 100 --save-interval-updates 500 \
    --max-update 13000 --keep-interval-updates 5 --no-epoch-checkpoints

# Fine-tune with the Brier score for 5k more steps (13k -> 18k total updates)
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python train.py $data_dir \
    --scoring-rule brier --dropout 0.3 --fp16  --save-dir $save_dir \
    --arch transformer_wmt_en_de  --share-all-embeddings \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 \
    --lr 0.0007 \
    --weight-decay 0.0 --criterion scoring_rule_loss --score-smoothing 0.1 --max-tokens 4096 --update-freq 1 \
    --no-progress-bar --log-format json --log-interval 100 --save-interval-updates 500 \
    --max-update 18000 --keep-interval-updates 5 --no-epoch-checkpoints

python average_checkpoints.py --inputs $save_dir \
 --num-update-checkpoints 5  --output $save_dir/average-model.pt

For comparison, you may also train a baseline Transformer model with the logarithmic score for all 18k steps, using the same procedure.

data_dir=data-bin/ted_fren
save_dir=output/fren_base

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python train.py $data_dir \
    --scoring-rule logarithmic --dropout 0.3 --fp16  --save-dir $save_dir \
    --arch transformer_wmt_en_de  --share-all-embeddings \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 \
    --lr 0.0007 \
    --weight-decay 0.0 --criterion scoring_rule_loss --score-smoothing 0.1 --max-tokens 4096 --update-freq 1 \
    --no-progress-bar --log-format json --log-interval 100 --save-interval-updates 500 \
    --max-update 18000 --keep-interval-updates 5 --no-epoch-checkpoints

python average_checkpoints.py --inputs $save_dir \
 --num-update-checkpoints 5  --output $save_dir/average-model.pt

The above commands assume 8 GPUs on the machine. With a different number of GPUs, adjust --update-freq so that the effective batch size (number of GPUs × --max-tokens × --update-freq) remains about 32k tokens.
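The batch-size arithmetic can be sketched as follows. With 8 GPUs, --max-tokens 4096 and --update-freq 1, each update sees 8 × 4096 = 32768 tokens. The helper name update_freq_for is hypothetical and not part of fairseq; it simply solves for the --update-freq value that keeps that product constant.

```python
def update_freq_for(num_gpus, max_tokens=4096, target_batch_tokens=8 * 4096):
    """Return the --update-freq keeping num_gpus * max_tokens * update_freq at the target.

    Hypothetical helper: the defaults mirror the 8-GPU commands in this README.
    """
    tokens_per_step = num_gpus * max_tokens
    if target_batch_tokens % tokens_per_step != 0:
        raise ValueError(
            f"{num_gpus} GPUs x {max_tokens} tokens does not divide "
            f"{target_batch_tokens}; adjust --max-tokens as well"
        )
    return target_batch_tokens // tokens_per_step

for gpus in (8, 4, 2, 1):
    print(f"{gpus} GPU(s): --update-freq {update_freq_for(gpus)}")
```

For GPU counts that do not divide the target evenly (for example 3), the helper raises an error rather than guessing; in that case --max-tokens would also need to change.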

Inference

Run the following command for inference.

python generate.py data-bin/ted_fren  --path output/fren_brier/average-model.pt --gen-subset test --beam 5 --batch-size 100 --remove-bpe --lenpen 1 > out
# fairseq's output is unordered, so restore the original sentence order
grep ^H out | cut -f1,3- | cut -c3- | sort -k1n | cut -f2- > pred.en
sed -r 's/(@@ )|(@@ ?$)//g' data/fr-en.test.en > ref.en
perl multi-bleu.perl ref.en < pred.en

Expected BLEU scores are ~40.8 for the Transformer baseline and ~41.4 for models fine-tuned with Brier/Spherical scores.
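For readers less familiar with the grep/cut/sort pipeline above, here is a minimal Python sketch of the same reordering step. It assumes hypothesis lines of the form H-<id>, score, text separated by tabs, which is what the pipeline relies on. The function name recover_order and the sample lines are made up for illustration, and the sketch matches the prefix "H-" rather than every line starting with "H".

```python
def recover_order(lines):
    """Return hypothesis texts sorted by sentence id.

    Mirrors: grep ^H | cut -f1,3- | cut -c3- | sort -k1n | cut -f2-
    """
    hyps = []
    for line in lines:
        if not line.startswith("H-"):
            continue  # skip source (S-), target (T-), and other lines
        fields = line.rstrip("\n").split("\t")
        sent_id = int(fields[0][2:])   # drop the "H-" prefix
        text = "\t".join(fields[2:])   # drop the score column
        hyps.append((sent_id, text))
    hyps.sort(key=lambda item: item[0])
    return [text for _, text in hyps]

# Hypothetical sample in the assumed format, deliberately out of order.
sample = [
    "S-1\tbonjour\n",
    "H-1\t-0.3\thello\n",
    "S-0\tbonne journée\n",
    "H-0\t-0.2\tgood day\n",
]
print("\n".join(recover_order(sample)))
```

To apply it to the real output, one could pass open("out") as lines and write the returned list to pred.en, one sentence per line.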

Citation

If you find the resources in this repository useful, please cite as:

@inproceedings{scoringrule,
  title = {Language Generation with Strictly Proper Scoring Rules},
  author= {Chenze Shao and Fandong Meng and Yijin Liu and Jie Zhou},
  booktitle = {Proceedings of ICML 2024},
  year = {2024},
}
