README IN PROGRESS
This repo contains the code for the paper SMCLM: Semantically Meaningful Causal Language Modeling for Autoregressive Paraphrase Generation, by .... We propose an autoregressive paraphrase generation model based on the pre-trained OpenAI GPT-2 model and SentenceTransformers sentence embeddings.
First, create a fresh virtualenv and install the dependencies:
pip install -r requirements.txt
You can also use the `./docker/Dockerfile` file to build a Docker image and run the code in a container.
Then download (or generate) the dataset splits you want to work with:
Download our split of CNN News
The data zip should be unzipped into `./.data/<dataset_name>`, e.g. `./.data/quora`.
You can also generate the data files yourself with the generation script:
python3 ./dataset/generate.py --dataset_name quora --valid_size 1_000 --test_size 10_000 --seed 42 --output_dir <path>
The script creates a `<dataset_name>` folder in the `--output_dir` path (default `./data`), containing `{train,dev,test}.tsv` and `unsupervised.txt` files.
The `{train,dev}.tsv` files are used for supervised training (only to reproduce the results of competing methods) and `test.tsv` for evaluation.
The `unsupervised.txt` file is used for training unsupervised methods and for fine-tuning the BleuMacaw model.
Available dataset names:
- quora
- msrp
- paws (only Wiki)
- mscoco
Each line in the `{train,dev}.tsv` files contains a `\t`-separated (input, paraphrase) pair.
Each line in the `test.tsv` file contains a `\t`-separated (input, [paraphrase0, ..., paraphraseN]) pair.
Each list of reference paraphrases is built according to the rules of the raw data files. We take care to compute metrics such as BLEU or ROUGE using as many reference paraphrases as possible.
The `unsupervised.txt` file simply contains one sentence per line.
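For reference, the sketch below shows one way to load these files; the dataset path and the assumption that `test.tsv` keeps the input in the first column with all reference paraphrases in the remaining tab-separated columns are illustrative, not guaranteed by the repository.

```python
# A minimal loading sketch for the data files described above.
# Assumptions: the dataset path is illustrative, and test.tsv stores the input
# in the first column with the reference paraphrases in the remaining columns.
import csv
from pathlib import Path

data_dir = Path("./.data/quora")

# train.tsv / dev.tsv: one (input, paraphrase) pair per line.
with open(data_dir / "train.tsv", newline="", encoding="utf-8") as f:
    train_pairs = [(row[0], row[1]) for row in csv.reader(f, delimiter="\t")]

# test.tsv: an input sentence followed by its reference paraphrases.
with open(data_dir / "test.tsv", newline="", encoding="utf-8") as f:
    test_examples = [(row[0], row[1:]) for row in csv.reader(f, delimiter="\t")]

# unsupervised.txt: one sentence per line.
with open(data_dir / "unsupervised.txt", encoding="utf-8") as f:
    sentences = [line.strip() for line in f if line.strip()]

print(len(train_pairs), len(test_examples), len(sentences))
```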
To add your own dataset, implement a subclass of the `dataset.abstract.ParaCorpus` class.
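The actual abstract interface is defined in the repository; the sketch below only suggests the general shape of such a subclass, and the `pairs` method name is hypothetical.

```python
# Hypothetical sketch of a ParaCorpus subclass; the real abstract class in
# dataset/abstract.py may declare different (abstract) methods.
from dataset.abstract import ParaCorpus


class MyCorpus(ParaCorpus):
    """Illustrative corpus backed by a raw tab-separated file."""

    def __init__(self, raw_path):
        self.raw_path = raw_path

    def pairs(self):  # hypothetical method name
        # Yield (input, paraphrase) pairs parsed from the raw data file.
        with open(self.raw_path, encoding="utf-8") as f:
            for line in f:
                source, paraphrase = line.rstrip("\n").split("\t")
                yield source, paraphrase
```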
To adapt the BleuMacaw model to sentence-level data, we pre-trained it on the OpenWebText corpus (https://huggingface.co/datasets/openwebtext). OpenWebText is an open-source recreation of the WebText corpus originally used to train GPT-2. The text is web content extracted from URLs shared on Reddit with at least three upvotes (about 40 GB). We used `nltk.sent_tokenize` to extract sentences from the text. To keep the cost manageable, we limit the number of articles taken from the corpus to 1,000,000 (approx. 38,261,000 sentences).
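The preprocessing itself is handled by the training script; purely as an illustration of this step, sentence extraction could look roughly like the following (the article limit mirrors the 1M figure above):

```python
# Rough illustration of the sentence-extraction step, not the repository's
# exact preprocessing: stream OpenWebText and split articles into sentences.
import nltk
from datasets import load_dataset
from nltk import sent_tokenize

nltk.download("punkt")  # tokenizer models required by sent_tokenize

articles = load_dataset("openwebtext", split="train", streaming=True)

ARTICLE_LIMIT = 1_000_000  # mirrors the 1M-article limit described above
sentences = []
for i, article in enumerate(articles):
    if i >= ARTICLE_LIMIT:
        break
    sentences.extend(sent_tokenize(article["text"]))
```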
To pre-train a BleuMacaw model, please use the example command:
python3 ./run_clm.py --sentence_transformer paraphrase_mpnet_base_v2 --data_limit 1_000_000 --epochs 5 --batch_size 32 --checkpoint_steps 100_000 --warmup_steps 100_000
Steps performed by the command:
- Download the OpenWebText data files and extract them into the cache folder (about 40 GB of text data)
- Download the GPT-2 and SentenceTransformers models
- Tokenize the text data and cache it in the cache folder (for 768-length vectors and the 1M-article limit this is approx. 120 GB of data). The cached files are reused by the next training run when the same sentence encoder and article limit are used.
- Start training the model and save checkpoints
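For intuition only, the snippet below sketches one plausible way a SentenceTransformers embedding could condition GPT-2, by prepending it as a single soft prefix via `inputs_embeds`. This is an assumption for illustration; the actual BleuMacaw/SMCLM training logic lives in `run_clm.py` and may differ.

```python
# Illustrative only: one possible way to condition GPT-2 on a sentence
# embedding (prepended as a single "soft" prefix token). The actual
# BleuMacaw/SMCLM training code in run_clm.py may differ.
import torch
from sentence_transformers import SentenceTransformer
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

encoder = SentenceTransformer("paraphrase-mpnet-base-v2")
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Project the 768-d sentence embedding into GPT-2's embedding space.
projection = torch.nn.Linear(768, model.config.n_embd)

sentence = "How can I learn Python fast?"
sent_emb = torch.tensor(encoder.encode([sentence]))      # (1, 768)
prefix = projection(sent_emb).unsqueeze(1)               # (1, 1, n_embd)

input_ids = tokenizer(sentence, return_tensors="pt").input_ids
token_embs = model.transformer.wte(input_ids)            # (1, T, n_embd)
inputs_embeds = torch.cat([prefix, token_embs], dim=1)   # (1, T+1, n_embd)

# Ignore the prefix position in the language-modeling loss.
labels = torch.cat(
    [torch.full((1, 1), -100, dtype=torch.long), input_ids], dim=1
)
loss = model(inputs_embeds=inputs_embeds, labels=labels).loss
```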
To fine-tune the pre-trained BleuMacaw model on a dataset's `unsupervised.txt` file, use the example command:
python3 ./run_clm.py --dataset_name quora --epochs 5 --batch_size 32 --checkpoint_steps 10_000 --warmup_steps 10_000
Hugging Face's GPT-2 implementation is available in five different sizes (https://huggingface.co/transformers/v2.2.0/pretrained_models.html): small (117M), medium (345M), large (774M), xl (1558M), and a distilled version of the small checkpoint, DistilGPT-2 (82M). To reduce the computational cost, we use the small GPT-2 model. Future research should include the bigger versions of the model.
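All of these checkpoints can be loaded by name from the Hugging Face Hub ("gpt2", "gpt2-medium", "gpt2-large", "gpt2-xl", "distilgpt2"); for example:

```python
# Load the small GPT-2 checkpoint used here; swap the name for "gpt2-medium",
# "gpt2-large", "gpt2-xl" or "distilgpt2" to try other sizes.
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")
print(sum(p.numel() for p in model.parameters()))  # rough parameter count
```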
SentenceTransformers encoders (for the --sentence_transformer option; see the quick check below):
- paraphrase-mpnet-base-v2
- all-mpnet-base-v2
- paraphrase-distilroberta-base-v2
- multi-qa-distilbert-dot-v1 (0.2M articles)
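To verify one of these encoder names and inspect its embedding dimensionality (the 768-length vectors mentioned above), a quick check such as the following can be used:

```python
# Quick sanity check of a sentence encoder and its embedding size.
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("paraphrase-mpnet-base-v2")
embeddings = encoder.encode(["How do I learn Python quickly?"])
print(embeddings.shape)  # (1, 768) for the mpnet-based encoders
```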
Evaluation metrics: BLEU, self-BLEU, oriBLEU, ROUGE-{1,2,L}, METEOR, SentenceTransformer similarity (input-generated, best-refs), BERTScore, stanford-grammar.
TODO: <ratio measure: similarity of generated / semantic similarity generated-input>
Metrics are computed on lower-cased text tokenized with `nltk.word_tokenize`; a minimal multi-reference BLEU sketch is shown below.
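The sketch below computes multi-reference BLEU in the spirit of the evaluation described above (lower-cased text, `nltk.word_tokenize`); the repository's own evaluation script may use different smoothing and settings, and the example sentences are hypothetical.

```python
# Minimal multi-reference BLEU sketch (requires the nltk "punkt" resource,
# e.g. nltk.download("punkt")). Not the repository's exact evaluation code.
from nltk import word_tokenize
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

def tokenize(text):
    return word_tokenize(text.lower())

# Hypothetical example in the test.tsv spirit: several reference paraphrases
# for one input, plus one generated paraphrase to score.
references = [[tokenize("How can I learn Python fast?"),
               tokenize("What is the quickest way to learn Python?")]]
hypotheses = [tokenize("How do I learn Python quickly?")]

score = corpus_bleu(references, hypotheses,
                    smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.4f}")
```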