BleuMacaw: GPT-2 meets SentenceTransformers for Paraphrase Generation

[Figure: graphical model diagram for HRQ-VAE]

README IN PROGRESS

This repo contains the code for the paper SMCLM: Semantically Meaningful Causal Language Modeling for Autoregressive Paraphrase Generation, by .... We propose an autoregressive paraphrase generation model based on OpenAI's pre-trained GPT-2 model and SentenceTransformers sentence encodings.

Table of contents

Installation

First, create a fresh virtualenv and install the dependencies:

pip install -r requirements.txt

You can also use ./docker/Dockerfile to build a Docker image and run the code in a container.

Datasets preparation

Then download (or generate) the dataset splits you want to work with:

Download our split of QQP

Download our split of CNN News

Download our split of MSCOCO

Each data zip should be unzipped into ./.data/<dataset_name>, e.g. ./.data/quora.

Datasets generation

You can also generate the data files yourself with the generation script:

python3 ./dataset/generate.py --dataset_name quora --valid_size 1_000 --test_size 10_000 --seed 42 --output_dir <path>

The script generates a <dataset_name> folder in the output directory (default ./data), containing {train,dev,test}.tsv and unsupervised.txt files. The {train,dev}.tsv files are used for supervised training (only to reproduce the results of competitive methods) and test.tsv for evaluation. The unsupervised.txt file is used for training unsupervised methods and for fine-tuning the BleuMacaw model.

Available dataset names:

  1. quora
  2. msrp
  3. paws (only Wiki)
  4. mscoco
The script automatically downloads the raw data files from the Hugging Face Hub and prepares the train, validation, and test splits. It also generates data for unsupervised training.

Files format

Each line in the {train,dev}.tsv files contains a \t-separated (input, paraphrase) pair.

The test.tsv file contains lines with \t-separated pairs (input, [paraphrase0, ..., paraphraseN]). Each list of paraphrases is built according to the rules of the raw data files. We take care to compute metrics such as BLEU or ROUGE using as many reference paraphrases as possible.

The unsupervised.txt file simply contains one sentence per line.
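For reference, a minimal sketch (assuming the tab-separated layouts described above, with the test references stored as additional columns) of reading the generated files with plain Python:

import csv

# {train,dev}.tsv: each row is an (input, paraphrase) pair.
with open("./.data/quora/train.tsv", encoding="utf-8") as f:
    train_pairs = [(row[0], row[1]) for row in csv.reader(f, delimiter="\t")]

# test.tsv: first column is the input, remaining columns are reference paraphrases.
with open("./.data/quora/test.tsv", encoding="utf-8") as f:
    test_items = [(row[0], row[1:]) for row in csv.reader(f, delimiter="\t")]

# unsupervised.txt: one sentence per line.
with open("./.data/quora/unsupervised.txt", encoding="utf-8") as f:
    sentences = [line.strip() for line in f if line.strip()]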

Create own dataset

Implement a subclass of the dataset.abstract.ParaCorpus class.
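For illustration only, a subclass could look roughly like the sketch below; the actual abstract interface lives in ./dataset/abstract.py, and the method name used here (load_pairs) is an assumption, not the real API:

from dataset.abstract import ParaCorpus

class MyParaCorpus(ParaCorpus):
    # Hypothetical override: check ./dataset/abstract.py for the real abstract methods.
    def load_pairs(self):
        # Return (input, paraphrase) pairs used to build the supervised splits.
        return [("How old are you?", "What is your age?")]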

Model training

Pre-training on large text corpus

To adapt the BleuMacaw model to sentence-level data, we pre-trained the model on the OpenWebText corpus (https://huggingface.co/datasets/openwebtext). The OpenWebText corpus is an open-source recreation of the WebText corpus (originally used to train GPT-2). The text is web content extracted from URLs shared on Reddit with at least three upvotes (40GB). We used nltk.sent_tokenize to extract sentences from the text. For efficiency, we limit the number of articles from the corpus to 1,000,000 (approx. 38,261,000 sentences).
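The sentence extraction can be reproduced roughly as in the sketch below (a simplified illustration of the described preprocessing, not the exact training code; the streaming flag and article limit are illustrative):

import nltk
from datasets import load_dataset
from nltk.tokenize import sent_tokenize

nltk.download("punkt")

# Stream OpenWebText and split the first 1,000,000 articles into sentences.
stream = load_dataset("openwebtext", split="train", streaming=True)
sentences = []
for i, article in enumerate(stream):
    if i >= 1_000_000:
        break
    sentences.extend(sent_tokenize(article["text"]))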

To pre-train a BleuMacaw model, use the example command:

python3 ./run_clm.py --sentence_transformer paraphrase_mpnet_base_v2 --data_limit 1_000_000 --epochs 5 --batch_size 32 --checkpoint_steps 100_000 --warmup_steps 100_000

Steps done with the command:

  1. Download the OpenWebText data files and extract them into the cache folder (about 40GB of text data)
  2. Download the GPT-2 and SentenceTransformers models
  3. Tokenize the text data and cache it in the cache folder (for 768-dimensional vectors and the 1M-article limit, approx. 120GB of data). The cached files are reused on subsequent runs of the training script when the same sentence encoder and article limit are used.
  4. Start training the model and save checkpoints

Fine-tuning

Fine-tuning uses the unsupervised.txt file of the chosen dataset:

python3 ./run_clm.py --dataset_name quora --epochs 5 --batch_size 32 --checkpoint_steps 10_000 --warmup_steps 10_000

Choosing GPT-2 and SentenceTransformers models

GPT-2

HuggingFace's GPT-2 implementation is available in five different sizes (https://huggingface.co/transformers/v2.2.0/pretrained_models.html): small (117M), medium (345M), large (774M), xl (1558M), plus a distilled version of the small checkpoint: distilgpt2 (82M). For efficiency, we use the small GPT-2 model. Further research should include the bigger versions of the model.
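For example, the checkpoints can be loaded through the standard transformers identifiers (gpt2, gpt2-medium, gpt2-large, gpt2-xl, distilgpt2); a minimal sketch:

from transformers import GPT2LMHeadModel, GPT2Tokenizer

# The small checkpoint used here; swap the name for a bigger variant.
model_name = "gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)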

SentenceTransformers

  • paraphrase-mpnet-base-v2
  • all-mpnet-base-v2
  • paraphrase-distilroberta-base-v2
  • multi-qa-distilbert-dot-v1 (0.2M articles)
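Each of these encoders can be loaded with the sentence-transformers library as in the sketch below (paraphrase-mpnet-base-v2 produces 768-dimensional embeddings):

from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("paraphrase-mpnet-base-v2")
# Encode a batch of sentences into fixed-size vectors (768 dimensions for this model).
embeddings = encoder.encode(["How old are you?", "What is your age?"])
print(embeddings.shape)  # (2, 768)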

Evaluation

Metrics

BLEU, selfBLEU, oriBLEU, ROUGE-{1,2,L}, METEOR, s-t sim-{input-generated, best-refs}, BERTScore, stanford-grammar

TODO: <ratio measure: similarity among generated paraphrases / semantic similarity between generated and input>

Metrics are computed on lower-cased text tokenized with nltk.word_tokenize.
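As an illustration of this setup, BLEU against multiple references can be computed roughly as below (a sketch using nltk; the repo's evaluation script may differ in smoothing and n-gram weights):

import nltk
from nltk.tokenize import word_tokenize
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

nltk.download("punkt")

generated = "what is your age?"
references = ["How old are you?", "What's your age?"]

# Lower-case and tokenize with nltk.word_tokenize, as described above.
hyp = word_tokenize(generated.lower())
refs = [word_tokenize(r.lower()) for r in references]

# BLEU computed against all available reference paraphrases.
score = sentence_bleu(refs, hyp, smoothing_function=SmoothingFunction().method1)
print(score)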

stanford-grammar

sim-{input-generated, best-refs}

Citation
