README IN PROGRESS
This repo contains the code for the paper SMCLM: Semantically Meaningful Causal Language Modeling for Autoregressive Paraphrase Generation, by .... We propose an autoregressive paraphrase generation model based on the pre-trained OpenAI GPT-2 model and SentenceTransformers sentence embeddings.
First, create a fresh virtualenv and install the dependencies:
pip install -r requirements.txt
You can also use the `./docker/Dockerfile` file to build a Docker image and run the code in a container.
Then download (or generate) the dataset splits you want to work with:
Download our split of CNN News
The data zip should be unzipped into `./.data/<dataset_name>`, e.g. `./.data/quora`.
You can also generate the data files yourself with the generation script:
python3 ./dataset/generate.py --dataset_name quora --valid_size 1_000 --test_size 10_000 --seed 42 --output_dir <path>
The script creates a `<dataset_name>` folder in the `--output_dir` path (default `./data`), containing `{train,dev,test}.tsv` and `unsupervised.txt` files.
The `{train,dev}.tsv` files are used for supervised training (only to reproduce the results of competing methods) and `test.tsv` for evaluation.
The `unsupervised.txt` file is used for training unsupervised methods and for fine-tuning the BleuMacaw model.
Available dataset names:
- quora
- msrp
- paws (only Wiki)
- mscoco
Each line in the `{train,dev}.tsv` files contains a `\t`-separated (input, paraphrase) pair.
Each line in the `test.tsv` file contains a `\t`-separated (input, [paraphrase0, ..., paraphraseN]) pair.
Each list of reference paraphrases is built according to the rules of the raw data files. We take care to compute metrics such as BLEU or ROUGE using as many reference paraphrases as possible.
The `unsupervised.txt` file simply contains one sentence per line.
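For reference, the sketch below shows one way to load these files; the dataset path and the assumption that `test.tsv` keeps the input in the first column with all reference paraphrases in the remaining tab-separated columns are illustrative, not guaranteed by the repository.

```python
# A minimal loading sketch for the data files described above.
# Assumptions: the dataset path is illustrative, and test.tsv stores the input
# in the first column with the reference paraphrases in the remaining columns.
import csv
from pathlib import Path

data_dir = Path("./.data/quora")

# train.tsv / dev.tsv: one (input, paraphrase) pair per line.
with open(data_dir / "train.tsv", newline="", encoding="utf-8") as f:
    train_pairs = [(row[0], row[1]) for row in csv.reader(f, delimiter="\t")]

# test.tsv: an input sentence followed by its reference paraphrases.
with open(data_dir / "test.tsv", newline="", encoding="utf-8") as f:
    test_examples = [(row[0], row[1:]) for row in csv.reader(f, delimiter="\t")]

# unsupervised.txt: one sentence per line.
with open(data_dir / "unsupervised.txt", encoding="utf-8") as f:
    sentences = [line.strip() for line in f if line.strip()]

print(len(train_pairs), len(test_examples), len(sentences))
```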
To add your own dataset, implement a subclass of the `dataset.abstract.ParaCorpus` class.
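The actual abstract interface is defined in the repository; the sketch below only suggests the general shape of such a subclass, and the `pairs` method name is hypothetical.

```python
# Hypothetical sketch of a ParaCorpus subclass; the real abstract class in
# dataset/abstract.py may declare different (abstract) methods.
from dataset.abstract import ParaCorpus


class MyCorpus(ParaCorpus):
    """Illustrative corpus backed by a raw tab-separated file."""

    def __init__(self, raw_path):
        self.raw_path = raw_path

    def pairs(self):  # hypothetical method name
        # Yield (input, paraphrase) pairs parsed from the raw data file.
        with open(self.raw_path, encoding="utf-8") as f:
            for line in f:
                source, paraphrase = line.rstrip("\n").split("\t")
                yield source, paraphrase
```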
To adapt the BleuMacaw model to sentence-level data, we pre-trained it on the OpenWebText corpus (https://huggingface.co/datasets/openwebtext). OpenWebText is an open-source recreation of the WebText corpus originally used to train GPT-2. The text is web content extracted from URLs shared on Reddit with at least three upvotes (about 40 GB). We used `nltk.sent_tokenize` to extract sentences from the text. To keep the cost manageable, we limit the number of articles taken from the corpus to 1,000,000 (approx. 38,261,000 sentences).
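The preprocessing itself is handled by the training script; purely as an illustration of this step, sentence extraction could look roughly like the following (the article limit mirrors the 1M figure above):

```python
# Rough illustration of the sentence-extraction step, not the repository's
# exact preprocessing: stream OpenWebText and split articles into sentences.
import nltk
from datasets import load_dataset
from nltk import sent_tokenize

nltk.download("punkt")  # tokenizer models required by sent_tokenize

articles = load_dataset("openwebtext", split="train", streaming=True)

ARTICLE_LIMIT = 1_000_000  # mirrors the 1M-article limit described above
sentences = []
for i, article in enumerate(articles):
    if i >= ARTICLE_LIMIT:
        break
    sentences.extend(sent_tokenize(article["text"]))
```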
To pre-train a BleuMacaw model, please use the example command:
python3 ./run_clm.py --sentence_transformer paraphrase_mpnet_base_v2 --data_limit 1_000_000 --epochs 5 --batch_size 32 --checkpoint_steps 100_000 --warmup_steps 100_000
Steps performed by the command:
- Download the OpenWebText data files and extract them into the cache folder (about 40 GB of text data)
- Download the GPT-2 and SentenceTransformers models
- Tokenize the text data and cache it in the cache folder (for 768-length vectors and the 1M-article limit this is approx. 120 GB of data). The cached files are reused by the next training run when the same sentence encoder and article limit are used.
- Start training the model and save checkpoints
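For intuition only, the snippet below sketches one plausible way a SentenceTransformers embedding could condition GPT-2, by prepending it as a single soft prefix via `inputs_embeds`. This is an assumption for illustration; the actual BleuMacaw/SMCLM training logic lives in `run_clm.py` and may differ.

```python
# Illustrative only: one possible way to condition GPT-2 on a sentence
# embedding (prepended as a single "soft" prefix token). The actual
# BleuMacaw/SMCLM training code in run_clm.py may differ.
import torch
from sentence_transformers import SentenceTransformer
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

encoder = SentenceTransformer("paraphrase-mpnet-base-v2")
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Project the 768-d sentence embedding into GPT-2's embedding space.
projection = torch.nn.Linear(768, model.config.n_embd)

sentence = "How can I learn Python fast?"
sent_emb = torch.tensor(encoder.encode([sentence]))      # (1, 768)
prefix = projection(sent_emb).unsqueeze(1)               # (1, 1, n_embd)

input_ids = tokenizer(sentence, return_tensors="pt").input_ids
token_embs = model.transformer.wte(input_ids)            # (1, T, n_embd)
inputs_embeds = torch.cat([prefix, token_embs], dim=1)   # (1, T+1, n_embd)

# Ignore the prefix position in the language-modeling loss.
labels = torch.cat(
    [torch.full((1, 1), -100, dtype=torch.long), input_ids], dim=1
)
loss = model(inputs_embeds=inputs_embeds, labels=labels).loss
```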
To fine-tune the pre-trained BleuMacaw model on a dataset's `unsupervised.txt` file, use the example command:
python3 ./run_clm.py --dataset_name quora --epochs 5 --batch_size 32 --checkpoint_steps 10_000 --warmup_steps 10_000
Hugging Face's GPT-2 implementation is available in five different sizes (https://huggingface.co/transformers/v2.2.0/pretrained_models.html): small (117M), medium (345M), large (774M), xl (1558M), and a distilled version of the small checkpoint, DistilGPT-2 (82M). To reduce the computational cost, we use the small GPT-2 model. Future research should include the bigger versions of the model.
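All of these checkpoints can be loaded by name from the Hugging Face Hub ("gpt2", "gpt2-medium", "gpt2-large", "gpt2-xl", "distilgpt2"); for example:

```python
# Load the small GPT-2 checkpoint used here; swap the name for "gpt2-medium",
# "gpt2-large", "gpt2-xl" or "distilgpt2" to try other sizes.
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")
print(sum(p.numel() for p in model.parameters()))  # rough parameter count
```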
SentenceTransformers encoders (for the --sentence_transformer option; see the quick check below):
- paraphrase-mpnet-base-v2
- all-mpnet-base-v2
- paraphrase-distilroberta-base-v2
- multi-qa-distilbert-dot-v1 (0.2M articles)
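To verify one of these encoder names and inspect its embedding dimensionality (the 768-length vectors mentioned above), a quick check such as the following can be used:

```python
# Quick sanity check of a sentence encoder and its embedding size.
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("paraphrase-mpnet-base-v2")
embeddings = encoder.encode(["How do I learn Python quickly?"])
print(embeddings.shape)  # (1, 768) for the mpnet-based encoders
```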
Evaluation metrics: BLEU, self-BLEU, oriBLEU, ROUGE-{1,2,L}, METEOR, SentenceTransformer similarity (input-generated, best-refs), BERTScore, stanford-grammar.
TODO: <ratio measure: similarity of generated / semantic similarity generated-input>
Metrics are computed on lower-cased text tokenized with `nltk.word_tokenize`; a minimal multi-reference BLEU sketch is shown below.
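The sketch below computes multi-reference BLEU in the spirit of the evaluation described above (lower-cased text, `nltk.word_tokenize`); the repository's own evaluation script may use different smoothing and settings, and the example sentences are hypothetical.

```python
# Minimal multi-reference BLEU sketch (requires the nltk "punkt" resource,
# e.g. nltk.download("punkt")). Not the repository's exact evaluation code.
from nltk import word_tokenize
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

def tokenize(text):
    return word_tokenize(text.lower())

# Hypothetical example in the test.tsv spirit: several reference paraphrases
# for one input, plus one generated paraphrase to score.
references = [[tokenize("How can I learn Python fast?"),
               tokenize("What is the quickest way to learn Python?")]]
hypotheses = [tokenize("How do I learn Python quickly?")]

score = corpus_bleu(references, hypotheses,
                    smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.4f}")
```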