This repository provides code for training and testing models for grammatical error correction (GEC) on code-switching (CSW) text. It is based on, and forked from, the GECTOR repository (https://github.com/grammarly/gector). Our main contributions can be found in the `utils` folder, except for `utils/helpers.py` and `utils/preprocess_data.py`. Those two files, together with `train.py` and `predict.py`, are modified versions of the files in the GECTOR repository, adapted for code-switching.
The following command installs all necessary packages:
pip install -r requirements.txt
The project was tested using Python 3.7.
Optionally, install ERRANT to evaluate the output of the model:
pip install errant
Our experiments used datasets derived from the following sources:
- All the public GEC datasets used in the paper can be downloaded from here.
- Synthetically created datasets can be generated/downloaded here.
Once the above datasets are downloaded, the following command can be used to generate a dataset similar to the one used in the paper. To perform data augmentation, simply run:
python utils/substitute_gcm.py INPUT_INV_M2_FILE OUTPUT_CS_INCORR_PATH OUTPUT_CS_CORR_PATH SRC_LANG TGT_LANG SELECTION_METHOD
Arguments:
- `INPUT_INV_M2_FILE`: inverted M2 file generated using ERRANT. This M2 file should specify the edits required to generate an incorrect sentence from a correct sentence. It can be generated with `errant_parallel -orig <correct_file> -cor <incorrect_file> -out <out_m2>`.
- `OUTPUT_CS_INCORR_PATH`: output path for the incorrect (erroneous) CSW parallel text.
- `OUTPUT_CS_CORR_PATH`: output path for the correct (error-free) CSW parallel text.
- `SRC_LANG`: source language of the input M2 file, denoted with an ISO 639-1 code (for supported languages).
- `TGT_LANG`: target language for the CSW component, denoted with an ISO 639-1 code.
- `SELECTION_METHOD`: method used to select the component to translate for CSW. Possible selection methods include:
  - `ratio-token`: randomly select tokens from the sentence based on a reference corpus distribution
  - `cont-token`: randomly select a contiguous span of tokens to match the ratio of code-switched text
  - `rand-phrase`: randomly select phrases from the sentence
  - `ratio-phrase`: select the phrase whose length is closest to the reference corpus distribution
  - `overlap-phrase`: select the phrases that intersect with the fewest edit spans
  - `noun-token`: randomly select a single token with a NOUN or PROPN POS tag, as tagged by spaCy
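For intuition, the token-level selection can be sketched as follows. This is a simplified, hypothetical illustration (the function name `ratio_token_select` and the fixed `cs_ratio` are our own; the repository's `ratio-token` method draws the ratio from a reference corpus distribution rather than using a constant):

```python
import random

def ratio_token_select(tokens, cs_ratio, rng=None):
    """Pick token indices to translate so that roughly `cs_ratio`
    of the sentence ends up code-switched."""
    rng = rng or random.Random(0)
    n_switch = max(1, round(len(tokens) * cs_ratio))
    return sorted(rng.sample(range(len(tokens)), n_switch))

tokens = "I went to the market yesterday".split()
picked = ratio_token_select(tokens, cs_ratio=0.3)
# the tokens at indices `picked` would then be translated into TGT_LANG
```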
Before training the model, the data has to be preprocessed and converted to a special format with the command:
python utils/preprocess_data.py -s SOURCE_FILE -t TARGET_FILE -o OUTPUT_FILE
To train the model, simply run:
python train.py --train_set TRAIN_SET_PATH --dev_set DEV_SET_PATH \
--model_dir MODEL_DIR_PATH
There are a number of parameters that can be specified. Among them:
- `cold_steps_count`: the number of epochs during which only the last linear layer is trained
- `transformer_model {bert,distilbert,gpt2,roberta,transformerxl,xlnet,albert}`: the encoder model
- `tn_prob`: probability of sampling sentences with no errors; helps to balance precision/recall
- `pieces_per_token`: maximum number of subword pieces per token; helps to avoid CUDA out-of-memory errors
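As a rough illustration of what `tn_prob` controls, the sampling of error-free pairs can be sketched like this (`balance_corpus` is a hypothetical helper, not the repository's implementation):

```python
import random

def balance_corpus(pairs, tn_prob, seed=0):
    """Keep every (source, target) pair that contains an edit; keep
    identical (error-free) pairs only with probability `tn_prob`."""
    rng = random.Random(seed)
    return [(s, t) for s, t in pairs
            if s != t or rng.random() < tn_prob]
```

A higher `tn_prob` keeps more unchanged sentences, pushing the model toward higher precision at the cost of recall.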
All parameters used for training and evaluation are exactly the same as in GECTOR and can be found here.
To generate the CSW Lang-8 dataset (used as our test dataset) from the Lang-8 dataset, we can use the `filter_cs.py` script to filter out sentences containing CSW text.
python data_gen/filter_cs.py <lang8_input_path.dat> -out <json_output_path.json>
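For intuition only, a crude script-mixing heuristic along these lines can flag cross-script CSW sentences (this is not the actual `filter_cs.py` logic, and it would miss same-script language pairs such as English-Spanish):

```python
import unicodedata

def is_code_switched(sentence):
    """Heuristic: flag a sentence as code-switched if its tokens mix
    writing scripts (e.g. Latin and CJK)."""
    def script_of(token):
        for ch in token:
            if ch.isalpha():
                name = unicodedata.name(ch, "")
                # first word of the Unicode name, e.g. "LATIN", "CJK"
                return name.split()[0] if name else None
        return None
    scripts = {s for s in map(script_of, sentence.split()) if s}
    return len(scripts) > 1
```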
We can then sort the sentences by CSW language using `sort_lang.py`. The resulting JSON files can then be converted to ERRANT-style M2 files using `json_to_m2.py`.
cd data_gen
mkdir l1s_cor
python sort_lang.py <json_output_path.json>
python json_to_m2.py l1s_cor/<language.json>
To generate the human re-annotated dataset from the CSW Lang-8 dataset, we need to install ERRANT:
python3 -m venv errant_env
source errant_env/bin/activate
pip install -U pip setuptools wheel
pip install errant==2.3.3
python3 -m spacy download en_core_web_sm
We can then run the `create_human.py` script to generate the human re-annotated dataset.
cd data_gen
python create_human.py l1s_cor/<language.json> <language>.csw.test.id.m2 <output.m2>
The output of the `create_human.py` script is the M2 file used for evaluation on the human re-annotated dataset.
To run your model on the input file use the following command:
python predict.py --model_path MODEL_PATH [MODEL_PATH ...] \
--vocab_path VOCAB_PATH --input_file INPUT_FILE \
--output_file OUTPUT_FILE
Among the parameters:
- `min_error_probability`: minimum error probability (as in the paper)
- `additional_confidence`: confidence bias (as in the paper)
- `special_tokens_fix`: required to reproduce some reported results of pretrained models
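The effect of the first two parameters can be sketched per token. This is a hypothetical simplification of GECToR-style inference filtering, not the repository's code:

```python
def pick_label(label_probs, min_error_probability=0.0,
               additional_confidence=0.0):
    """Boost the $KEEP label by a confidence bias, then reject any
    correction whose probability falls below the error threshold."""
    probs = dict(label_probs)
    probs["$KEEP"] = probs.get("$KEEP", 0.0) + additional_confidence
    best = max(probs, key=probs.get)
    if best != "$KEEP" and probs[best] < min_error_probability:
        return "$KEEP"
    return best
```

Raising either value makes the model more conservative: more tokens are left unchanged, trading recall for precision.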
For evaluation, we use ERRANT.
errant_compare -hyp <hyp_m2> -ref <ref_m2>
The code for GECTOR is distributed under the Apache 2.0 license. All code generated from this project (including but not limited to the code used to perform data augmentation, generate the CSW Lang-8 dataset, and generate the human re-annotated dataset) is distributed under the CC BY-NC 4.0 license.