Project for training BERT-based Aspect-Based Sentiment / Emotion Analysis models. The implementation is based on the ABSA-PyTorch repository, with custom extensions such as preprocessing data from the .xlsx and .csv formats, sentence segmentation based on the sentence-splitter Python package, etc.
absa_babel_finetune/preprocessors/excel_to_sentences.py
- Processes input .xlsx files. Only the columns containing the ID and the text of the input files are parsed.
- The output file is a table in .csv format that contains the ID and the sentences produced by segmenting the text, in the following format:
- Each row is one sentence, whose ID is the original ID of the text + the character "_" + the sentence line number. The original ID of the text can therefore be recovered by stripping everything after the last "_".
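The ID scheme above can be sketched as follows. This is a hypothetical stand-in: the real preprocessor uses the sentence-splitter package, so the naive regex split here is only illustrative.

```python
import re

def segment_rows(rows):
    """Split each (id, text) pair into (id_n, sentence) rows.

    The actual script segments with the sentence-splitter package; a naive
    split on sentence-ending punctuation stands in here.
    """
    out = []
    for row_id, text in rows:
        sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
        for n, sentence in enumerate(sentences, start=1):
            # Output ID = original ID + "_" + sentence line number.
            out.append((f"{row_id}_{n}", sentence))
    return out
```

Recovering the original text ID is then `sent_id.rsplit("_", 1)[0]`, which works regardless of how many digits the sentence number has.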
examples_predict.py
Assigns predictions to texts previously segmented into sentences (e.g. with preprocessors/sentence_splitter.py). The script uses the DataPreparator class from preprocessors/prepeare_data_for_prediction.py and the Predictor class from src/prediction.py.
The former is responsible for recognizing, in the sentences specified by the text_column variable of the config.py configuration file, the Named Entities for which a prediction can be made. As an internal representation, it stores the received data in a Python dictionary, which is currently not serialized at runtime.
The latter's task is to assign a prediction to each prepared (Named Entity + sentence) pair, using the BERT model initialized with the model_parameters options in config.py and the PyTorch checkpoint also specified there.
The output is an .xlsx file with the IDs, text fields, and the predictions given by the model (the latter stored in a column named by the predictions_column variable of config.py). The name of the output file is the original filename, extended with a '_predictions' suffix.
examples_train.py
The script uses the Trainer class from src/training.py, which initializes the given BERT model with the parameters stored in config.py and then fine-tunes the original model in the standard way (using early stopping).
The output is a PyTorch checkpoint that can be loaded later at prediction time.
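The early-stopping loop described above typically looks like this minimal sketch; the class name and patience value are assumptions, not the actual interface of src/training.py.

```python
class EarlyStopper:
    """Stop fine-tuning when validation loss hasn't improved for `patience` epochs.

    A stand-in for the early-stop logic in the Trainer; the real code also
    saves a PyTorch checkpoint (torch.save) whenever the loss improves.
    """
    def __init__(self, patience=3):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
            return False          # improved: keep training, checkpoint here
        self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```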
Model training is preceded by the sentence segmentation already described (if the data is not yet in the required format).
This is followed by Named Entity Recognition (using the spaCy language model specified in the spacy_model_name variable of config.py) and the construction of sentence + Named Entity pairs, also as already described.
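The sentence + entity pairing can be sketched as below. The sentence-pair format shown follows the scheme used by ABSA-PyTorch's BERT-SPC-style models (sentence and aspect as two segments); whether this project uses exactly this template is an assumption, and tokenization to IDs is left to the tokenizer.

```python
def make_sentence_entity_pairs(sentence, entities):
    """Pair one sentence with each recognized entity as a BERT sequence pair.

    The "[CLS] ... [SEP] ... [SEP]" template is illustrative; the aspect
    (Named Entity) is appended as the second segment.
    """
    return [f"[CLS] {sentence} [SEP] {entity} [SEP]" for entity in entities]
```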
If necessary, the train and test datasets can be created manually using the stratified_split function in preprocessors/stratified_split.py, which preserves the label distribution of the original dataset in both the train and test sets.
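A stratified split of that kind can be sketched as follows; this is a hypothetical stand-in for preprocessors/stratified_split.py, and the parameter names are assumptions.

```python
import random
from collections import defaultdict

def stratified_split(samples, test_ratio=0.2, seed=42):
    """Split (text, label) samples so label proportions match in both halves.

    Group by label, shuffle each group, and cut the same fraction from every
    group, so the test set mirrors the original label distribution.
    """
    by_label = defaultdict(list)
    for sample in samples:
        by_label[sample[1]].append(sample)
    rng = random.Random(seed)
    train, test = [], []
    for group in by_label.values():
        rng.shuffle(group)
        cut = int(round(len(group) * test_ratio))
        test.extend(group[:cut])
        train.extend(group[cut:])
    return train, test
```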
License: MIT