Welcome to the repository for the McGill University COMP550 Natural Language Processing project "llmbda". This repository contains all the code and resources required to replicate the findings and experiments presented in the report.
Since late 2022, large language models (LLMs) like ChatGPT have gained popularity in research and industry due to their ability to perform diverse natural language processing tasks with human-like proficiency. In this study, we investigate the ability of various large language models to understand logic, including open-source models like Llama 2 and closed-source models such as GPT and Gemini, on the task of parsing semi-formal natural language into propositional logic. Our analysis compares different alignment methods: zero-shot prompting, few-shot prompting, and supervised fine-tuning.
We observe that LLMs perform well on this task: with appropriate training data, and especially when fine-tuned on a dedicated dataset, LLMs can understand logical semantics. However, chat-instruct and general-purpose LLMs suffer from inconsistent performance on this task.
Our findings suggest that there may be potential downstream engineering and research use cases for LLMs on semantics-related tasks, given their understanding of logic.
The GPT and Gemini scripts make calls to the official OpenAI and Google APIs, and do not have specific system requirements.
System requirements:
- Ubuntu (some libraries used by Hugging Face's Transformers API require Linux)
- CUDA (torch devices are set to `cuda`)
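To verify that PyTorch can see your GPU before running the local scripts, a quick sanity check (assuming dependencies are already installed as described below):

```bash
# Should print True if torch was built with CUDA support and a GPU is visible
poetry run python3 -c "import torch; print(torch.cuda.is_available())"
```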
Dependency management is done using Poetry; dependencies can be installed via:

`poetry install`
To run a single script, use:
`poetry run python3 <script_name>.py`
To activate the virtual environment for your shell, run:

`poetry shell`
Experiment scripts are located under `/experiments`; each experiment has its own script. Only the Llama fine-tuning requires local training.
Make note of the Llama fine-tuning, which uses the `sft_finetune.py` script from Hugging Face. To run it with the same options as the experiment, run the bash script:

`experiments/llama2/finetune.sh`
Additionally, you can experiment with different options in the bash script. The script already specifies memory-efficient training (4-bit quantization + PEFT); our setup ran on a single RTX 3060 12GB. To further reduce memory usage, try reducing the batch size in the bash script.
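As a rough illustration, a memory-efficient invocation might look like the sketch below. The flag names follow Hugging Face's TRL SFT example script and are assumptions; check `finetune.sh` for the options actually used.

```bash
# Hypothetical sketch, not the actual finetune.sh: flag names are assumed
# from Hugging Face's TRL SFT example and may differ in this repository.
python3 sft_finetune.py \
  --model_name meta-llama/Llama-2-7b-chat-hf \
  --load_in_4bit \
  --use_peft \
  --per_device_train_batch_size 1 \
  --gradient_accumulation_steps 8
```

Lowering the batch size while raising gradient accumulation keeps the effective batch size constant while reducing peak VRAM usage.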
Inference with GPT and Gemini can be done with their corresponding scripts under `/experiments`; ensure that you have a valid API key.
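For example, assuming the scripts pick up keys from the environment variables conventionally used by the OpenAI and Google client libraries (check the scripts for the actual mechanism):

```bash
# Assumed convention: API keys supplied via environment variables.
export OPENAI_API_KEY="sk-..."   # used by the GPT scripts
export GOOGLE_API_KEY="..."      # used by the Gemini scripts
```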
Llama inference can be done locally by running `llama2_chat.py` for the chat-instruct Llama-2-chat model, and `llama2_finetuned.py` for the fine-tuned model. The latter script assumes that you have a fine-tuned model locally, with the default model path being the same directory.
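For instance, assuming the scripts sit under `experiments/llama2/` alongside `finetune.sh` (adjust the paths if they live elsewhere):

```bash
# Chat-instruct model
poetry run python3 experiments/llama2/llama2_chat.py

# Fine-tuned model (expects fine-tuned weights at the default local path)
poetry run python3 experiments/llama2/llama2_finetuned.py
```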
Evaluation is pipelined with the `eval.py` script. Simply run `eval.py` with the desired options:
- `model_name`: the name of the model
- `label_path`: the path to the CSV containing the correct labels
- `pred_path`: the path to the CSV containing the predictions
- `--log_result`: whether to save the result in a text file. Default is `False`.
- `--ans_text_field`: column name of the correct labels in the labeled CSV. Default is `'object_tree'`.
- `--pred_text_field`: column name of the predictions in the prediction CSV. Default is `'predictions'`.
- `--join_on`: the key column name to join the two CSVs on. Default is `'index'`.
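A full invocation might look like the following, where the model name and CSV paths are hypothetical placeholders and the optional flags are shown with their default values:

```bash
# Hypothetical example: substitute your own model name and CSV paths.
poetry run python3 eval.py my_model labels.csv predictions.csv \
  --ans_text_field object_tree \
  --pred_text_field predictions \
  --join_on index
```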
A sample eval recipe is located in the `justfile`; you can run it with:

`just eval {{model_name}} {{pred_path}}`
For a comprehensive understanding of our project, methodologies, and detailed results, please refer to our project report.