The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe Large Language Models (LLMs) and extrapolate their future capabilities. This is a modified version of the original BIG-bench repository. The modifications were made by Viktoria Köhnlein as part of the master's thesis "Bewertung und Erweiterung eines linguistischen Benchmark-Frameworks für große Sprachmodelle (LLMs)" ("Evaluation and Extension of a Linguistic Benchmark Framework for Large Language Models (LLMs)").
- Master Thesis: Modification of BIG-bench
- Contributions
- Use of Modified Repository
- Repository Structure
- General Information on BIG-bench
- Citation and Attribution
## Master Thesis: Modification of BIG-bench

This repository forms part of the master's thesis "Bewertung und Erweiterung eines linguistischen Benchmark-Frameworks für große Sprachmodelle (LLMs)". The practical part of the thesis focuses on the use of Google BIG-bench. Its primary goals are to analyze the capabilities of LLMs and to design benchmarks that challenge these models.
Benchmark tasks were developed according to the requirements specified for BIG-bench. Additionally, the LLM Llama was integrated alongside the LLMs already included in the initial installation of BIG-bench.
## Contributions

- **Development of Three Tasks**:
  - V1: A benchmark designed to investigate the comprehension of semantic ambiguity in LLMs. The task is to identify the ambiguous word in a sentence.
  - V2: A benchmark designed to investigate the comprehension of semantic ambiguity in LLMs. The task is to choose the sentence that uses an ambiguous word correctly in context.
  - V3: A benchmark designed to investigate the comprehension of syntactic ambiguity in LLMs. The task involves identifying the underlying structures and dependencies in garden-path sentences.

  Each task includes multiple subtasks to test specific aspects of model reasoning and language comprehension; an illustrative task.json sketch is shown after this list.
- **Integration of Llama**:
  - Various models from the meta-llama/Llama-2 family have been successfully integrated into Google BIG-bench.
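Each benchmark is defined as a standard BIG-bench JSON task. The sketch below shows roughly how a V2-style multiple-choice subtask could be laid out; the field names follow the BIG-bench task schema, but the task name, example sentences, and scores are invented for illustration and do not reproduce the actual thesis data.

```json
{
  "name": "semantic_ambiguity_v2",
  "description": "Choose the sentence that uses an ambiguous word correctly in context.",
  "keywords": ["multiple choice"],
  "metrics": ["multiple_choice_grade"],
  "preferred_score": "multiple_choice_grade",
  "append_choices_to_input": true,
  "examples": [
    {
      "input": "Which sentence uses the word 'bank' in its financial sense?",
      "target_scores": {
        "She deposited the check at the bank before noon.": 1,
        "They had a picnic on the bank of the river.": 0
      }
    }
  ]
}
```

The "append_choices_to_input" field, which the result folders described below refer to, controls whether the answer options are appended to the prompt before the model scores them.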
## Use of Modified Repository

Use the notebook Setup_ModifiedRepository.ipynb to evaluate the available LLMs (including Llama) with the benchmarks created for the master's thesis. The notebook is located in the folder "notebooks"; before using it, please follow the instructions provided in the "README.md" file in that folder.
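The notebook drives the complete evaluation, but it ultimately relies on the standard BIG-bench Python API. The following is a minimal sketch of that underlying mechanism, assuming the unmodified bigbench.api.json_task and bigbench.models.huggingface_models interfaces; the task path and name are placeholders, not the actual benchmark names.

```python
import bigbench.api.json_task as json_task
import bigbench.models.huggingface_models as huggingface_models

# Load one of the thesis benchmarks from its task.json
# (placeholder path; substitute the actual benchmark folder).
task = json_task.JsonTask("bigbench/benchmark_tasks/semantic_ambiguity_v2/task.json")

# Any model implementing the bigbench.api.model.Model interface can be scored;
# "gpt2" ships with the default BIG-bench installation.
model = huggingface_models.BIGBenchHFModel(model_name="gpt2")

score_data = task.evaluate_model(model)

# Tasks with subtasks return a list of ScoreData objects, otherwise a single one.
if not isinstance(score_data, list):
    score_data = [score_data]
for sd in score_data:
    print(sd.subtask_description, sd.score_dict)
```

Evaluating a Llama model works the same way once a wrapper implementing the Model interface is available (see the sketch in the Repository Structure section below).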
## Repository Structure

Only the parts of the original repository relevant to evaluating the created benchmarks were retained. The folders listed below are relevant to the modifications made and to the functionality of the repository:
- scripts_modification: Contains the Python scripts relevant to the practical part of the master's thesis.
- bigbench/benchmark_tasks: Includes the three designed benchmarks.
- bigbench/models: Contains the definitions of the model classes used for evaluation. This folder also includes the script llama_model.py, which handles the implementation of the LLM Llama (a sketch of such a wrapper is shown after this list).
- datasets: Contains the datasets used for task creation.
- results_excel: Contains Excel files summarizing the evaluation results. Additionally, the results for each benchmark are stored in their respective folders: the files in the results folder correspond to evaluation with the setting "append_choices_to_input": true, while the files in the [Benchmark_Name]_acti_results folder correspond to evaluation with the setting "append_choices_to_input": false.
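For orientation, integrating a new model into BIG-bench amounts to implementing the bigbench.api.model.Model interface (generate_text, cond_log_prob, model_data). The sketch below shows what a Llama wrapper along the lines of llama_model.py could look like; the class name, defaults, and simplifications are assumptions and do not reproduce the exact thesis implementation.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

import bigbench.api.model as model_api


class LlamaModel(model_api.Model):
    """Hypothetical wrapper exposing a Llama-2 checkpoint to BIG-bench."""

    def __init__(self, model_name="meta-llama/Llama-2-7b-hf"):
        self._tokenizer = AutoTokenizer.from_pretrained(model_name)
        self._model = AutoModelForCausalLM.from_pretrained(
            model_name, torch_dtype=torch.float16, device_map="auto"
        )
        self._model.eval()

    def generate_text(self, inputs, max_length=256, stop_string=None, output_regex=None):
        # BIG-bench passes either a single prompt or a list of prompts.
        prompts = [inputs] if isinstance(inputs, str) else inputs
        outputs = []
        for prompt in prompts:
            enc = self._tokenizer(prompt, return_tensors="pt").to(self._model.device)
            gen = self._model.generate(**enc, max_new_tokens=max_length, do_sample=False)
            new_tokens = gen[0][enc["input_ids"].shape[1]:]
            outputs.append(self._tokenizer.decode(new_tokens, skip_special_tokens=True))
        # stop_string / output_regex post-processing is omitted in this sketch.
        return outputs[0] if isinstance(inputs, str) else outputs

    def cond_log_prob(self, inputs, targets, absolute_normalization=False):
        # Score each candidate answer by the summed log-probability of its tokens
        # (used by multiple-choice metrics such as multiple_choice_grade).
        def score(prompt, target):
            full = self._tokenizer(prompt + target, return_tensors="pt").to(self._model.device)
            prompt_len = self._tokenizer(prompt, return_tensors="pt")["input_ids"].shape[1]
            with torch.no_grad():
                logits = self._model(**full).logits
            log_probs = F.log_softmax(logits[0, :-1].float(), dim=-1)
            labels = full["input_ids"][0, 1:]
            return sum(
                log_probs[i, labels[i]].item()
                for i in range(prompt_len - 1, labels.shape[0])
            )

        if isinstance(inputs, str):
            return [score(inputs, t) for t in targets]
        return [[score(p, t) for t in tgts] for p, tgts in zip(inputs, targets)]

    def model_data(self):
        # Metadata block used in BIG-bench result files; see bigbench/api/model.ModelData.
        raise NotImplementedError
```

The actual llama_model.py may differ, for example in batching, stop-string and regex handling, or the metadata it reports via model_data.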
## General Information on BIG-bench

For further information on BIG-bench, refer to the original repository: https://github.com/google/BIG-bench. General instructions regarding installation, task implementation, etc. can be found there.
## Citation and Attribution

This repository is a modification of the original BIG-bench repository.
If you use this work, please also cite the original BIG-bench analysis paper:
```bibtex
# TMLR
@article{srivastava2023beyond,
  title={Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models},
  author={BIG-bench authors},
  journal={Transactions on Machine Learning Research},
  issn={2835-8856},
  year={2023},
  url={https://openreview.net/forum?id=uyTL5Bvosj},
  note={}
}
```