Caution
This repository and its contents are part of the paper The Impact of Copyrighted Material on Large Language Models: A Norwegian Perspective, and are released for transparency and reproducibility purposes only. We discourage the use of this evaluation suite in its current state. An improved version of the codebase is now available and maintained as NorEval.
The evaluation codebase relies on the lm-evaluation-harness framework (version 0.4.2). In particular, each task in the mimir-evaluation-suite is integrated into the framework through .yaml configuration files, which allow for a high degree of task customizability.
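For instance, once this configuration directory is passed to the harness, the mimir tasks show up in its task registry. A minimal sketch, assuming the `lm_eval` CLI and its `--tasks list` option behave as in the 0.4.x releases:

```bash
# List all tasks the framework can see, including the ones registered
# from the .yaml files shipped in this repository's ./mimir/ directory.
lm_eval --tasks list --include_path ./mimir/
```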
The mimir-evaluation-suite includes various tasks, which are divided into four groups: (i) text classification, (ii) question answering, (iii) ranking sentence pairs, and (iv) text generation.
Each table below has the following columns:
- Task name (default): a task name for a configuration file with one default prompt. Even if the task has both Bokmål and Nynorsk dataset versions, the default one is set to Bokmål.
- Bokmål / Nynorsk (N prompts): task names for configuration files with N prompts for the Bokmål and Nynorsk dataset versions, respectively.
- N ranges between 4 and 6. Tasks in the "ranking sentence pairs" group do not use any prompts; the evaluation consists of selecting the most probable sentence in a pair.
- ❌ means that the task does not have a dataset for a given written standard.
- 0-shot / few-shot: support for the zero-shot and few-shot evaluation setups, respectively.
- ✅ means that one can run evaluation in a given setup.
- ❌ denotes that a given setup is not supported due to the lack of a training or validation set from which to sample demonstration examples.
- Task category: the task formulation or category.
- HuggingFace: a link to the dataset on HuggingFace.
(1) Some HF datasets from the mimir-project repository are currently not available; they will be released later.
(2) The .yaml configuration files are created according to version 0.4.2 of the lm-evaluation-harness framework. This repository will not be updated beyond version 0.4.2. For an up-to-date codebase, use NorEval; however, version 0.4.2 can still be used for empirical evaluation experiments.
Text classification
Task name (default) | Bokmål / Nynorsk (N prompts) | 0-shot / few-shot | Task category | HuggingFace |
---|---|---|---|---|
norec_sentence | norec_sentence_nb / ❌ | ✅ / ✅ | Sentiment analysis | ltg/norec_sentence |
norec_document | norec_document_nb / ❌ | ✅ / ✅ | Sentiment analysis | ltg/norec_document |
Question answering
Task name (default) | Bokmål / Nynorsk (N prompts) | 0-shot / few-shot | Task category | HuggingFace |
---|---|---|---|---|
norquad | norquad_nb / ❌ | ✅ / ✅ | Reading comprehension | ltg/norquad |
belebele_nob_Latn | norbelebele_nb / ❌ | ✅ / ❌ | Reading comprehension | facebook/belebele |
nrk | nrk_nb / nrk_nn | ✅ / ❌ | World knowledge | mimir-project/nrk |
noropenbookqa | noropenbookqa_nb / noropenbookqa_nn | ✅ / ✅ | World knowledge | mimir-project/noropenbookqa |
❌ (use fact as part of the input) | noropenbookqa_nb_use_fact / noropenbookqa_nn_use_fact | ✅ / ✅ | World knowledge | mimir-project/noropenbookqa |
norcommonsenseqa | norcommonsenseqa_nb / norcommonsenseqa_nn | ✅ / ❌ | Commonsense reasoning | mimir-project/norcommonsenseqa |
nortruthfulqa_mc | nortruthfulqa_mc_nb / nortruthfulqa_mc_nn | ✅ / ❌ | Fairness & truthfulness | mimir-project/nortruthfulqa_mc |
Ranking sentence pairs
Task name (default) | Bokmål / Nynorsk (N prompts) | 0-shot / few-shot | Task category | HuggingFace |
---|---|---|---|---|
mimir_bias | ❌ / ❌ | ✅ / ❌ | Fairness & truthfulness | mimir-project/mimir-bias |
ncb | ❌ / ❌ | ✅ / ❌ | Norwegian language: grammar, punctuation, and idioms | hcfa/ncb |
Text generation
Task name (default) | Bokmål / Nynorsk (N prompts) | 0-shot / few-shot | Task category | HuggingFace |
---|---|---|---|---|
noridiom | noridiom_nb / noridiom_nn | ✅ / ❌ | Norwegian language: grammar, punctuation, and idioms | mimir-project/noridiom |
ask_gec | ask_gec_nb / ❌ | ✅ / ✅ | Norwegian language: grammar, punctuation, and idioms | ltg/ask-gec |
norsumm | norsumm_nb / norsumm_nn | ✅ / ❌ | Text summarization | mimir-project/norsumm |
schibsted_vg | schibsted_vg_nb / ❌ | ✅ / ✅ | Headline generation | Schibsted/vg-front-title |
nortruthfulqa_gen | nortruthfulqa_gen_nb / ❌ | ✅ / ❌ | Fairness & truthfulness | mimir-project/nortruthfulqa_gen |
tatoeba_eng_nno (English → Nynorsk) | ❌ / tatoeba_eng_nno_nn | ✅ / ✅ | Machine translation | Helsinki-NLP/tatoeba_mt |
tatoeba_nno_eng (Nynorsk → English) | ❌ / tatoeba_eng_nno_nn | ✅ / ✅ | Machine translation | Helsinki-NLP/tatoeba_mt |
tatoeba_eng_nob (English → Bokmål) | tatoeba_eng_nob_nb / ❌ | ✅ / ✅ | Machine translation | Helsinki-NLP/tatoeba_mt |
tatoeba_nob_eng (Bokmål → English) | tatoeba_nob_eng_nb / ❌ | ✅ / ✅ | Machine translation | Helsinki-NLP/tatoeba_mt |
tatoeba_nob_nno (Bokmål → Nynorsk) | tatoeba_nob_nno_nb / ❌ | ✅ / ❌ | Machine translation | Helsinki-NLP/tatoeba_mt |
tatoeba_nno_nob (Nynorsk → Bokmål) | ❌ / tatoeba_nno_nob_nn | ✅ / ❌ | Machine translation | Helsinki-NLP/tatoeba_mt |
Please find below the links to the relevant framework documentation:
- the task guide: how the .yaml configuration files are organized.
- the new task guide: how to integrate your task into the framework.
- Install the lm-evaluation-harness framework (version 0.4.2):
pip install --quiet https://github.com/EleutherAI/lm-evaluation-harness/archive/refs/tags/v0.4.2.tar.gz
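To sanity-check the installation (a hedged suggestion; any equivalent check works), one can confirm that the CLI entry point is available:

```bash
# Verify that the lm_eval command-line interface was installed correctly.
lm_eval --help
```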
- Log in to your HuggingFace account. You can get your access token here.
pip install --quiet "huggingface_hub[cli]"
huggingface-cli login --token <YOUR TOKEN>
The original guidelines on the lm-evaluation-harness framework interface can be found here.
The high-level framework usage requires the following arguments:
- `--model_args`: the model type; in our case, it is `pretrained={model_name}`, where `model_name` refers to the model name on HuggingFace (e.g., `pretrained=mimir-project/mimir-7b-books`).
- `--tasks`: the name(s) of the evaluation tasks (e.g., `norcommonsenseqa_nb` or `noropenbookqa_nb,noropenbookqa_nb_use_fact`).
- `--include_path`: a path to the custom configuration files in the `.yaml` format (in our case, `mimir`). This is used to add the mimir tasks to the framework's task registry as available tasks.
- `--log_samples`: saves the model inputs and outputs in a directory specified via the `--output` argument.
- `--output`: a path where the high-level results will be saved. If `--log_samples` is provided, both the model predictions and the results are saved in the specified directory.
- `--write_out`: a complementary function that prints out the format of the prompts and outputs.
- `--show_config`: a complementary function that prints out the configuration file.
- `--batch_size`: the batch size. `"auto"` automatically selects the largest batch size that fits in memory, which speeds up evaluation.
  - NB: depending on the cluster, `"auto"` can still fail with an out-of-memory error. This behavior can be controlled with the `--max_batch_size` and `--batch_size auto:N` arguments, where `N` stands for the number of times to re-select the maximum batch size during evaluation.
- `--num_fewshot`: the number of demonstrations used in the model input.
- `--limit`: selects the first N examples and runs the evaluation on this subset. Can be used for debugging or testing purposes.
- `--predict_only`: does not compute the performance metrics and only saves the predictions. Should be used together with `--log_samples`.
In general, one needs to specify the following high-level arguments to conduct an evaluation run:
- `--tasks` (the task name can be found in the tables above).
- `--model_args` (any model on HuggingFace).
- `--batch_size` (`"auto"`; one can test the largest batch size using the complementary arguments mentioned above, such as `--limit`, `--max_batch_size`, and `--batch_size auto:N`).
- `--include_path` (always `./mimir/`).
- `--num_fewshot` (the supported k-shot setup for a given task can be found in the tables above).
- Running zero-shot evaluation of the `mimir-project/mimir-7b-books` model on the `norquad` task using the default prompt.
lm_eval \
--model hf \
--model_args pretrained=mimir-project/mimir-7b-books \
--tasks norquad \
--include_path ./mimir/ \
--output mimir_results/norquad/0-shot/mimir-7b-books/ \
--log_samples \
--show_config \
--write_out \
--batch_size auto \
--num_fewshot 0
- Running 1-shot evaluation of the `mimir-project/mimir-7b-books` model on the `norquad_nb` task, which involves testing the model on a set of 5 Norwegian Bokmål prompts.
lm_eval \
--model hf \
--model_args pretrained=mimir-project/mimir-7b-books \
--tasks norquad_nb \
--include_path ./mimir/ \
--output mimir_results/norquad/1-shot/mimir-7b-books/ \
--log_samples \
--show_config \
--write_out \
--batch_size auto \
--num_fewshot 1
- Running 0-shot evaluation of the `mimir-project/mimir-7b-books` model on the `ask_gec` task, which requires computation of the performance metric using a separate script. Here, we use the `--predict_only` argument and compute the performance metrics as described in the next subsection.
lm_eval \
--model hf \
--model_args pretrained=mimir-project/mimir-7b-books \
--tasks ask_gec \
--include_path ./mimir/ \
--output mimir_results/ask_gec/0-shot/mimir-7b-books/ \
--log_samples \
--show_config \
--write_out \
--predict_only \
--batch_size auto \
--num_fewshot 0
`lm-evaluation-harness` supports `accelerate` to speed up the evaluation. Please refer to the framework documentation for usage examples.
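For example, a data-parallel run can be launched as sketched below, assuming the standard `accelerate launch -m lm_eval` invocation from the framework documentation (one model copy per process; `--num_processes` is set to a hypothetical four GPUs):

```bash
# Sketch: shard the evaluation data across four processes, each holding
# a full copy of the model, and aggregate the results at the end.
accelerate launch --num_processes 4 -m lm_eval \
    --model hf \
    --model_args pretrained=mimir-project/mimir-7b-books \
    --tasks norquad \
    --include_path ./mimir/ \
    --batch_size auto \
    --num_fewshot 0
```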
BERTScore
(used in the text summarization, machine translation, and headline generation tasks)
- Unfortunately, there is an unresolved bug related to the calculation of the BERTScore metric. The current version of the mimir evaluation configuration files follows the proposed workaround; however, it slows down the evaluation, since the `bertscore` metric is loaded while evaluating each batch or prediction-reference pair.
- Solution:
  - One can discard computation of the `BERTScore` metric from the `.yaml` configuration file and use the `--log_samples` argument when conducting the evaluation run.
  - Then, one can use `mimir/bertscore.py` to compute the metric score for a given file with the saved predictions, which significantly reduces the computation costs.
  - Example:
python3 mimir/bertscore.py --fpath mimir_results/schibsted_vg/0-shot/mimir-7b-books/predictions.jsonl --out_fdir mimir_results/schibsted_vg/0-shot/mimir-7b-books/ --task_name schibsted_vg_nb --batch_size 128
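For reference, a hedged sketch of the preceding prediction-only run for the same task (the first step of the workaround above), assuming the output layout used throughout the examples:

```bash
# Sketch: save predictions for schibsted_vg_nb without computing BERTScore,
# so that mimir/bertscore.py can be applied to the logged outputs afterwards.
lm_eval \
    --model hf \
    --model_args pretrained=mimir-project/mimir-7b-books \
    --tasks schibsted_vg_nb \
    --include_path ./mimir/ \
    --output mimir_results/schibsted_vg/0-shot/mimir-7b-books/ \
    --log_samples \
    --predict_only \
    --batch_size auto \
    --num_fewshot 0
```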
ERRANT
(used in the grammar error correction task)
- This metric is calculated using a separate evaluation script, which can be found in `mimir/ask_gec/errant.py`.
- Please refer to the installation instructions here.
- Example:
python3 ask_gec/errant.py --fpath mimir_results/ask_gec/0-shot/mimir-7b-books/predictions.jsonl --out_fdir mimir_results/ask_gec/0-shot/mimir-7b-books/