🇳🇴 NorEval

NorEval is a multi-task Norwegian language understanding and generation evaluation benchmark.

😎 Overview

*Figure: Overview of the NorEval design. 😼 denotes datasets used in NorBench, NLEBench, ScandEval, and SEB; 🚀 denotes datasets that have not been used in the existing Norwegian benchmarks; and 😎 denotes our novel datasets introduced as part of NorEval. EN = English; BM = Norwegian Bokmål; NN = Norwegian Nynorsk.*

🇳🇴 NorEval combines 19 existing peer-reviewed datasets with five datasets created from scratch (NCB, NorRewrite-Instruct, NorSummarize-Instruct for Norwegian Bokmål, and NorIdiom for Norwegian Bokmål and Nynorsk). NorEval covers nine diverse task categories, including sentiment analysis, Norwegian language knowledge, Norwegian-specific & world knowledge, machine reading comprehension, commonsense reasoning, machine translation, text summarization, instruction following, and truthfulness. Our main evaluation principles are:

  • 🌐 Linguistic diversity: support for both of the official written standards of Norwegian: Bokmål and Nynorsk (the minority variant).
  • 📊 Task diversity: coverage of tasks that have received little attention for Norwegian. In particular, only three of the 24 NorEval datasets appear in the existing Norwegian benchmarks to date (NorBench, NLEBench, ScandEval, and SEB).
  • 🧠 Data quality: focus on only peer-reviewed human-created datasets to ensure reliable evaluation in the context of the Norwegian language, culture, and values.
  • 📏 Prompt sensitivity: evaluation across 100+ human-written prompts to account for prompt sensitivity.
  • 👩🏻‍🔬 Standardized evaluation: integration of NorEval into LM Evaluation Harness for flexible and reproducible evaluation.

🗃️ Tasks

We group our datasets into text classification, sentence ranking, sentence completion, multiple-choice question answering, generative question answering, and sequence-to-sequence generation tasks. We refer the reader to our paper for more details and describe our tasks below.
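These task types imply different scoring protocols: multiple-choice tasks are typically scored by comparing the model's log-likelihoods of the candidate answers, while generative tasks compare generated text against references. Below is a minimal sketch of the two standard multiple-choice scoring rules (raw and byte-length-normalized log-likelihood, the latter corresponding to the harness's `acc_norm`); this is an illustration, not the harness's actual implementation.

```python
def pick_choice(loglikelihoods):
    """Index of the answer option the model assigns the highest
    log-likelihood: the usual multiple-choice scoring rule (acc)."""
    return max(range(len(loglikelihoods)), key=loglikelihoods.__getitem__)

def pick_choice_norm(loglikelihoods, options):
    """Byte-length-normalized variant (acc_norm): dividing by the
    option's byte length avoids penalizing longer answer strings."""
    normed = [ll / len(opt.encode("utf-8"))
              for ll, opt in zip(loglikelihoods, options)]
    return pick_choice(normed)

# Toy Norwegian answer options with made-up log-likelihoods.
options = ["ja", "nei, absolutt ikke", "kanskje"]
lls = [-9.0, -12.0, -10.0]
raw_pick = pick_choice(lls)                    # favors the short option
norm_pick = pick_choice_norm(lls, options)     # normalization flips the choice
```

Note how normalization can change the selected answer: the longest option has the worst raw log-likelihood but the best per-byte score.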

| Name | Bokmål | Nynorsk | k-shot | Task type | Task category |
|---|---|---|---|---|---|
| NoReC Sentence | norec_sentence | | | Text classification | Sentiment analysis |
| NoReC Document | norec_document | | | Text classification | Sentiment analysis |
| NCB | ncb | | | Sentence ranking | Norwegian language knowledge |
| NorIdiom | noridiom_nob | noridiom_nno | | Sentence completion | Norwegian language knowledge |
| Belebele | norbelebele | | | Multiple-choice question answering | Machine reading comprehension |
| NRK-Quiz-QA | nrk_quiz_qa_nob | nrk_quiz_qa_nno | | Multiple-choice question answering | Norwegian-specific & world knowledge |
| NorOpenBookQA | noropenbookqa_nob | noropenbookqa_nno | | Multiple-choice question answering | Norwegian-specific & world knowledge |
| NorCommonsenseQA | norcommonsenseqa_nob | norcommonsenseqa_nno | | Multiple-choice question answering | Commonsense reasoning |
| NorTruthfulQA Multiple choice | nortruthfulqa_mc_nob | nortruthfulqa_mc_nno | | Multiple-choice question answering | Truthfulness |
| NorQuAD | norquad | | | Generative question answering | Machine reading comprehension |
| NorTruthfulQA Generation | nortruthfulqa_gen_nob | nortruthfulqa_gen_nno | | Generative question answering | Truthfulness |
| ASK-GEC | ask_gec | | | Sequence-to-sequence generation | Norwegian language knowledge |
| NorSumm | norsumm_nob | norsumm_nno | | Sequence-to-sequence generation | Text summarization |
| Tatoeba (English → Bokmål/Nynorsk) | tatoeba_eng_nob | tatoeba_eng_nno | | Sequence-to-sequence generation | Machine translation |
| Tatoeba (Bokmål/Nynorsk → English) | tatoeba_nob_eng | tatoeba_nno_eng | | Sequence-to-sequence generation | Machine translation |
| NorRewrite-Instruct | norrewrite_instruct | | | Sequence-to-sequence generation | Instruction following |
| NorSummarize-Instruct | norsummarize_instruct | | | Sequence-to-sequence generation | Instruction following |
Table description
  • Name: a dataset name with a HuggingFace link.
  • Bokmål: the LM Evaluation Harness task name for the Norwegian Bokmål dataset.
  • Nynorsk: the LM Evaluation Harness task name for the Norwegian Nynorsk dataset, if available.
  • k-shot: support for k-shot evaluation regimes with k > 0. We follow the original datasets' design and focus mainly on zero-shot evaluation by default.
    • ✅ means that the evaluation can be run in both zero-shot and k-shot regimes.
    • ❌ means that only zero-shot evaluation is available due to the lack of a training or validation set to sample demonstration examples from. Technically, k-shot evaluation on the test set is possible using sampling without replacement, given that the model is not proprietary and not accessed via an API.
  • Task type: the task type.
  • Task category: the task category.
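The sampling-without-replacement workaround mentioned above can be sketched as follows: for each test example, draw k demonstrations from the remaining test examples with a fixed seed. This is a minimal illustration, not part of the NorEval or harness codebase; the field names are made up.

```python
import random

def sample_demonstrations(pool, current_idx, k, seed=1234):
    """Draw k demonstration examples from `pool` without replacement,
    excluding the example currently being evaluated. Needed when the
    demonstrations must come from the test set itself."""
    candidates = [ex for i, ex in enumerate(pool) if i != current_idx]
    rng = random.Random(seed)  # fixed seed for reproducibility
    return rng.sample(candidates, k)

# Toy test set of ten examples.
test_set = [{"question": f"Q{i}"} for i in range(10)]
demos = sample_demonstrations(test_set, current_idx=0, k=3)
```

The key property is that the example under evaluation never appears among its own demonstrations.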

👨🏻‍💻 Installation and Usage

Install LM Evaluation Harness as described here.

git clone --depth 1 https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e .

Examples

Detailed guidelines on how to use LM Evaluation Harness can be found here.

Example 1: Zero-shot evaluation on NorQuAD across five prompts.
lm_eval \
  --model hf \
  --model_args pretrained=norallm/normistral-7b-warm \
  --tasks norquad \
  --output results/norquad/0-shot/ \
  --log_samples \
  --show_config \
  --write_out \
  --batch_size auto \
  --num_fewshot 0
Example 2: One-shot evaluation on NorQuAD across five prompts.
lm_eval \
  --model hf \
  --model_args pretrained=norallm/normistral-7b-warm \
  --tasks norquad \
  --output results/norquad/1-shot/ \
  --log_samples \
  --show_config \
  --write_out \
  --batch_size auto \
  --num_fewshot 1
Example 3: Zero-shot evaluation on NorQuAD using one prompt of interest.

All prompts are numbered from 0 to 6, and the corresponding configuration files for all supported prompts can be found in the task directories.
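To sweep over all prompt variants of a task, one can generate the per-prompt task names programmatically and pass them as a comma-separated list to `--tasks`. A small helper (hypothetical, following the `<task>_p<i>` naming scheme shown above):

```python
def prompt_task_names(task, n_prompts=7):
    """Per-prompt task names p0..p{n-1}, following the noreval
    naming scheme (e.g. norquad_p0 ... norquad_p6)."""
    return [f"{task}_p{i}" for i in range(n_prompts)]

# Build the value for a --tasks argument covering every prompt.
tasks_arg = ",".join(prompt_task_names("norquad"))
```

Check the task directory for the actual number of prompts per task before relying on a fixed `n_prompts`.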

lm_eval \
  --model hf \
  --model_args pretrained=norallm/normistral-7b-warm \
  --tasks norquad_p0 \
  --output results/norquad_p0/0-shot/ \
  --log_samples \
  --show_config \
  --write_out \
  --batch_size auto \
  --num_fewshot 0
Example 4: Zero-shot evaluation on task groups.

Consider an example of conducting an evaluation on a task category of interest, e.g., Norwegian-specific & world knowledge. LM Evaluation Harness allows grouping tasks as shown below; more details can be found here.

Step 1: Create a configuration file

Create a configuration file containing the name of the group and corresponding tasks and save it in the lm_eval/tasks/noreval folder.

group: norwegian_specific_and_world_knowledge_tasks_nob
task:
  - nrk_quiz_qa_nob
  - noropenbookqa_nob
aggregate_metric_list:
  - metric: acc
    weight_by_size: True
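With `weight_by_size: True`, the group metric is the per-task accuracy averaged with each task weighted by its number of examples, i.e. a micro-average rather than a plain mean. A minimal sketch of that aggregation (toy numbers, not real scores):

```python
def weighted_accuracy(results):
    """Size-weighted average of per-task accuracies: each task
    contributes in proportion to its number of examples."""
    total = sum(size for _, size in results.values())
    return sum(acc * size for acc, size in results.values()) / total

# Hypothetical per-task (accuracy, example count) pairs.
group = {
    "nrk_quiz_qa_nob": (0.60, 2000),
    "noropenbookqa_nob": (0.80, 500),
}
score = weighted_accuracy(group)
```

With these toy numbers the larger task dominates, pulling the group score toward 0.60 rather than the unweighted mean of 0.70.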

Step 2: Run the evaluation

Here, we pass the name of our group via --tasks and use the --include_path argument to ensure the group configuration is registered:

lm_eval \
  --model hf \
  --model_args pretrained=norallm/normistral-7b-warm \
  --tasks norwegian_specific_and_world_knowledge_tasks_nob \
  --include_path ./lm_eval/tasks/noreval/ \
  --output results/norwegian_specific_and_world_knowledge_tasks_nob/0-shot/ \
  --log_samples \
  --show_config \
  --write_out \
  --batch_size auto \
  --num_fewshot 0
Example 5: Zero-shot evaluation on ASK-GEC, which requires computation of the performance metric using a separate script.

Here, we use the --predict_only argument and compute the performance metrics as described below.

Step 1: Generate the predictions

lm_eval \
  --model hf \
  --model_args pretrained=AI-Sweden-Models/Llama-3-8B \
  --tasks ask_gec \
  --output results/ask_gec/0-shot/ \
  --log_samples \
  --show_config \
  --write_out \
  --predict_only \
  --batch_size auto \
  --num_fewshot 0

Step 2: Evaluate the predictions with ERRANT

  • Please refer to the installation instructions here.
  • Run the following:
    python3 ask_gec/errant.py --fpath results/ask_gec/0-shot/AI-Sweden-Models__Llama-3-8B/samples_ask_gec_p0_2025-01-28T01-08-13.454441.jsonl --out_fdir results/ask_gec/0-shot/AI-Sweden-Models__Llama-3-8B/
  • The results will be saved as results/ask_gec/0-shot/AI-Sweden-Models__Llama-3-8B/samples_ask_gec_p0_2025-01-28T01-08-13.454441_errant.json
Comment: BERTScore.

In our paper, for efficiency, we compute BERTScore for most sequence-to-sequence generation tasks offline, after running the evaluation with the --predict_only argument.
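The offline workflow starts from the harness's samples_*.jsonl logs. A hedged sketch of extracting (prediction, reference) pairs before scoring — the `resps`/`target` field names are assumptions based on common harness log layouts, so inspect your own samples file first:

```python
import json

def load_pairs(jsonl_lines):
    """Extract (prediction, reference) pairs from harness sample logs.
    Field names are assumptions; check your samples_*.jsonl layout."""
    pairs = []
    for line in jsonl_lines:
        rec = json.loads(line)
        # `resps` is typically a nested list of model responses;
        # take the first response for the first generation call.
        pairs.append((rec["resps"][0][0], rec["target"]))
    return pairs

# Toy one-record log for illustration.
lines = ['{"resps": [["en katt satt"]], "target": "en katt satt der"}']
pairs = load_pairs(lines)
```

The resulting prediction and reference lists can then be passed to a BERTScore implementation in a separate scoring step.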

📝 Cite Us

@article{mikhailov2025noreval,
  title={NorEval: A Norwegian Language Understanding and Generation Evaluation Benchmark},
  author={Mikhailov, Vladislav and Enstad, Tita and Samuel, David and Farseth{\aa}s, Hans Christian and Kutuzov, Andrey and Velldal, Erik and {\O}vrelid, Lilja},
  journal={arXiv preprint arXiv:2504.07749},
  year={2025}
}
