🇳🇴 NorEval

NorEval is a multi-task Norwegian language understanding and generation evaluation benchmark.

😎 Overview

*Figure: Overview of the NorEval design. 😼 denotes datasets used in NorBench, NLEBench, ScandEval, and SEB; 🚀 denotes datasets that have not been used in the existing Norwegian benchmarks; and 😎 denotes our novel datasets introduced as part of NorEval. EN = English; BM = Norwegian Bokmål; NN = Norwegian Nynorsk.*

🇳🇴 NorEval combines 19 existing peer-reviewed datasets with five datasets created from scratch (NCB, NorRewrite-Instruct, NorSummarize-Instruct for Norwegian Bokmål, and NorIdiom for Norwegian Bokmål and Nynorsk). NorEval covers nine diverse task categories, including sentiment analysis, Norwegian language knowledge, Norwegian-specific & world knowledge, machine reading comprehension, commonsense reasoning, machine translation, text summarization, instruction following, and truthfulness. Our main evaluation principles are:

  • 🌐 Linguistic diversity: support for both of the official written standards of Norwegian: Bokmål and Nynorsk (the minority variant).
  • 📊 Task diversity: coverage of tasks that have received little attention for Norwegian. In particular, only three of the 24 NorEval datasets appear in the existing Norwegian benchmarks to date (NorBench, NLEBench, ScandEval, and SEB).
  • 🧠 Data quality: focus on only peer-reviewed human-created datasets to ensure reliable evaluation in the context of the Norwegian language, culture, and values.
  • 📏 Prompt sensitivity: evaluation across 100+ human-written prompts to account for prompt sensitivity.
  • 👩🏻‍🔬 Standardized evaluation: integration of NorEval into LM Evaluation Harness for flexible and reproducible evaluation.

🗃️ Tasks

We group our datasets into text classification, sentence ranking, sentence completion, multiple-choice question answering, generative question answering, and sequence-to-sequence generation tasks. We refer the reader to our paper for more details and describe our tasks below.
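These task types imply different scoring protocols: multiple-choice tasks are typically scored by comparing the model's log-likelihoods of the candidate answers, while generative tasks compare generated text against references. Below is a minimal sketch of the two standard multiple-choice scoring rules (raw and byte-length-normalized log-likelihood, the latter corresponding to the harness's `acc_norm`); this is an illustration, not the harness's actual implementation.

```python
def pick_choice(loglikelihoods):
    """Index of the answer option the model assigns the highest
    log-likelihood: the usual multiple-choice scoring rule (acc)."""
    return max(range(len(loglikelihoods)), key=loglikelihoods.__getitem__)

def pick_choice_norm(loglikelihoods, options):
    """Byte-length-normalized variant (acc_norm): dividing by the
    option's byte length avoids penalizing longer answer strings."""
    normed = [ll / len(opt.encode("utf-8"))
              for ll, opt in zip(loglikelihoods, options)]
    return pick_choice(normed)

# Toy Norwegian answer options with made-up log-likelihoods.
options = ["ja", "nei, absolutt ikke", "kanskje"]
lls = [-9.0, -12.0, -10.0]
raw_pick = pick_choice(lls)                    # favors the short option
norm_pick = pick_choice_norm(lls, options)     # normalization flips the choice
```

Note how normalization can change the selected answer: the longest option has the worst raw log-likelihood but the best per-byte score.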

| Name | Bokmål | Nynorsk | k-shot | Task type | Task category |
|---|---|---|---|---|---|
| NoReC Sentence | norec_sentence | | | Text classification | Sentiment analysis |
| NoReC Document | norec_document | | | Text classification | Sentiment analysis |
| NCB | ncb | | | Sentence ranking | Norwegian language knowledge |
| NorIdiom | noridiom_nob | noridiom_nno | | Sentence completion | Norwegian language knowledge |
| Belebele | norbelebele | | | Multiple-choice question answering | Machine reading comprehension |
| NRK-Quiz-QA | nrk_quiz_qa_nob | nrk_quiz_qa_nno | | Multiple-choice question answering | Norwegian-specific & world knowledge |
| NorOpenBookQA | noropenbookqa_nob | noropenbookqa_nno | | Multiple-choice question answering | Norwegian-specific & world knowledge |
| NorCommonsenseQA | norcommonsenseqa_nob | norcommonsenseqa_nno | | Multiple-choice question answering | Commonsense reasoning |
| NorTruthfulQA Multiple choice | nortruthfulqa_mc_nob | nortruthfulqa_mc_nno | | Multiple-choice question answering | Truthfulness |
| NorQuAD | norquad | | | Generative question answering | Machine reading comprehension |
| NorTruthfulQA Generation | nortruthfulqa_gen_nob | nortruthfulqa_gen_nno | | Generative question answering | Truthfulness |
| ASK-GEC | ask_gec | | | Sequence-to-sequence generation | Norwegian language knowledge |
| NorSumm | norsumm_nob | norsumm_nno | | Sequence-to-sequence generation | Text summarization |
| Tatoeba (English → Bokmål/Nynorsk) | tatoeba_eng_nob | tatoeba_eng_nno | | Sequence-to-sequence generation | Machine translation |
| Tatoeba (Bokmål/Nynorsk → English) | tatoeba_nob_eng | tatoeba_nno_eng | | Sequence-to-sequence generation | Machine translation |
| NorRewrite-Instruct | norrewrite_instruct | | | Sequence-to-sequence generation | Instruction following |
| NorSummarize-Instruct | norsummarize_instruct | | | Sequence-to-sequence generation | Instruction following |
Table description
  • Name: a dataset name with a HuggingFace link.
  • Bokmål: the LM Evaluation Harness task name for the Norwegian Bokmål dataset.
  • Nynorsk: the LM Evaluation Harness task name for the Norwegian Nynorsk dataset, if available.
  • k-shot: support for k-shot evaluation regimes with k > 0. We follow the original datasets' design and focus mainly on zero-shot evaluation by default.
    • ✅ means that the evaluation can be run in both zero-shot and k-shot regimes.
    • ❌ means that only zero-shot evaluation is available due to the lack of a training or validation set to sample demonstration examples from. Technically, k-shot evaluation on the test set is possible using sampling without replacement, given that the model is not proprietary and not accessed via an API.
  • Task type: the task type.
  • Task category: the task category.
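The sampling-without-replacement workaround mentioned above can be sketched as follows: for each test example, draw k demonstrations from the remaining test examples with a fixed seed. This is a minimal illustration, not part of the NorEval or harness codebase; the field names are made up.

```python
import random

def sample_demonstrations(pool, current_idx, k, seed=1234):
    """Draw k demonstration examples from `pool` without replacement,
    excluding the example currently being evaluated. Needed when the
    demonstrations must come from the test set itself."""
    candidates = [ex for i, ex in enumerate(pool) if i != current_idx]
    rng = random.Random(seed)  # fixed seed for reproducibility
    return rng.sample(candidates, k)

# Toy test set of ten examples.
test_set = [{"question": f"Q{i}"} for i in range(10)]
demos = sample_demonstrations(test_set, current_idx=0, k=3)
```

The key property is that the example under evaluation never appears among its own demonstrations.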

👨🏻‍💻 Installation and Usage

Install LM Evaluation Harness as described here.

git clone --depth 1 https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e .

Examples

Detailed guidelines on how to use LM Evaluation Harness can be found here.

Example 1: Zero-shot evaluation on NorQuAD across five prompts.
lm_eval \
  --model hf \
  --model_args pretrained=norallm/normistral-7b-warm \
  --tasks norquad \
  --output results/norquad/0-shot/ \
  --log_samples \
  --show_config \
  --write_out \
  --batch_size auto \
  --num_fewshot 0
Example 2: One-shot evaluation on NorQuAD across five prompts.
lm_eval \
  --model hf \
  --model_args pretrained=norallm/normistral-7b-warm \
  --tasks norquad \
  --output results/norquad/1-shot/ \
  --log_samples \
  --show_config \
  --write_out \
  --batch_size auto \
  --num_fewshot 1
Example 3: Zero-shot evaluation on NorQuAD using one prompt of interest.

All prompts are numbered from 0 to 6, and the corresponding configuration files for all supported prompts can be found in the task directories.
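To sweep over all prompt variants of a task, one can generate the per-prompt task names programmatically and pass them as a comma-separated list to `--tasks`. A small helper (hypothetical, following the `<task>_p<i>` naming scheme shown above):

```python
def prompt_task_names(task, n_prompts=7):
    """Per-prompt task names p0..p{n-1}, following the noreval
    naming scheme (e.g. norquad_p0 ... norquad_p6)."""
    return [f"{task}_p{i}" for i in range(n_prompts)]

# Build the value for a --tasks argument covering every prompt.
tasks_arg = ",".join(prompt_task_names("norquad"))
```

Check the task directory for the actual number of prompts per task before relying on a fixed `n_prompts`.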

lm_eval \
  --model hf \
  --model_args pretrained=norallm/normistral-7b-warm \
  --tasks norquad_p0 \
  --output results/norquad_p0/0-shot/ \
  --log_samples \
  --show_config \
  --write_out \
  --batch_size auto \
  --num_fewshot 0
Example 4: Zero-shot evaluation on task groups.

Consider an example of conducting an evaluation on a task category of interest, e.g., Norwegian-specific & world knowledge. LM Evaluation Harness allows grouping tasks as shown below; more details can be found here.

Step 1: Create a configuration file

Create a configuration file containing the name of the group and corresponding tasks and save it in the lm_eval/tasks/noreval folder.

group: norwegian_specific_and_world_knowledge_tasks_nob
task:
  - nrk_quiz_qa_nob
  - noropenbookqa_nob
aggregate_metric_list:
  - metric: acc
    weight_by_size: True
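With `weight_by_size: True`, the group metric is the per-task accuracy averaged with each task weighted by its number of examples, i.e. a micro-average rather than a plain mean. A minimal sketch of that aggregation (toy numbers, not real scores):

```python
def weighted_accuracy(results):
    """Size-weighted average of per-task accuracies: each task
    contributes in proportion to its number of examples."""
    total = sum(size for _, size in results.values())
    return sum(acc * size for acc, size in results.values()) / total

# Hypothetical per-task (accuracy, example count) pairs.
group = {
    "nrk_quiz_qa_nob": (0.60, 2000),
    "noropenbookqa_nob": (0.80, 500),
}
score = weighted_accuracy(group)
```

With these toy numbers the larger task dominates, pulling the group score toward 0.60 rather than the unweighted mean of 0.70.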

Step 2: Run the evaluation

Here, we pass the name of our group via --tasks and use the --include_path argument to ensure the group configuration is registered:

lm_eval \
  --model hf \
  --model_args pretrained=norallm/normistral-7b-warm \
  --tasks norwegian_specific_and_world_knowledge_tasks_nob \
  --include_path ./lm_eval/tasks/noreval/ \
  --output results/norwegian_specific_and_world_knowledge_tasks_nob/0-shot/ \
  --log_samples \
  --show_config \
  --write_out \
  --batch_size auto \
  --num_fewshot 0
Example 5: Zero-shot evaluation on ASK-GEC, which requires computation of the performance metric using a separate script.

Here, we use the --predict_only argument and compute the performance metrics as described below.

Step 1: Generate the predictions

lm_eval \
  --model hf \
  --model_args pretrained=AI-Sweden-Models/Llama-3-8B \
  --tasks ask_gec \
  --output results/ask_gec/0-shot/ \
  --log_samples \
  --show_config \
  --write_out \
  --predict_only \
  --batch_size auto \
  --num_fewshot 0

Step 2: Evaluate the predictions with ERRANT

  • Please refer to the installation instructions here.
  • Run the following:
    python3 ask_gec/errant.py --fpath results/ask_gec/0-shot/AI-Sweden-Models__Llama-3-8B/samples_ask_gec_p0_2025-01-28T01-08-13.454441.jsonl --out_fdir results/ask_gec/0-shot/AI-Sweden-Models__Llama-3-8B/
  • The results will be saved as results/ask_gec/0-shot/AI-Sweden-Models__Llama-3-8B/samples_ask_gec_p0_2025-01-28T01-08-13.454441_errant.json
Comment: BERTScore.

In our paper, for efficiency, we compute BERTScore for most sequence-to-sequence generation tasks offline, after running the evaluation with the --predict_only argument.
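The offline workflow starts from the harness's samples_*.jsonl logs. A hedged sketch of extracting (prediction, reference) pairs before scoring — the `resps`/`target` field names are assumptions based on common harness log layouts, so inspect your own samples file first:

```python
import json

def load_pairs(jsonl_lines):
    """Extract (prediction, reference) pairs from harness sample logs.
    Field names are assumptions; check your samples_*.jsonl layout."""
    pairs = []
    for line in jsonl_lines:
        rec = json.loads(line)
        # `resps` is typically a nested list of model responses;
        # take the first response for the first generation call.
        pairs.append((rec["resps"][0][0], rec["target"]))
    return pairs

# Toy one-record log for illustration.
lines = ['{"resps": [["en katt satt"]], "target": "en katt satt der"}']
pairs = load_pairs(lines)
```

The resulting prediction and reference lists can then be passed to a BERTScore implementation in a separate scoring step.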

📝 Cite Us

@article{mikhailov2025noreval,
  title={NorEval: A Norwegian Language Understanding and Generation Evaluation Benchmark},
  author={Mikhailov, Vladislav and Enstad, Tita and Samuel, David and Farseth{\aa}s, Hans Christian and Kutuzov, Andrey and Velldal, Erik and {\O}vrelid, Lilja},
  journal={arXiv preprint arXiv:2504.07749},
  year={2025}
}
