👩‍⚕️ NorMedQA: Norwegian Medical Question Answering Benchmark & Dataset

This repository provides the configuration for running the NorMedQA benchmark using the lm-evaluation-harness.

Benchmark Description

NorMedQA is designed to evaluate the medical knowledge and reasoning capabilities of large language models (LLMs) in Norwegian context (Bokmål and Nynorsk).

The benchmark consists of 1313 question-and-answer pairs covering various medical fields. For questions that were originally multiple choice, the incorrect options are also included (in wrong_answers_text, separated by ;), which allows multiple-choice testing, although this is not implemented in the provided example. The data was collected from publicly available sources containing Norwegian medical examination questions and was subsequently checked, cleaned, and preprocessed. The repository also contains an extended version that includes questions requiring interpretation of an image or ECG (1401 question-and-answer pairs; the multimodal data will be added at a later stage).
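As a sketch of how the wrong answers could feed a multiple-choice setup, the snippet below combines a correct answer with the ;-separated wrong_answers_text field into a shuffled option list. The field names and example values are assumptions for illustration; check the actual dataset columns before use.

```python
# Hypothetical sketch: turning a NorMedQA row into multiple-choice options.
# Field names (answer_text, wrong_answers_text) are assumptions based on the
# README description, not a confirmed schema.
import random

def build_options(answer_text, wrong_answers_text, seed=0):
    """Combine the correct answer with the ';'-separated wrong answers."""
    wrong = [w.strip() for w in wrong_answers_text.split(";") if w.strip()]
    options = wrong + [answer_text]
    random.Random(seed).shuffle(options)  # deterministic shuffle for reproducibility
    return options, options.index(answer_text)

# Example row (invented values):
options, correct_idx = build_options(
    "Paracetamol", "Ibuprofen; Acetylsalisylsyre; Morfin"
)
```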

Dataset

The dataset is hosted separately on Zenodo and can be downloaded either from Zenodo or from GitHub.

Note: The repository includes code to split the original data file into train/test sets.
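The repository ships its own splitting code; as a minimal sketch of the idea, an 80/20 shuffled split could look like the following. The ratio, seed, and function name here are assumptions, not the repository's actual implementation.

```python
# Minimal sketch of a reproducible train/test split (assumed 80/20 ratio;
# the repository's own splitting code may differ).
import random

def split_dataset(rows, test_fraction=0.2, seed=42):
    """Shuffle rows deterministically and split off a held-out test set."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_fraction))
    return rows[:cut], rows[cut:]

# With 1313 QA pairs this yields 1050 train and 263 test examples.
train, test = split_dataset(range(1313))
```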

Usage

You can run the benchmark evaluation using the provided Colab notebook or by setting up the environment locally (see the notebook and the Python file). This example benchmark focuses only on a single correct answer; it could be extended to multiple-choice testing in the lm-evaluation-harness using the wrong answers provided.

  • Colab Notebook: Open In Colab

The notebook includes steps for installation, data splitting, and running the evaluation command (e.g., for google/gemma-3-1b-it).

Metrics

The example benchmark setup reports the following metrics (these can be extended based on intended use):

  • exact_match: Percentage of generated answers that exactly match the reference answer.
  • rouge (rouge1, rouge2, rougeL, rougeLsum): Measures overlap between generated and reference answers based on n-grams and longest common subsequences.
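To make the exact_match metric concrete, here is an illustrative computation. Actual lm-evaluation-harness runs use its built-in implementation; the normalization applied here (strip plus lowercase) is an assumption for the sketch.

```python
# Illustrative exact_match: fraction of predictions that match their reference
# after a simple strip + lowercase normalization (an assumption; the harness
# has its own normalization options).
def exact_match(predictions, references):
    hits = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return hits / len(references)

score = exact_match(["Paracetamol", "morfin"], ["Paracetamol", "Ibuprofen"])  # 0.5
```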

License

CC BY 4.0

Citation

If you use this benchmark configuration or the associated dataset, please cite:

@dataset{riegler_michael_alexander_2025_15320038,
  author       = {Riegler, Michael Alexander and Gautam, Sushant},
  title        = {{Norwegian Medical Question Answering Dataset - NorMedQA}},
  month        = may,
  year         = 2025,
  publisher    = {Zenodo},
  version      = {1.0},
  doi          = {10.5281/zenodo.15320038},
  url          = {https://doi.org/10.5281/zenodo.15320037}
}
