NLU and NLG datasets developed within the Latvian Language Technology Initiative
- ALPACA-LV is a machine-translated Alpaca instruction dataset for Latvian.
- COPA is a machine-translated COPA benchmark dataset for Latvian.
- MMLU is a machine-translated MMLU benchmark dataset for Latvian. The sociology_postedited.json file contains a post-edited collection of the first 100 tasks in the sociology subject.
- Multiple-choice questions (MCQ) from Latvian Centralized High School Exams.
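To get a first look at one of the JSON files listed above, such as sociology_postedited.json, something like the following sketch may help. It assumes only that the file is valid UTF-8 JSON. The schema is not documented here, so the hypothetical helpers `load_tasks` and `summarize` report the top-level structure instead of relying on particular field names, and the file path is an assumption to adjust for your copy.

```python
import json
from pathlib import Path


def load_tasks(path):
    """Load a JSON dataset file and return its top-level object."""
    with Path(path).open(encoding="utf-8") as f:
        return json.load(f)


def summarize(data):
    """Describe the top-level structure without assuming a schema."""
    if isinstance(data, list):
        return f"list with {len(data)} items"
    if isinstance(data, dict):
        return f"dict with keys: {sorted(data)[:10]}"
    return type(data).__name__


if __name__ == "__main__":
    # Assumed location; change this to where the file sits in your copy.
    path = Path("sociology_postedited.json")
    if path.exists():
        print(summarize(load_tasks(path)))
    else:
        print(f"{path} not found; pass the correct path to load_tasks()")
```

Once the actual structure is known, `summarize` can be replaced with code that reads the specific fields of each task.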
If you find this useful in your research, please consider citing:
@inproceedings{dargis-etal-2024-evaluating,
author = "Darģis, Roberts and Bārzdiņš, Guntis and Skadiņa, Inguna and Grūzītis, Normunds and Saulīte, Baiba",
title = "Evaluating Open-Source LLMs in Low-Resource Languages: Insights from Latvian High School Exams",
year = 2024,
booktitle = "Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities",
publisher = "Association for Computational Linguistics",
pages = "289-293",
month = "Nov",
url = "https://aclanthology.org/2024.nlp4dh-1.28.pdf"
}
@inproceedings{Skadina-EtAl:2025,
author = "Skadiņa, Inguna and Bakanovs, Bruno and Darģis, Roberts",
title = "First Steps in Benchmarking Latvian in Large Language Models",
year = 2025,
booktitle = "Proceedings of the Third Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL-2025)",
publisher = "University of Tartu Library",
pages = "86-95",
url = "https://hdl.handle.net/10062/107120"
}