utter-project/ELITR-Bench

ELITR-Bench description

This repository contains the dataset and code associated with ELITR-Bench, a benchmark for the evaluation of long-context LLMs on meeting transcripts. The meeting data used in this benchmark originally comes from the ELITR dataset, which is available here.

For more details on this dataset and the experiments we conducted, have a look at our paper. If you find ELITR-Bench useful and want to refer to our work, please use the following citation:

@article{thonet2024elitrbench,
      title={{ELITR-Bench: A Meeting Assistant Benchmark for Long-Context Language Models}}, 
      author={Thibaut Thonet and Jos Rozen and Laurent Besacier},
      journal={arXiv:2403.20262},
      year={2024},
      url={https://arxiv.org/abs/2403.20262}
}

NEWS: Our paper has been accepted to the COLING 2025 conference and will be presented in Abu Dhabi, UAE from January 19th to 24th, 2025!

Data description

The repository includes two zip archives: data.zip and generated-responses.zip. The archives are password-protected to avoid potential contamination of LLMs trained on web-scraped data. To unzip the archives, run the commands unzip data.zip and unzip generated-responses.zip and enter 'utter' as the password.

After these two unzip commands, two folders will be created: data and generated-responses. The former contains the ELITR-Bench data files with the manually crafted questions, ground-truth answers, and metadata. The latter contains the responses generated on ELITR-Bench by different long-context LLMs, along with their evaluation scores. All files are provided in JSON format.
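As an alternative to the unzip commands, the archives can also be extracted with a short Python sketch like the one below; this assumes the archives use standard ZIP encryption, which is the only scheme supported by Python's zipfile module.

    # Extract both password-protected archives into the current directory,
    # creating the data/ and generated-responses/ folders.
    import zipfile

    for archive in ("data.zip", "generated-responses.zip"):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(pwd=b"utter")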

The JSON files are named based on the following template:

  • elitr-bench-{dataset_version}_{split}.json for the data files;
  • elitr-bench-{dataset_version}_{split}_{inference_mode}_{evaluator}.json for the generated responses;

where:

  • {dataset_version} is either the qa or conv version of ELITR-Bench;
  • {split} is either the dev or test2 split;
  • {inference_mode} is either the st (single-turn) or mt (multi-turn) mode used for inference, i.e., questions are asked either independently or in sequence within the same conversation;
  • {evaluator} is either gpt-4-eval (indicating that the reported scores are obtained from the GPT-4 evaluator) or all-eval (indicating that scores are reported for the 4 different evaluators considered in the "LLM-based evaluation assessment" section of the paper).
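For illustration, file names can be assembled from these fields as follows (the helper functions below are purely illustrative and not part of the repository):

    # Illustrative helpers (not part of the repository) assembling ELITR-Bench
    # file names from the template fields.
    def data_file(dataset_version: str, split: str) -> str:
        return f"elitr-bench-{dataset_version}_{split}.json"

    def responses_file(dataset_version: str, split: str, inference_mode: str, evaluator: str) -> str:
        return f"elitr-bench-{dataset_version}_{split}_{inference_mode}_{evaluator}.json"

    print(data_file("qa", "test2"))                           # elitr-bench-qa_test2.json
    print(responses_file("qa", "test2", "st", "gpt-4-eval"))  # elitr-bench-qa_test2_st_gpt-4-eval.json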

JSON file structure

Each JSON file is structured as follows:

{
  "split": "...",
  "meetings": [
    {
      "id": "...",
      "questions": [
        {
          "id": "...",
          "question-type": "...",
          "answer-position": "...",
          "question": "...",
          "groundtruth-answer": "...",
          "generated-responses": [
            {
              "model": "...",
              "generated-response": "...",
              "gpt-4-eval_score": "..."
              "prometheus-eval_score": "...",
              "gold-human-eval_score": "...",
              "silver-human-eval_score": "..."
            },
            ...
          ]
        },
        ...
      ]
    },
    ...
  ]
}

where:

  • split is either dev or test2;
  • meetings is the list of meetings associated with the split;
  • id (meeting-level) is the name of the corresponding meeting transcript file from the original ELITR dataset;
  • questions is the list of questions associated with the meeting;
  • id (question-level) is a numeric identifier for each question in the meeting;
  • question-type indicates the type of the question, among who, what, howmany and when;
  • answer-position indicates the position of the answer in the transcript, among B (beginning), M (middle), E (end) and S (several passages across the transcript);
  • question contains the actual question;
  • groundtruth-answer contains the manually annotated ground-truth answer to the question;
  • generated-responses is the list of responses generated by different LLMs to answer the question (this list is only present for the JSON files in the generated-responses folder);
  • model indicates the LLM used to generate the answer;
  • generated-response contains the actual generated answer;
  • gpt-4-eval_score, prometheus-eval_score, gold-human-eval_score, silver-human-eval_score indicate the numeric score between 1 and 10 obtained by different evaluators for the generated answer (prometheus-eval_score, gold-human-eval_score, silver-human-eval_score are only reported for the all-eval JSON file).
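To make this structure concrete, here is a minimal Python sketch that loads one generated-responses file (named following the template above; adjust the path to the file you want to inspect) and averages the gpt-4-eval_score per model; it assumes the scores parse as numbers.

    # Minimal sketch: average the GPT-4 evaluation score per model for one
    # generated-responses file (adjust the file name to your needs).
    import json
    from collections import defaultdict

    with open("generated-responses/elitr-bench-qa_test2_st_gpt-4-eval.json") as f:
        data = json.load(f)

    scores = defaultdict(list)
    for meeting in data["meetings"]:
        for question in meeting["questions"]:
            for response in question.get("generated-responses", []):
                # Scores are assumed to parse as numbers between 1 and 10.
                scores[response["model"]].append(float(response["gpt-4-eval_score"]))

    for model, values in sorted(scores.items()):
        print(f"{model}: mean score {sum(values) / len(values):.2f} over {len(values)} questions")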

Czech Version (Update Dec 2024)

We are thrilled to announce the addition of a Czech version to the ELITR-Bench work (czech.zip)! Thanks to the dedicated efforts of Ondřej Bojar from Charles University in Prague, Czech Republic, all questions and reference answers have been translated into Czech. This update enables a cross-lingual version of the meeting assistant task, allowing users to ask questions in Czech about an English meeting transcript and receive answers in Czech. These answers can now be evaluated against Czech reference answers, broadening the task's accessibility and utility. To unzip the archive, run the command unzip czech.zip and enter 'utter' as the password.

How to run the code

Package and version requirements

This code should be run with Python 3.10. The required Python packages, along with their versions, are provided in requirements.txt. Note that running inference with the more recent models (LLaMA-3.1 and Phi-3) requires different package versions, which are specified in requirements-alt.txt. This is noted again where the corresponding scripts are introduced.
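For example, the default environment can be installed with the command below (and similarly with requirements-alt.txt for the more recent models):

    pip install -r requirements.txt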

Preparation

Before running the code, the following steps must be carried out to place the required data files in the proper folders:

  • Unzip the ELITR-Bench data archive named data.zip using the password 'utter'.
  • Fetch and prepare the ELITR transcripts through the following steps:
    1. Download ELITR-minuting-corpus.zip from this link.
    2. Unzip the archive at the root of the project. This should create a directory named ELITR-minuting-corpus.
    3. Run the following Python script to extract the transcripts from ELITR-minuting-corpus (which includes moving them to data and renaming them based on their corresponding meeting ID):
      python -m preparation.extract_transcripts
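As an optional sanity check after this extraction step, you can verify that the transcripts ended up in data/; the snippet below assumes the transcripts are the only non-JSON files in that folder, which may not hold if you add other files there.

    # Optional sanity check: count the non-JSON files (assumed to be the
    # extracted transcripts) that were placed in the data/ folder.
    from pathlib import Path

    transcripts = [p for p in Path("data").iterdir() if p.is_file() and p.suffix != ".json"]
    print(f"{len(transcripts)} transcript files found in data/")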
      

Response generation

This section details the commands used to generate responses to the ELITR-Bench questions with the different models included in the paper.

  • Generating responses with LLaMA-2-based models is done with the inference/inference_llama2.py script. This script should be used for the following models: LongAlpaca-{7B, 13B}, LongChat-7B-v1.5, Vicuna-{7B, 13B}-v1.5, LongAlign-{7B, 13B}. Example command for inference with Vicuna-13b-v1.5 on ELITR-Bench-QA test set in single-turn mode:
    python -m inference.inference_llama2 --data_path "./data" --json_filename "elitr-bench-qa_test2.json" --base_model "lmsys/vicuna-13b-v1.5-16k" --flash_attn True --max_gen_len 512 --context_size 32768 --temperature 0.6 --top_p 0.9 --repetition_penalty 1.1 --do_sample True --conv_format "vicuna_v1.1" --q_marker True --mode st --seed 2023
    
  • Generating responses with LLaMA-3.1-8B is done with the inference/inference_llama3dot1.py script. This script requires the package versions indicated in the requirements-alt.txt file. Example command for inference on ELITR-Bench-QA test set in single-turn mode:
    python -m inference.inference_llama3dot1 --data_path "./data" --json_filename "elitr-bench-qa_test2.json" --base_model "meta-llama/Meta-Llama-3.1-8B-Instruct" --max_gen_len 512 --temperature 0.0 --top_p 1.0 --repetition_penalty 1.0 --do_sample False --q_marker False --mode st --seed 2023
    
  • Generating responses with Phi-3-small-128k is done with the inference/inference_phi3.py script. This script requires the package versions indicated in the requirements-alt.txt file. Example command for inference on ELITR-Bench-QA test set in single-turn mode:
    python -m inference.inference_phi3 --data_path "./data" --json_filename "elitr-bench-qa_test2.json" --base_model "microsoft/Phi-3-small-128k-instruct" --max_gen_len 512 --temperature 0.0 --top_p 1.0 --repetition_penalty 1.0 --do_sample False --q_marker False --mode st --seed 2023
    
  • Generating responses with OpenAI GPT models is done with the inference/inference_gpt.py script. This script requires that the environment variable OPENAI_API_KEY is defined in your bash profile, to provide access to your OpenAI account. Example command for inference with GPT-4 on ELITR-Bench-QA test set in single-turn mode:
    python -m inference.inference_gpt --data_path "./data" --json_filename "elitr-bench-qa_test2.json" --base_model "gpt-4-1106-preview" --max_gen_len 512 --temperature 0.6 --top_p 0.9 --mode st --seed 2023
    

To generate responses in the cross-lingual scenario (with English transcripts, Czech questions and expecting Czech responses), the argument --lang "czech" should be added to the command to select the appropriate prompt. Note that the cross-lingual scenario also requires using the JSON files contained in the czech.zip archive (e.g., using elitr-bench-czech-qa_test2.json instead of elitr-bench-qa_test2.json).
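For instance, the LLaMA-3.1 command above adapted to the cross-lingual scenario would look as follows (assuming the Czech JSON files from czech.zip have been placed under ./data):

    python -m inference.inference_llama3dot1 --data_path "./data" --json_filename "elitr-bench-czech-qa_test2.json" --base_model "meta-llama/Meta-Llama-3.1-8B-Instruct" --max_gen_len 512 --temperature 0.0 --top_p 1.0 --repetition_penalty 1.0 --do_sample False --q_marker False --mode st --seed 2023 --lang "czech"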

Evaluation

Generated responses can be automatically evaluated using either GPT-4 (and other OpenAI models) or Prometheus as LLM-judges:

  • The evaluation with OpenAI GPT models is done with the eval/eval_gpt.py script. This script requires that the environment variable OPENAI_API_KEY is defined in your bash profile, to provide access to your OpenAI account. Example command for evaluation with GPT-4 on responses generated for ELITR-Bench-QA test set in single-turn mode:
    python -m eval.eval_gpt --data_path "./data" --json_filename "elitr-bench-qa_test2_st_s2023.json" --base_model "gpt-4-0613" --temperature 0.6 --top_p 0.9 --max_gen_len 1024 --seed 2023
    
  • The evaluation with Prometheus is done with the eval/eval_prometheus.py script. Example command for evaluation on responses generated for ELITR-Bench-QA test set in single-turn mode:
    python -m eval.eval_prometheus --data_path "./data" --json_filename "elitr-bench-qa_test2_st_s2023.json" --base_model "prometheus-eval/prometheus-13b-v1.0" --temperature 0.6 --top_p 0.9 --max_gen_len 1024 --repetition_penalty 1.0 --do_sample True --seed 2023
    

Similarly to the response generation scripts, the argument --lang "czech" should be added to the commands above to perform the evaluation in the English-Czech cross-lingual scenario. Again, the cross-lingual scenario requires using the JSON files contained in the czech.zip archive (e.g., using elitr-bench-czech-qa_test2.json instead of elitr-bench-qa_test2.json).

Funding

This is an output of the European Project UTTER (Unified Transcription and Translation for Extended Reality), funded by the European Union's Horizon Europe Research and Innovation programme under grant agreement number 101070631.

For more information please visit https://he-utter.eu/

NAVER LABS Europe: https://europe.naverlabs.com/

License

The code license is provided in LICENSE-CODE.txt; the data is released under the CC-BY-4.0 license (see LICENSE-DATA.txt).
