This repo releases the final data for the NAACL 2025 paper *FaithBench: A Diverse Hallucination Benchmark for Summarization by Modern LLMs*. A more detailed repo is here.
Under the folder `./scripts/`, the following scripts help you process the data:

- `./scripts/binarize.py` aggregates the annotations into sample-level labels and binarizes them to 0/1 for binary classification (a simplified sketch of this idea appears after the taxonomy below).
- `./scripts/how_to_load.py` is an example of how to load the data.
Human-annotated samples are released under `./data_for_release/` as `batch_{batch_id}.json`, where `{batch_id}` ranges from 1 to 16, except 13.
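For a quick start, here is a minimal loading sketch using only the standard library; `./scripts/how_to_load.py` is the official example, and the `DATA_DIR` path below is an assumption about where the data sits relative to your working directory.

```python
import json
from pathlib import Path

# Adjust this to wherever you keep the released data.
DATA_DIR = Path("data_for_release")

# Batch 13 was not released, so it is skipped.
batch_ids = [i for i in range(1, 17) if i != 13]

batches = {}
for batch_id in batch_ids:
    with open(DATA_DIR / f"batch_{batch_id}.json", encoding="utf-8") as f:
        batches[batch_id] = json.load(f)

total = sum(len(batch["samples"]) for batch in batches.values())
print(f"Loaded {total} samples from {len(batches)} batches")
```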
The data schema is defined in `./scripts/faithbench_schema.py`. In short, a sample contains:

- `sample_id`: the id of the sample in this batch
- `source`: the source text
- `summary`: the summary text
- `annotations`: a list of annotations -- can be empty if no hallucinations are found. A summary may carry multiple annotations because different annotators may flag different spans of the summary as inconsistent. The annotation of each inconsistent span includes the label, the note, the span text in the summary, etc.
- `metadata`: metadata about the sample, including the summarizer LLM, the predictions of SOTA detectors, and the raw sample id.
The JSON file of each batch is a dump of the `FaithBenchBatch` object (defined in `./scripts/faithbench_schema.py`):
```json
{
"samples": [
{
"sample_id": 0,
"source": "Poseidon (film) . Poseidon grossed $ 181,674,817 at the worldwide box office on a budget of $ 160 million .",
"summary": " The film \"Poseidon\" grossed $181,674,817 at the worldwide box office, with a production budget of $160 million.",
"annotations": [
{
"annot_id": 1,
"annotator": "a3ac21668e6249b7978617da547f2708",
"annotator_name": "Alan Turing",
"label": [
"Unwanted",
"Unwanted.Instrinsic"
],
"note": "\"budget\" (source) vs. \"production budget\" (summary)\nThe budget for a movie may also include non-production budget such as distribution, advertising. ",
"summary_span": "production",
"summary_start": 78,
"summary_end": 88
},
{
"annot_id": 60,
"annotator": "69a785fa7f454e7da5eef3c608b2133a",
"annotator_name": "Claude Shannon",
"label": [
"Unwanted",
"Unwanted.Instrinsic"
],
"note": "\"budget\" (source) vs. \"production budget\" (summary) The budget for a movie may also include non-production budget such as distribution, advertising. ",
"summary_span": "production",
"summary_start": 78,
"summary_end": 88
}
],
"metadata": {
"summarizer": "mistralai/Mistral-7B-Instruct-v0.3",
"hhemv1": 0.9995,
"hhem-2.1": 0.52694,
"hhem-2.1-english": 0.98313,
"trueteacher": 1,
"true_nli": 1,
"gpt-3.5-turbo": 1,
"gpt-4-turbo": 1,
"gpt_4o": 1,
"raw_sample_id": 15
}
},
// More samples in this batch
]
}
```
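In the example above, `summary_start`/`summary_end` appear to follow Python's half-open slicing convention (`summary[78:88]` is `"production"`). Below is a small sanity-check sketch built on that assumption; it uses only field names visible in the example.

```python
import json

with open("data_for_release/batch_1.json", encoding="utf-8") as f:
    batch = json.load(f)

for sample in batch["samples"]:
    summary = sample["summary"]
    for annot in sample["annotations"]:
        span = summary[annot["summary_start"]:annot["summary_end"]]
        # Warn if the stored offsets do not reproduce the stored span text.
        if span != annot["summary_span"]:
            print(f"sample {sample['sample_id']}, annot {annot['annot_id']}: "
                  f"{span!r} != {annot['summary_span']!r}")
```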
Today's hallucination detectors, including those using LLMs as judges (LLM-as-a-judge), are very effective at detecting trivial hallucinations. FaithBench does not include trivial hallucinations because they won't help improve the detectors.
But finding hard or challenging samples that are worth annotating is a chicken-and-egg problem:
- Let's find and annotate samples that are hard for SOTA hallucination detectors.
- How do we know which samples are hard? We have to annotate them first.
FaithBench takes a proxy route: we hypothesize that a hard sample will receive split judgments from SOTA hallucination detectors. Thus, we send each sample to multiple SOTA hallucination detectors and select the samples whose judgments have the highest entropy.
```mermaid
graph TD
A["A sample (a pair of source and summary)"]
subgraph JG[SOTA detectors]
B[GPT-4o-as-a-judge]
C[GPT-3.5-Turbo-as-a-judge]
D[Google's TrueTeacher]
E[Google's True NLI]
F[Vectara's HHEM-2.1-Open]
end
A -->B
A -->C
A -->D
A -->E
A -->F
G[Computing entropy]
B -->G
C -->G
D -->G
E -->G
F -->G
```
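To make the selection criterion concrete, here is a minimal sketch of the entropy computation over binarized detector verdicts. It assumes each detector's raw score has already been thresholded to 0 (hallucinated) or 1 (consistent); the actual thresholds and candidate pool are those described in the paper, not part of this sketch.

```python
import math

def judgment_entropy(judgments: list[int]) -> float:
    """Shannon entropy (in bits) of a list of binary detector verdicts."""
    if not judgments:
        return 0.0
    p = sum(judgments) / len(judgments)  # fraction of "consistent" verdicts
    if p in (0.0, 1.0):                  # unanimous detectors -> zero entropy
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

# 3 detectors say consistent, 2 say hallucinated -> high entropy, a "hard" sample
print(judgment_entropy([1, 1, 1, 0, 0]))  # ~0.971
# Unanimous verdicts -> zero entropy, an "easy" sample
print(judgment_entropy([1, 1, 1, 1, 1]))  # 0.0
```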
In FaithBench, not all hallucinations are bad. Some are benign. Our taxonomy is:

- Consistent -- no annotations needed
- Questionable
- Hallucinated
  - Unwanted
    - Intrinsic
    - Extrinsic
  - Benign
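`./scripts/binarize.py` is the authoritative aggregation script. The sketch below only illustrates one plausible mapping from this taxonomy to a binary label: any span labeled `Unwanted` (including its `Intrinsic`/`Extrinsic` sub-labels) marks the sample as hallucinated. Whether `Questionable` or `Benign` spans should also count is a design choice made in the script and the paper, so treat the prefix set here as an assumption.

```python
def binarize_sample(sample: dict, hallucinated_prefixes: tuple = ("Unwanted",)) -> int:
    """Return 1 if any annotation carries a label starting with one of
    `hallucinated_prefixes`, else 0.  The prefix set is an illustrative
    assumption; see ./scripts/binarize.py for the actual rule."""
    for annot in sample.get("annotations", []):
        for label in annot.get("label", []):
            if label.startswith(hallucinated_prefixes):
                return 1
    return 0

# For the sample shown earlier, both annotations carry "Unwanted.*" labels,
# so the sample would be binarized to 1 (hallucinated).
```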
```bibtex
@inproceedings{faithbench-2025-naacl,
title = "{F}aith{B}ench: A Diverse Hallucination Benchmark for Summarization by Modern {LLM}s",
author = "Bao, Forrest Sheng and
Li, Miaoran and
Qu, Renyi and
Luo, Ge and
Wan, Erana and
Tang, Yujia and
Fan, Weisi and
Tamber, Manveer Singh and
Kazi, Suleman and
Sourabh, Vivek and
Qi, Mike and
Tu, Ruixuan and
Xu, Chenyu and
Gonzales, Matthew and
Mendelevitch, Ofer and
Ahmad, Amin",
booktitle = "Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers)",
month = apr,
year = "2025",
address = "Albuquerque, New Mexico",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.naacl-short.38/",
pages = "448--461",
ISBN = "979-8-89176-190-2"
}
```