
Climate+Tech FactChecker

Climate+Tech FactChecker is a toolkit for claim verification, built for both experimentation and production. It provides a robust suite of tools and methodologies that help researchers, developers, and practitioners test hypotheses, validate data, and deploy reliable fact-checking solutions.

Installation

python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Quick Start

  1. After installation, configure the environment variables: which LLM backend to use (OpenAI or Ollama), the embedding model, chunk size, and so on. See the Configuration section below.

  2. Download the sources via the sources downloader:
python -m factchecker.tools.sources_downloader

There is a detailed explanation of sources in the Dealing with sources section below.

  3. Run an example

We recommend running the Advocate-Mediator example to verify that everything is set up correctly:

python -m factchecker.experiments.advocate_mediator_climatefeedback.advocate_mediator_climatefeedback

Configuration

Setting Up Environment Variables

Before running the Climate+Tech FactChecker, you need to configure your environment variables. This is done using the .env.example file provided in the repository.

  1. Copy the .env.example file to a new file named .env:

    cp .env.example .env
  2. Edit the .env file to include your specific configuration:

    • Replace your_openai_api_key_here with your actual OpenAI API key.
    • Update OPENAI_ORGANIZATION with your OpenAI organization ID if applicable.
    • Adjust other variables like LLM_TYPE, OLLAMA_API_BASE_URL, and OLLAMA_MODEL as needed for your setup.
    • Configure embedding model settings (see the Embedding Models section below).
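
For orientation, a minimal .env might look like the following; the variable names are those documented in this README, and all values are placeholders:

LLM_TYPE=openai
OPENAI_API_KEY=your_openai_api_key_here
OPENAI_ORGANIZATION=your_org_id
OPENAI_API_MODEL=gpt-3.5-turbo-1106
TEMPERATURE=0.1
EMBEDDING_TYPE=openai
OPENAI_EMBEDDING_MODEL=text-embedding-ada-002

# Ollama alternative:
# LLM_TYPE=ollama
# OLLAMA_MODEL=mistral
# OLLAMA_API_BASE_URL=http://localhost:11434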

Project Structure

The project follows a modular structure:

  • factchecker/: Main package directory
    • core/: Core functionality including LLM and embedding models
    • experiments/: Contains experiment scripts for different fact-checking approaches
    • strategies/: Core fact-checking strategy implementations
    • utils/: Utility functions and helper modules
    • tools/: Utility scripts for tasks like downloading sources
  • tests/: Test suite following the same structure as the main package
  • storage/: Data storage for indices and other persistent data
  • data/: (gitignored) Directory for storing downloaded source documents

Modules Explained

Configuring the LLM

The project uses LlamaIndex's LLM interface through the factchecker/core/llm.py module:

  1. OpenAI (Default)

    • Set LLM_TYPE=openai in .env
    • Required settings:
      • OPENAI_API_KEY: Your OpenAI API key
    • Optional settings:
      • OPENAI_API_BASE: Custom API endpoint (default: OpenAI's API)
      • OPENAI_API_MODEL: Model to use (default: "gpt-3.5-turbo-1106")
      • OPENAI_ORGANIZATION: Your organization ID
      • TEMPERATURE: Model temperature (default: 0.1) - shared with Ollama
    • Uses llama_index.llms.openai.OpenAI under the hood
  2. Ollama

    • Set LLM_TYPE=ollama in .env
    • Required settings:
      • OLLAMA_MODEL: Model to use (e.g., "llama2", "mistral")
    • Optional settings:
      • OLLAMA_API_BASE_URL: Custom API endpoint (default: "http://localhost:11434")
      • OLLAMA_REQUEST_TIMEOUT: Request timeout in seconds (default: 120.0)
      • TEMPERATURE: Model temperature (default: 0.1) - shared with OpenAI
    • Uses llama_index.llms.ollama.Ollama under the hood

Example usage:

from factchecker.core.llm import load_llm

# Using default OpenAI settings
llm = load_llm()

# Using OpenAI with custom settings
llm = load_llm(
    llm_type="openai",
    model="gpt-3.5-turbo-1106",
    temperature=0.1,
    api_key="your-key",
    organization="your-org",
    context_window=4096  # Optional: control context window size
)

# Using Ollama with custom settings
llm = load_llm(
    llm_type="ollama",
    model="mistral",
    temperature=0.1,
    request_timeout=120.0,
    context_window=4096  # Optional: control context window size
)

Note: The LLM interface is compatible with LlamaIndex's query engine, retriever, and other components. You can use any LlamaIndex-supported LLM by modifying the loader implementation.
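
For example, a loaded LLM can be set as the global default for LlamaIndex components. This is a minimal sketch assuming llama_index >= 0.10, where Settings lives in llama_index.core:

from llama_index.core import Settings

from factchecker.core.llm import load_llm

# Every query engine or retriever built after this point uses this LLM
Settings.llm = load_llm(llm_type="ollama", model="mistral")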

Embedding Models

The project uses LlamaIndex's embedding interface through the factchecker/core/embeddings.py module:

  1. OpenAI Embeddings (Default)

    • Set EMBEDDING_TYPE=openai in .env (or omit for default)
    • Required settings:
      • OPENAI_API_KEY: Your OpenAI API key
    • Optional settings:
      • OPENAI_EMBEDDING_MODEL: Model to use (default: "text-embedding-ada-002")
      • OPENAI_API_BASE: Custom API endpoint
    • Features:
      • High-quality embeddings
      • Consistent dimensionality (1536)
      • Production-ready reliability
    • Uses llama_index.embeddings.openai.OpenAIEmbedding
  2. HuggingFace Embeddings

    • Set EMBEDDING_TYPE=huggingface in .env
    • Optional settings:
      • HUGGINGFACE_EMBEDDING_MODEL: Model to use (default: "BAAI/bge-small-en-v1.5")
    • Additional kwargs support:
      • device: CPU/GPU selection ("cpu", "cuda", etc.)
      • normalize_embeddings: Whether to normalize vectors
      • Any other kwargs supported by the model
    • Features:
      • Local execution capability
      • Wide range of available models
      • Customizable model loading
    • Uses llama_index.embeddings.huggingface.HuggingFaceEmbedding
  3. Ollama Embeddings

    • Set EMBEDDING_TYPE=ollama in .env
    • Optional settings:
      • OLLAMA_MODEL: Model to use (default: "nomic-embed-text")
    • Additional kwargs support:
      • base_url: Custom Ollama endpoint (e.g., "http://localhost:11434")
      • request_timeout: Specific request timeout
    • Features:
      • Local execution
      • Integration with Ollama's model ecosystem
      • No API key required
    • Uses llama_index.embeddings.ollama.OllamaEmbedding

Example usage:

from factchecker.core.embeddings import load_embedding_model

# Default OpenAI embeddings
embeddings = load_embedding_model()

# OpenAI with custom settings
embeddings = load_embedding_model(
    embedding_type="openai",
    model_name="text-embedding-ada-002",
    api_key="your-key",
    api_base="custom-endpoint"
)

# HuggingFace with custom settings
embeddings = load_embedding_model(
    embedding_type="huggingface",
    model_name="BAAI/bge-small-en-v1.5",
    device="cuda",
    normalize_embeddings=True
)

# Ollama with custom settings
embeddings = load_embedding_model(
    embedding_type="ollama",
    model_name="nomic-embed-text",
    base_url="http://custom-server:11434",
    request_timeout=60
)

Note: The embedding interface is compatible with LlamaIndex's vector stores, retrievers, and other components. The embeddings are used for semantic search and similarity comparisons in the fact-checking process.
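
For example, embeddings can be computed and compared directly through LlamaIndex's standard interface (get_text_embedding and similarity are part of LlamaIndex's BaseEmbedding API; the sample texts are illustrative):

from factchecker.core.embeddings import load_embedding_model

embeddings = load_embedding_model(embedding_type="huggingface")

# Embed two texts and compare them (cosine similarity by default)
claim_vec = embeddings.get_text_embedding("Global sea levels are rising.")
evidence_vec = embeddings.get_text_embedding("Satellite data shows sea level rise.")
print(embeddings.similarity(claim_vec, evidence_vec))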

Utility Modules

The factchecker/utils/ directory contains shared functionality used across the project:

  • verdict_mapping.py: Standardizes the mapping of different verdict labels
    • Converts between various rating systems (e.g., Climate Feedback ratings)
    • Provides consistent verdict categories across the project
    • Includes comprehensive test coverage in tests/utils/test_verdict_mapping.py

Additional utilities handle common operations like data processing, API interactions, and shared helper functions. All utility modules follow TDD principles with corresponding test files in the tests/utils/ directory.
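
For example, a raw rating produced by an experiment can be normalized with map_verdict (the input label here is illustrative; the accepted rating strings are defined in verdict_mapping.py):

from factchecker.utils.verdict_mapping import map_verdict

# Map a source-specific rating onto the project's standardized categories
standardized = map_verdict("mostly_inaccurate")
print(standardized)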

Running Experiments

The project includes several experiment scripts to evaluate different fact-checking approaches:

  1. Climate Feedback Advocate-Mediator Experiment:

    python -m factchecker.experiments.advocate_mediator_climatefeedback.advocate_mediator_climatefeedback

    This experiment implements a fact-checking approach using the advocate-mediator pattern with Climate Feedback data. The process:

    • Uses Climate Feedback's expert-reviewed claims as ground truth
    • Implements an advocate-mediator pattern where:
      • Advocates argue for/against the claim's validity
      • A mediator evaluates the arguments and provides a final verdict
    • Verdicts are standardized using the utility mapping system
    • Results are compared against expert ratings for evaluation

    Building Your Own Advocate-Mediator Experiment:

    # Core components needed:
    from factchecker.strategies.advocate_mediator import AdvocateMediatorStrategy
    from factchecker.utils.verdict_mapping import map_verdict
    
    # 1. Define your strategy with custom prompts
    strategy = AdvocateMediatorStrategy(
        advocate_prompt="Your custom advocate prompt...",
        mediator_prompt="Your custom mediator prompt..."
    )
    
    # 2. Process your claims
    claim = "Your claim text..."
    context = "Supporting context/evidence..."
    result = strategy.evaluate_claim(claim, context)
    
    # 3. Map the verdict to standardized format
    standardized_verdict = map_verdict(result.verdict)

    The experiment structure consists of:

    • A strategy class in strategies/ implementing the core logic
    • An experiment script in experiments/ handling data loading and evaluation
    • Utility functions in utils/ for standardization and common operations

    LLM Setup Options:

    1. Using Ollama:

      from factchecker.core.llm import load_llm
      
      # Initialize Ollama with your chosen model
      llm = load_llm(llm_type="ollama", model="llama2")  # or any other supported model
      strategy = AdvocateMediatorStrategy(
          advocate_prompt="...",
          mediator_prompt="...",
          llm=llm
      )
      • Requires Ollama to be installed and running locally
      • No API key needed
      • Supports various open-source models
      • Full control over model deployment and infrastructure
      • Can run completely offline
    2. Using OpenAI:

      • Requires setting up OPENAI_API_KEY environment variable
      • Uses GPT models
      • Pay-per-use pricing model
      • No local infrastructure needed
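
      A matching sketch for OpenAI, using the loader documented above (model name illustrative):

      from factchecker.core.llm import load_llm
      
      # Reads OPENAI_API_KEY from the environment (.env)
      llm = load_llm(llm_type="openai", model="gpt-3.5-turbo-1106")
      strategy = AdvocateMediatorStrategy(
          advocate_prompt="...",
          mediator_prompt="...",
          llm=llm
      )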
  2. Evidence Evaluation:

    python -m factchecker.experiments.evidence_evaluation_1

When running experiments:

  • Set up your preferred LLM backend (Ollama recommended for development)
  • Check the experiment's source code for any specific configuration options
  • Results will typically be saved in the experiment's output directory

Testing

The project follows Test-Driven Development (TDD) principles:

  1. Running Tests:

    pytest

    This will run all tests in the tests/ directory.

  2. Writing Tests:

    • Create test files in the tests/ directory mirroring the main package structure
    • Name test files with test_ prefix (e.g., test_advocate_mediator.py)
    • Each test function should start with test_
    • Include both positive and negative test cases
    • Mock external API calls (e.g., OpenAI) so tests run without actual API usage; see the sketch after this list
  3. Test Coverage:

    pytest --cov=factchecker

    This command will show test coverage statistics.
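
As noted in the guidelines above, external API calls should be mocked. A minimal sketch using pytest's built-in monkeypatch fixture; the dummy key and assertion are illustrative, not the project's actual test suite:

# tests/core/test_llm.py (illustrative)
from factchecker.core.llm import load_llm

def test_load_llm_reads_env_key(monkeypatch):
    # Inject a dummy key so no real credentials are needed;
    # constructing the client object does not call the API.
    monkeypatch.setenv("OPENAI_API_KEY", "dummy-key")
    llm = load_llm(llm_type="openai")
    assert llm is not None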

Contribution

By contributing, you accept the CONTRIBUTOR_LICENSE_AGREEMENT.

Contact

Connect with us through various channels.

License

This project is dual-licensed:

  • For researchers, academic institutions, universities, and fact-checking organizations: This project is available under the GNU Affero General Public License (AGPL), with the requirement that any results and improvements are shared back with the community.

  • For other organizations: A license can be requested and can be granted free if the project's use is not purely commercial, under similar sharing conditions. Please contact the maintainers for licensing details.

See the LICENSE file for the complete terms.

Dealing with sources

Annotate sources

Request access to https://docs.google.com/spreadsheets/d/1R0-q5diheG3zXDBq8V2aoUGOQyRI6HuUisTf-4wTsWY/edit#gid=0

Download sources

The project includes a source downloader tool to fetch PDFs and other documents referenced in fact-checking claims. The tool reads a CSV file with metadata and downloads the files into organized folders.

1. Basic Usage

python -m factchecker.tools.sources_downloader

This uses default settings:

  • Source CSV: sources/sources.csv
  • Output folder: data/sources/
  • URL column: url
  • Filename column: output_filename
  • Subfolder column: output_subfolder

2. Custom Configuration

python -m factchecker.tools.sources_downloader \
  --sourcefile path/to/your_sources.csv \
  --output_folder data/sources \
  --url_column url \
  --output_filename_column output_filename \
  --output_subfolder_column output_subfolder \
  --row_indices 0 1 2  # Optional: download specific rows only

3. Expected CSV Format

Your CSV should include at least the following columns:

url,title,output_filename,output_subfolder
https://example.com/doc1.pdf,Example Report,example_report.pdf,ipcc
https://example.com/doc2.pdf,Another Report,another.pdf,wmo

This will result in files being downloaded to:

data/sources/ipcc/example_report.pdf
data/sources/wmo/another.pdf

4. Features

  • Automatic creation of output directories and subfolders
  • URL validation before download
  • Graceful handling of timeouts, connection errors, and failed downloads
  • Customizable column names via CLI
  • Optional row filtering by index for partial downloads
  • Logging of each downloaded file

5. Programmatic Usage

You can also use the SourcesDownloader in Python directly:

from factchecker.tools.sources_downloader import SourcesDownloader

downloader = SourcesDownloader(output_folder="data/sources")
downloaded_files = downloader.download_pdfs_from_csv(
    sourcefile="sources/sources.csv",
    row_indices=None,
    url_column="url",
    output_filename_column="output_filename",
    output_subfolder_column="output_subfolder",
)
print(f"Downloaded files: {downloaded_files}")

6. Integration with Indexing

Each subfolder created via the output_subfolder column can be used as an indexable directory:

import os

main_source_directory = "data/sources"
index_subfolder = "subfolder_1"
indexer_options_list = [
    {
        'source_directory': os.path.join(main_source_directory, index_subfolder),
        'index_name': "example_index"
    }
]

Note: The /data folder is listed in .gitignore to prevent large source files from being committed to Git.
