🌱 ESG Report Fine-Tuning Dataset

This repository contains the full pipeline and training data used to generate fine-tuning datasets from raw ESG (Environmental, Social, and Governance) reports. These datasets are used to train large language models (LLMs) on tasks such as:

Metadata extraction
Quantitative KPI extraction
Guideline alignment (e.g. GRI disclosures)

📌 Use Case

You can use this dataset to fine-tune or evaluate models like:

OpenAI's gpt-3.5-turbo (via supervised fine-tuning API)
Google's Gemini (via instruction-tuned comparisons)
Anthropic's Claude (in eval or prompt tuning)
Any local transformer model (e.g. FLAN-T5, LLaMA) using Hugging Face

📁 Directory Structure

├── reports/                   # Original ESG reports (PDFs)
├── json_reports/             # Extracted JSON per report
├── all_reports.json          # Combined full text dataset
├── esg_metadata_all.jsonl    # Metadata training records
├── esg_kpi_all.jsonl         # KPI extraction records
├── esg_guideline_all.jsonl   # Guideline alignment records
├── esg_multi_task_dataset.jsonl  # All records merged for multitask fine-tuning
├── readme.txt                # Plaintext summary of process
├── README.md                 # This file (Markdown)

🛠️ Setup Instructions

You can run this entire pipeline using Google Colab or a local Python 3.10+ environment.

1. Clone the repo

git clone https://github.com/your-org/esg-fine-tuning-dataset.git
cd esg-fine-tuning-dataset

2. Install dependencies

pip install pymupdf datasets scikit-learn tqdm

3. Download ESG Reports (from Hugging Face)

from datasets import load_dataset
dataset = load_dataset("DataNeed/company-reports")

4. Convert PDFs to JSON

Use PyMuPDF to extract:

Full report text
Per-page breakdown
Company name and year (from filename)

import fitz
def pdf_to_json(pdf_path): ...

Output will be saved to json_reports/ and all_reports.json.

🧠 Fine-Tuning Tasks

1. Metadata Extraction

Extracts company name, report year, topics (e.g. "climate change", "governance")

2. Quantitative ESG KPI Extraction

Detects numerical disclosures (e.g., "Scope 1 emissions = 5,000 tCO2e")
Maps metrics to GRI IDs via keyword + fuzzy search

3. Guideline Alignment

Matches report chunks to best-fit GRI disclosures using TF-IDF semantic similarity
Includes confidence scores for ranking

🧪 Example Record

{
  "task": "Quantitative ESG data extraction",
  "input": "In 2022, we reduced Scope 
5D6F
1 emissions to 4,500 tCO2e...",
  "output": {
    "disclosures": {
      "305-1": {
        "scope_1_emissions": 4500,
        "unit": "tCO2e"
      }
    }
  }
}

📦 Output Files

esg_metadata_all.jsonl
esg_kpi_all.jsonl
esg_guideline_all.jsonl
esg_multi_task_dataset.jsonl ← all tasks merged

🤖 Fine-Tuning Instructions

You can upload the final JSONL to:

OpenAI: via openai api fine_tunes.create
Gemini / Claude: for side-by-side prompt testing or internal finetuning
Hugging Face: train with Trainer or PEFT for LoRA

🧾 License

This dataset is derived from public ESG disclosures. Use is governed by each source report’s terms. The code in this repo is MIT licensed.

🙌 Credits

Built using:

Contributors: Your Name, Your Organization

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

🌱 ESG Report Fine-Tuning Dataset

📌 Use Case

📁 Directory Structure

🛠️ Setup Instructions

1. Clone the repo

2. Install dependencies

3. Download ESG Reports (from Hugging Face)

4. Convert PDFs to JSON

🧠 Fine-Tuning Tasks

1. Metadata Extraction

2. Quantitative ESG KPI Extraction

3. Guideline Alignment

🧪 Example Record

📦 Output Files

🤖 Fine-Tuning Instructions

🧾 License

🙌 Credits

About

Uh oh!

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
.gitignore		.gitignore
README.md		README.md
dataset_prep.py		dataset_prep.py
esg_guideline_all.jsonl		esg_guideline_all.jsonl
esg_kpi_all.jsonl		esg_kpi_all.jsonl
esg_metadata_all.jsonl		esg_metadata_all.jsonl
esg_multi_task_dataset.jsonl		esg_multi_task_dataset.jsonl

smartcontracts0/esg-data

Folders and files

Latest commit

History

Repository files navigation

🌱 ESG Report Fine-Tuning Dataset

📌 Use Case

📁 Directory Structure

🛠️ Setup Instructions

1. Clone the repo

2. Install dependencies

3. Download ESG Reports (from Hugging Face)

4. Convert PDFs to JSON

🧠 Fine-Tuning Tasks

1. Metadata Extraction

2. Quantitative ESG KPI Extraction

3. Guideline Alignment

🧪 Example Record

📦 Output Files

🤖 Fine-Tuning Instructions

🧾 License

🙌 Credits

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages