This repository contains the full pipeline and training data used to generate fine-tuning datasets from raw ESG (Environmental, Social, and Governance) reports. These datasets are used to train large language models (LLMs) on tasks such as:
- Metadata extraction
- Quantitative KPI extraction
- Guideline alignment (e.g. GRI disclosures)
You can use this dataset to fine-tune or evaluate models like:
- OpenAI's
gpt-3.5-turbo
(via supervised fine-tuning API) - Google's Gemini (via instruction-tuned comparisons)
- Anthropic's Claude (in eval or prompt tuning)
- Any local transformer model (e.g. FLAN-T5, LLaMA) using Hugging Face
├── reports/ # Original ESG reports (PDFs)
├── json_reports/ # Extracted JSON per report
├── all_reports.json # Combined full text dataset
├── esg_metadata_all.jsonl # Metadata training records
├── esg_kpi_all.jsonl # KPI extraction records
├── esg_guideline_all.jsonl # Guideline alignment records
├── esg_multi_task_dataset.jsonl # All records merged for multitask fine-tuning
├── readme.txt # Plaintext summary of process
├── README.md # This file (Markdown)
You can run this entire pipeline using Google Colab or a local Python 3.10+ environment.
git clone https://github.com/your-org/esg-fine-tuning-dataset.git
cd esg-fine-tuning-dataset
pip install pymupdf datasets scikit-learn tqdm
from datasets import load_dataset
dataset = load_dataset("DataNeed/company-reports")
Use PyMuPDF
to extract:
- Full report text
- Per-page breakdown
- Company name and year (from filename)
import fitz
def pdf_to_json(pdf_path): ...
Output will be saved to
json_reports/
andall_reports.json
.
- Extracts company name, report year, topics (e.g. "climate change", "governance")
- Detects numerical disclosures (e.g., "Scope 1 emissions = 5,000 tCO2e")
- Maps metrics to GRI IDs via keyword + fuzzy search
- Matches report chunks to best-fit GRI disclosures using TF-IDF semantic similarity
- Includes confidence scores for ranking
{
"task": "Quantitative ESG data extraction",
"input": "In 2022, we reduced Scope
5D6F
1 emissions to 4,500 tCO2e...",
"output": {
"disclosures": {
"305-1": {
"scope_1_emissions": 4500,
"unit": "tCO2e"
}
}
}
}
esg_metadata_all.jsonl
esg_kpi_all.jsonl
esg_guideline_all.jsonl
esg_multi_task_dataset.jsonl
← all tasks merged
You can upload the final JSONL to:
- OpenAI: via
openai api fine_tunes.create
- Gemini / Claude: for side-by-side prompt testing or internal finetuning
- Hugging Face: train with
Trainer
orPEFT
for LoRA
This dataset is derived from public ESG disclosures. Use is governed by each source report’s terms. The code in this repo is MIT licensed.
Built using:
Contributors: Your Name, Your Organization