A Python tool for extracting structured data from PDF files (like invoices or reports), processing this information, and generating actionable insights.
- Document Processing: Extracts key information from PDF documents using pattern recognition
- Data Organization: Transforms raw data into structured formats (CSV, Excel, JSON)
- Automated Analysis: Performs calculations to identify trends, totals, and anomalies
- Visualization: Generates charts and summary reports to highlight key findings
- Clone the repository:
  git clone https://github.com/abhiya492/pdf-data-extraction-tool.git
  cd pdf_analyzer
- Install the required packages:
  pip install -r requirements.txt
The tool has two main modes: invoice processing and report processing.
python src/main.py --type invoice --input sample_pdfs/invoices --output invoice-output
python src/main.py --type report --input sample_pdfs/reports --output report-output
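As a rough illustration of how the three flags used above could be parsed, here is a minimal sketch with the standard `argparse` module. It is an assumption about the interface, not a copy of `src/main.py`; the `build_parser` helper and the help texts are hypothetical.

```python
import argparse


def build_parser():
    """Build a parser for the --type, --input and --output flags (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="Extract and analyze data from PDFs.")
    # The two modes named in this README; the real tool may accept others.
    parser.add_argument("--type", choices=["invoice", "report"], required=True,
                        help="kind of document to process")
    parser.add_argument("--input", required=True,
                        help="directory containing the PDF files")
    parser.add_argument("--output", required=True,
                        help="directory where results are written")
    return parser


# Parse the same arguments as the invoice command shown above.
args = build_parser().parse_args(
    ["--type", "invoice", "--input", "sample_pdfs/invoices", "--output", "invoice-output"]
)
print(args.type, args.input, args.output)
```

Restricting `--type` with `choices` makes a mistyped mode fail early with a usage message instead of partway through processing.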
The tool generates several output files:
- CSV and Excel files: Containing structured data extracted from PDFs
- JSON files: Raw extracted data and analysis insights
- Charts: Visual representations of the data and anomalies
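To give a feel for the CSV and JSON outputs, the sketch below writes a small list of records in both formats using only the standard library. The file names `extracted.csv` and `extracted.json`, the `write_outputs` helper, and the record fields are assumptions for illustration; the tool's actual file names and columns may differ, and Excel output would need a third-party library.

```python
import csv
import json
import tempfile
from pathlib import Path

# Illustrative records; the real fields depend on the extraction patterns.
records = [
    {"invoice_number": "INV-001", "total": 120.0},
    {"invoice_number": "INV-002", "total": 135.5},
]


def write_outputs(records, output_dir):
    """Write records as CSV and JSON into output_dir (hypothetical helper)."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    csv_path = out / "extracted.csv"    # hypothetical file name
    json_path = out / "extracted.json"  # hypothetical file name

    # One row per record, with a header taken from the first record's keys.
    with csv_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(records[0].keys()))
        writer.writeheader()
        writer.writerows(records)

    # The same records as a JSON array.
    with json_path.open("w", encoding="utf-8") as f:
        json.dump(records, f, indent=2)

    return csv_path, json_path


with tempfile.TemporaryDirectory() as tmp:
    csv_path, json_path = write_outputs(records, tmp)
    csv_text = csv_path.read_text(encoding="utf-8")
    loaded = json.loads(json_path.read_text(encoding="utf-8"))

print(csv_text)
```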
pdf_analyzer/
├── src/
│ ├── pdf_extractor.py - PDF data extraction classes
│ ├── data_processor.py - Data processing and analysis
│ ├── visualizer.py - Data visualization
│ └── main.py - Main application entry point
├── tests/
│ └── test_extractor.py - Tests for PDF extractor
├── sample_pdfs/ - Sample PDF files for testing
└── requirements.txt - Required Python packages
- PDF Extraction: The tool uses pattern matching and regular expressions to extract key information from PDFs.
- Data Processing: Extracted data is cleaned and organized into structured formats.
- Analysis: The tool calculates statistics, identifies trends, and detects anomalies.
- Visualization: Results are presented in charts to make the insights easily digestible.
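The first three steps above can be sketched in a few lines. This is a minimal illustration, not the tool's own code: it works on already-extracted text, the patterns and function names are hypothetical, and the anomaly rule (a total more than three times the median) is an assumption chosen for simplicity.

```python
import re
from statistics import median

# Hypothetical patterns; the real ones are defined in the extractor classes.
PATTERNS = {
    "invoice_number": re.compile(r"Invoice\s*(?:No\.?|#)\s*:?\s*(\S+)", re.IGNORECASE),
    "total": re.compile(r"Total\s*:?\s*\$?\s*([\d,]+\.\d{2})", re.IGNORECASE),
}


def extract_fields(text):
    """Extraction: return the fields found in one document's text."""
    record = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        record[name] = match.group(1) if match else None
    return record


def clean(record):
    """Processing: convert a total such as '1,250.00' to a float."""
    cleaned = dict(record)
    total = record.get("total")
    cleaned["total"] = float(total.replace(",", "")) if total else None
    return cleaned


def find_anomalies(records, factor=3.0):
    """Analysis: flag records whose total exceeds factor times the median total."""
    totals = [r["total"] for r in records if r["total"] is not None]
    if not totals:
        return []
    threshold = factor * median(totals)
    return [r for r in records if r["total"] is not None and r["total"] > threshold]


texts = [
    "Invoice No: INV-001\nTotal: $120.00",
    "Invoice No: INV-002\nTotal: $135.50",
    "Invoice No: INV-003\nTotal: $1,250.00",
]
records = [clean(extract_fields(t)) for t in texts]
anomalies = find_anomalies(records)
print([r["invoice_number"] for r in anomalies])  # prints ['INV-003']
```

A median-based threshold is used here because a single very large invoice shifts the median far less than it shifts the mean.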
The extraction patterns can be customized by modifying the regular expressions in the InvoiceExtractor and ReportExtractor classes to fit your specific PDF formats.
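One way such a customization might look is shown below. The `InvoiceExtractorSketch` class, its `patterns` attribute, and its `extract` method are hypothetical stand-ins; the real InvoiceExtractor may store and apply its regular expressions differently, so treat this only as a pattern to adapt.

```python
import re


class InvoiceExtractorSketch:
    """Hypothetical stand-in for the real InvoiceExtractor."""

    # Field name -> regular expression with one capture group.
    patterns = {
        "invoice_number": r"Invoice\s*#\s*(\w+)",
        "date": r"Date:\s*(\d{4}-\d{2}-\d{2})",
    }

    def extract(self, text):
        """Apply every pattern to the text and collect the captured values."""
        result = {}
        for field, pattern in self.patterns.items():
            match = re.search(pattern, text)
            result[field] = match.group(1) if match else None
        return result


class GermanInvoiceExtractor(InvoiceExtractorSketch):
    """Example customization for a different invoice layout (hypothetical)."""

    # Keep the inherited patterns and override only the ones that differ.
    patterns = {
        **InvoiceExtractorSketch.patterns,
        "invoice_number": r"Rechnung\s*Nr\.\s*(\w+)",
        "date": r"Datum:\s*(\d{2}\.\d{2}\.\d{4})",
    }


sample = "Rechnung Nr. 4711\nDatum: 03.02.2024"
fields = GermanInvoiceExtractor().extract(sample)
print(fields)
```

Keeping the patterns in a dictionary means a new PDF layout only needs new regular expressions, not new extraction logic.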
Run the tests with:
python -m unittest discover tests
License: MIT