PDF Data Extraction and Analysis Tool

A Python tool that extracts structured data from PDF files such as invoices and reports, processes it, and produces summary statistics, anomaly flags, and charts.

Features

  • Document Processing: Extracts key information from PDF documents using pattern recognition
  • Data Organization: Transforms raw data into structured formats (CSV, Excel, JSON)
  • Automated Analysis: Performs calculations to identify trends, totals, and anomalies
  • Visualization: Generates charts and summary reports to highlight key findings

Installation

  1. Clone the repository:

    git clone https://github.com/abhiya492/pdf-data-extraction-tool.git
    cd pdf-data-extraction-tool
    
  2. Install the required packages:

    pip install -r requirements.txt
    

Usage

The tool has two modes, selected with the --type flag: invoice processing and report processing. The --input and --output flags name the input and output locations.

Processing Invoices

    python src/main.py --type invoice --input sample_pdfs/invoices --output invoice-output

Processing Reports

    python src/main.py --type report --input sample_pdfs/reports --output report-output
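
For readers who want to script around the tool, here is a minimal sketch of how a command line with these three flags could be parsed. Only the flag names and the two --type values come from the commands above; the function name build_parser, the help texts, and the decision to make every flag required are assumptions, not the actual contents of src/main.py.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the flags shown in the usage commands (hypothetical)."""
    parser = argparse.ArgumentParser(
        description="Extract and analyze data from PDF documents"
    )
    # The two documented modes; any other value is rejected by argparse.
    parser.add_argument("--type", choices=["invoice", "report"], required=True,
                        help="kind of document to process")
    parser.add_argument("--input", required=True,
                        help="location of the PDFs to process")
    parser.add_argument("--output", required=True,
                        help="location where results are written")
    return parser


# Parse the same arguments as the invoice example above.
args = build_parser().parse_args(
    ["--type", "invoice", "--input", "sample_pdfs/invoices", "--output", "invoice-output"]
)
print(args.type, args.input, args.output)
```

Restricting --type with choices means a typo such as "--type invoices" fails immediately with a usage message instead of silently running the wrong mode.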

Output

The tool generates several output files:

  • CSV and Excel files: Containing structured data extracted from PDFs
  • JSON files: Raw extracted data and analysis insights
  • Charts: Visual representations of the data and anomalies
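
As an illustration of the CSV and JSON outputs, the sketch below writes a list of extracted records and a dictionary of insights using only the standard library. The function name write_outputs, the file names invoices.csv and insights.json, and the record fields are all hypothetical; the tool's real file names and layout may differ. Excel output is left out because it would need a third-party package.

```python
import csv
import json
import tempfile
from pathlib import Path


def write_outputs(records: list, insights: dict, out_dir: str) -> list:
    """Write records to CSV and records plus insights to JSON (illustrative only)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # One CSV row per extracted document; columns come from the first record.
    csv_path = out / "invoices.csv"  # hypothetical file name
    fieldnames = list(records[0].keys()) if records else []
    with csv_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(records)

    # JSON keeps the raw records alongside the analysis results.
    json_path = out / "insights.json"  # hypothetical file name
    with json_path.open("w", encoding="utf-8") as f:
        json.dump({"records": records, "insights": insights}, f, indent=2)

    return [csv_path, json_path]


# Example with made-up data, written to a temporary directory.
records = [
    {"invoice_number": "INV-1001", "total": 1250.0},
    {"invoice_number": "INV-1002", "total": 310.5},
]
insights = {"grand_total": sum(r["total"] for r in records)}
paths = write_outputs(records, insights, tempfile.mkdtemp())
print([p.name for p in paths])
```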

Project Structure

pdf_analyzer/
├── src/
│   ├── pdf_extractor.py  - PDF data extraction classes
│   ├── data_processor.py - Data processing and analysis
│   ├── visualizer.py     - Data visualization
│   └── main.py           - Main application entry point
├── tests/
│   └── test_extractor.py - Tests for PDF extractor
├── sample_pdfs/          - Sample PDF files for testing
└── requirements.txt      - Required Python packages

How It Works

  1. PDF Extraction: The tool uses pattern matching and regular expressions to extract key information from PDFs.
  2. Data Processing: Extracted data is cleaned and organized into structured formats.
  3. Analysis: The tool calculates statistics, identifies trends, and detects anomalies.
  4. Visualization: Results are presented in charts to make the insights easily digestible.
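
Steps 1 to 3 can be sketched with the standard library as follows. This is a simplified illustration under stated assumptions, not the repository's code: it starts from text that has already been pulled out of a PDF, the patterns in INVOICE_PATTERNS and the sample text are invented, and the anomaly check (distance from the median measured in median absolute deviations) is one common technique chosen for the example; the tool may use a different method.

```python
import re
import statistics

# Hypothetical patterns; the real ones live in the repository's extractor classes.
INVOICE_PATTERNS = {
    "invoice_number": re.compile(r"Invoice\s*(?:No\.?|Number|#)\s*:?\s*([A-Z0-9-]+)", re.IGNORECASE),
    "date": re.compile(r"Date\s*:?\s*(\d{4}-\d{2}-\d{2})", re.IGNORECASE),
    # \b keeps "Subtotal" from matching as "Total".
    "total": re.compile(r"\bTotal\s*:?\s*\$?\s*([\d,]+\.\d{2})", re.IGNORECASE),
}


def extract_fields(text: str) -> dict:
    """Step 1: pull raw string values out of the text with regular expressions."""
    fields = {}
    for name, pattern in INVOICE_PATTERNS.items():
        match = pattern.search(text)
        fields[name] = match.group(1) if match else None
    return fields


def clean_record(fields: dict) -> dict:
    """Step 2: convert raw strings into typed values."""
    record = dict(fields)
    if record.get("total") is not None:
        record["total"] = float(record["total"].replace(",", ""))
    return record


def find_anomalies(totals: list, threshold: float = 3.5) -> list:
    """Step 3: flag values far from the median, scaled by the median absolute deviation."""
    median = statistics.median(totals)
    mad = statistics.median(abs(t - median) for t in totals)
    if mad == 0:
        return []  # All values (nearly) identical; nothing to flag.
    return [t for t in totals if abs(t - median) / mad > threshold]


SAMPLE_TEXT = """
ACME Corp
Invoice Number: INV-1001
Date: 2024-03-15
Total: $1,250.00
"""

record = clean_record(extract_fields(SAMPLE_TEXT))
print(record)
print(find_anomalies([100.0, 110.0, 95.0, 105.0, 980.0]))
```

A median-based check is used here because, with only a handful of documents, a single extreme value inflates the mean and standard deviation enough to hide itself.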

Customization

The extraction patterns can be customized by modifying the regular expressions in the InvoiceExtractor and ReportExtractor classes to fit your specific PDF formats.
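
One way such a customization might look is shown below. The class InvoiceExtractorSketch is a hypothetical stand-in written for this example; the attribute name patterns and the method extract are assumptions and may not match the real InvoiceExtractor, so check src/pdf_extractor.py for the actual names before adapting it.

```python
import re


class InvoiceExtractorSketch:
    """Hypothetical stand-in; not the repository's InvoiceExtractor."""

    patterns = {
        "invoice_number": r"Invoice\s*#\s*:?\s*(\S+)",
        "total": r"\bTotal\s*:?\s*\$?([\d,]+\.\d{2})",
    }

    def extract(self, text: str) -> dict:
        """Apply every pattern and keep the first captured group, or None."""
        result = {}
        for field, pattern in self.patterns.items():
            match = re.search(pattern, text, re.IGNORECASE)
            result[field] = match.group(1) if match else None
        return result


class GermanInvoiceExtractor(InvoiceExtractorSketch):
    """Overrides only the patterns that differ for a German invoice layout."""

    patterns = {
        **InvoiceExtractorSketch.patterns,
        "invoice_number": r"Rechnung\s*Nr\.?\s*:?\s*(\S+)",
        "total": r"\bGesamt\s*:?\s*([\d.]+,\d{2})\s*EUR",
    }


text = "Rechnung Nr.: R-2024-007\nGesamt: 1.250,00 EUR"
fields = GermanInvoiceExtractor().extract(text)
print(fields)
```

Keeping the patterns in a class-level dictionary means a new PDF layout only needs a subclass with replacement expressions; the matching loop stays untouched.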

Testing

Run the tests with:

    python -m unittest discover tests

License

MIT
