GitHub - akulubala/document-extraction: Document Extraction system is a specialized solution designed to extract structured product information from various document types, including PDFs, DOCX files, Excel spreadsheets, and images.


Document Extraction

Document Extraction is an AI-powered tool for extracting structured data from unstructured documents (PDF, Word, Excel, images). It uses OpenAI and LlamaIndex for flexible, prompt-driven data extraction.

Features

  • Supports PDF, Word (DOCX), Excel (XLSX), and image files
  • Customizable extraction via prompt
  • Batch processing
  • Outputs JSON
  • FastAPI backend

Quick Start

  1. Install dependencies

    poetry install
  2. Configure environment

    • Set your OpenAI API key and extraction prompt.
  3. Start the API

    poetry run uvicorn main:app --reload
  4. Use the API

    • Upload documents and set your prompt via the /extract endpoint.
    • See the interactive API docs at http://localhost:8000/docs.
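
As a rough illustration of step 4, the sketch below builds a multipart upload for the /extract endpoint using only the Python standard library. The form field names `file` and `prompt` are assumptions, not taken from this repository; check the generated docs at http://localhost:8000/docs for the actual request schema before relying on them.

```python
import json
import uuid
import urllib.request
from pathlib import Path

# Host and port come from the Quick Start; the path is the /extract endpoint.
API_URL = "http://localhost:8000/extract"


def build_multipart(fields, files):
    """Encode text fields and files as multipart/form-data.

    fields: dict of name -> str
    files:  dict of name -> (filename, bytes)
    Returns (body_bytes, content_type_header_value).
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            (
                f"--{boundary}\r\n"
                f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
                f"{value}\r\n"
            ).encode("utf-8")
        )
    for name, (filename, content) in files.items():
        header = (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"; filename="{filename}"\r\n'
            "Content-Type: application/octet-stream\r\n\r\n"
        ).encode("utf-8")
        parts.append(header + content + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def build_extract_request(path, prompt, url=API_URL):
    """Build (but do not send) a POST request for one document."""
    doc = Path(path)
    body, content_type = build_multipart(
        fields={"prompt": prompt},                      # assumed field name
        files={"file": (doc.name, doc.read_bytes())},   # assumed field name
    )
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )


def extract(path, prompt, url=API_URL):
    """Send the request and decode the JSON reply (the API must be running)."""
    with urllib.request.urlopen(build_extract_request(path, prompt, url)) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the server from step 3 running, a call such as `extract("contract.pdf", "Extract the contract parties and the signing date from the document.")` would return the decoded JSON, provided the endpoint accepts these field names; `contract.pdf` is a placeholder file name.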

Example Prompt

prompt: "Extract the contract parties and the signing date from the document."
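
One possible way to wire the API key and prompt from step 2 into an application is to read both from environment variables. This is a minimal sketch, not the project's actual configuration code: `OPENAI_API_KEY` is the variable name the OpenAI SDK conventionally reads, while `EXTRACTION_PROMPT`, `Settings`, and `load_settings` are hypothetical names introduced here for illustration.

```python
import os
from dataclasses import dataclass

# Fallback prompt, copied from the example above.
DEFAULT_PROMPT = "Extract the contract parties and the signing date from the document."


@dataclass(frozen=True)
class Settings:
    openai_api_key: str
    extraction_prompt: str


def load_settings(env=None):
    """Read settings from a mapping (defaults to os.environ).

    EXTRACTION_PROMPT is an assumed variable name, not one documented
    by the project; the default prompt is used when it is unset or blank.
    """
    env = os.environ if env is None else env
    api_key = env.get("OPENAI_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    prompt = env.get("EXTRACTION_PROMPT", "").strip() or DEFAULT_PROMPT
    return Settings(openai_api_key=api_key, extraction_prompt=prompt)
```

Passing a plain dict, for example `load_settings({"OPENAI_API_KEY": "sk-..."})`, makes the function easy to exercise without touching the real environment.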

Tech Stack

  • FastAPI, OpenAI, LlamaIndex, OCR tools

Support

Open an issue for help or feedback.

DeepWiki

https://deepwiki.com/akulubala/document-extraction/1-overview
