Document Extraction

Document Extraction is an AI-powered tool that extracts structured data from unstructured documents (PDF, Word, Excel, and images). It uses OpenAI and LlamaIndex, and a prompt you supply controls which fields are extracted.

Features

- Supports PDF, Word, Excel (.xlsx), and image files
- Customizable extraction via a prompt
- Batch processing
- JSON output
- FastAPI backend

Quick Start

Install dependencies
```
poetry install
```
Configure environment
- Set your OpenAI API key and extraction prompt.
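
The README does not say how the key and prompt are supplied, so the following is only a minimal sketch of one common approach: reading both from environment variables. `OPENAI_API_KEY` is the variable the OpenAI Python library reads by default; `EXTRACTION_PROMPT`, `Settings`, and `load_settings` are hypothetical names used for illustration and may not match this project's actual configuration.

```python
import os
from dataclasses import dataclass

# Fallback prompt, taken from the "Example Prompt" section of this README.
DEFAULT_PROMPT = "Extract the contract parties and the signing date from the document."


@dataclass(frozen=True)
class Settings:
    """Hypothetical settings container; not part of the project's code."""
    openai_api_key: str
    extraction_prompt: str


def load_settings(env=None):
    """Read settings from a mapping (defaults to the process environment)."""
    env = os.environ if env is None else env
    api_key = env.get("OPENAI_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    # EXTRACTION_PROMPT is an assumed variable name, not confirmed by the project.
    prompt = env.get("EXTRACTION_PROMPT", DEFAULT_PROMPT)
    return Settings(openai_api_key=api_key, extraction_prompt=prompt)
```

Failing fast on a missing key surfaces a configuration mistake at startup rather than on the first extraction request.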
Start the API
```
poetry run uvicorn main:app --reload
```
Use the API
- Upload documents and set your prompt via the /extract endpoint.
- See the interactive API docs at http://localhost:8000/docs.
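
As a rough illustration of calling the endpoint from a script, here is a sketch of a client that uses only the Python standard library. The `/extract` path and the `localhost:8000` address come from this README; the form field names `files` and `prompt`, and the assumption that the response body is JSON, are guesses. Check the generated docs at `/docs` for the real request schema.

```python
import json
import uuid
import urllib.request
from pathlib import Path

API_URL = "http://localhost:8000/extract"


def build_multipart(fields, files, boundary=None):
    """Encode text fields and (field, filename, bytes) files as multipart/form-data.

    Returns the request body and the matching Content-Type header value.
    """
    boundary = boundary or uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n".encode("utf-8")
        )
    for name, filename, content in files:
        header = (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"; filename="{filename}"\r\n'
            "Content-Type: application/octet-stream\r\n\r\n"
        ).encode("utf-8")
        parts.append(header + content + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def extract(paths, prompt, url=API_URL):
    """POST the given documents and prompt, returning the decoded JSON response."""
    # "files" and "prompt" are assumed field names, not confirmed by the project.
    files = [("files", Path(p).name, Path(p).read_bytes()) for p in paths]
    body, content_type = build_multipart({"prompt": prompt}, files)
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Example call (requires the API to be running locally):
# result = extract(["contract.pdf"], "Extract the contract parties and the signing date.")
# print(json.dumps(result, indent=2))
```

Passing several paths to `extract` sends them in one request, which is one plausible way to use batch processing if the endpoint accepts multiple files.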

Example Prompt

prompt: "Extract the contract parties and the signing date from the document."
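
The README states that output is JSON but does not document its shape. The sketch below shows how a caller might validate a result for the prompt above. The keys `parties` and `signing_date`, the ISO date format, and the sample payload are all invented for illustration; the service's actual output may differ.

```python
import json
from datetime import date


def parse_contract_result(raw):
    """Validate a hypothetical extraction result and return (parties, signing_date)."""
    data = json.loads(raw)
    parties = data["parties"]
    if not isinstance(parties, list) or not all(isinstance(p, str) for p in parties):
        raise ValueError("'parties' must be a list of strings")
    # Assumes the date is returned as an ISO 8601 string (YYYY-MM-DD).
    signing_date = date.fromisoformat(data["signing_date"])
    return parties, signing_date


# Made-up sample payload, not real output from the service.
sample = '{"parties": ["Acme Corp", "Globex Ltd"], "signing_date": "2024-03-15"}'
parties, signed_on = parse_contract_result(sample)
```

Asking for a specific output format in the prompt itself (for example, naming the keys and the date format) tends to make this kind of validation more reliable.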

Tech Stack

FastAPI, OpenAI, LlamaIndex, OCR tools

Support

Open an issue for help or feedback.

DeepWiki

https://deepwiki.com/akulubala/document-extraction/1-overview
