GitHub - yutannihilation/extractous at v0.1.3

Fast and efficient unstructured data extraction. Written in Rust with bindings for many languages.


yutannihilation/extractous


Extractous

Extractous offers a unified approach for detecting and extracting metadata and text content from various document types such as PDF, Word, HTML, and many other formats. Our goal is to deliver an efficient, comprehensive solution with bindings for many programming languages.

Why Extractous?

Extractous was mainly inspired by the Unstructured Python library. While Unstructured offers a good solution for parsing unstructured content, we see two main issues with it:

  • Performance: data processing is mainly a CPU-bound problem, and Python is not the best choice for such tasks because its Global Interpreter Lock (GIL) makes it hard to utilize multiple cores.
  • Unstructured is becoming more of an LLM framework than a text and metadata parsing library.

Extractous will focus only on the text and metadata extraction part. The core is written in Rust, leveraging its memory safety, multithreading, and zero-cost abstractions. Extractous will provide bindings for many programming languages.
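
To illustrate the multithreading point, here is a minimal sketch of spreading CPU-bound work across cores with the Rust standard library. It does not use the Extractous API: `count_words` is a hypothetical stand-in for a real extraction step, and `count_words_parallel` is an illustrative helper that is not part of this repository.

```rust
use std::thread;

// Hypothetical stand-in for a CPU-bound extraction step (NOT the Extractous API).
fn count_words(document: &str) -> usize {
    document.split_whitespace().count()
}

// Assumed helper for illustration: runs `count_words` for each document on its
// own scoped thread and returns the results in input order.
fn count_words_parallel(documents: &[&str]) -> Vec<usize> {
    thread::scope(|scope| {
        // Spawn one worker per document; scoped threads may borrow `documents`.
        let handles: Vec<_> = documents
            .iter()
            .map(|doc| scope.spawn(move || count_words(doc)))
            .collect();

        // Join in spawn order so the output lines up with the input.
        handles
            .into_iter()
            .map(|handle| handle.join().expect("worker thread panicked"))
            .collect()
    })
}

fn main() {
    let documents = ["first sample document", "second one"];
    let counts = count_words_parallel(&documents);
    println!("{:?}", counts);
}
```

One thread per document keeps the sketch short. For large batches, a bounded worker pool would likely be a better fit.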

Features

  • Clear, simple API for extracting text and metadata content.
  • Support for many file formats.
  • Strives to be efficient and fast.
  • Comprehensive documentation and examples to help you get started quickly.

Bindings

Name Release
Rust Core
Python Binding

Supported file formats

File Format Rust Core Python Binding
pdf
csv


Languages

  • Rust 73.7%
  • Java 13.3%
  • Python 9.6%
  • Shell 1.5%
  • Jupyter Notebook 1.0%
  • Dockerfile 0.9%