Extractous offers a unified approach for detecting and extracting metadata and text content from various document types such as PDF, Word, HTML, and many other formats. Our goal is to deliver an efficient, comprehensive solution with bindings for many programming languages.
Extractous was mainly inspired by the Unstructured Python library. While Unstructured offers a good solution for parsing unstructured content, we see two main issues with it:
- Performance: data processing is mainly a CPU-bound problem, and Python is not the best choice for such tasks because its Global Interpreter Lock (GIL) makes it hard to utilize multiple cores.
- Scope: Unstructured is becoming more of an LLM framework than just a text and metadata parsing library.
Extractous will focus only on text and metadata extraction. The core is written in Rust, leveraging its memory safety, multithreading, and zero-cost abstractions. Extractous will provide bindings for many programming languages.
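To illustrate why multithreading matters for CPU-bound document processing, here is a minimal sketch in plain Rust using only the standard library. It is not the Extractous API: `count_words` and `count_words_parallel` are hypothetical names, and word counting merely stands in for a real parsing step. The sketch splits a batch of documents into one chunk per worker and processes the chunks on separate OS threads, which can run on multiple cores at once.

```rust
use std::thread;

/// Hypothetical stand-in for a CPU-bound parsing step:
/// counts whitespace-separated words in one document.
fn count_words(text: &str) -> usize {
    text.split_whitespace().count()
}

/// Hypothetical helper: processes `docs` on up to `workers` threads
/// and returns one word count per document, in input order.
fn count_words_parallel(docs: &[String], workers: usize) -> Vec<usize> {
    if docs.is_empty() {
        return Vec::new();
    }
    let workers = workers.max(1);
    // Ceiling division so every document lands in some chunk.
    let chunk_size = (docs.len() + workers - 1) / workers;

    // Scoped threads may borrow `docs` directly, so no cloning or Arc is needed.
    thread::scope(|scope| {
        // Spawn all workers first, then join them in order.
        let handles: Vec<_> = docs
            .chunks(chunk_size)
            .map(|chunk| {
                scope.spawn(move || chunk.iter().map(|d| count_words(d)).collect::<Vec<_>>())
            })
            .collect();

        handles
            .into_iter()
            .flat_map(|h| h.join().expect("worker thread panicked"))
            .collect()
    })
}
```

Calling `count_words_parallel(&docs, 4)` on a `Vec<String>` of document texts returns the counts in the same order as the input. Because chunks are joined in the order they were spawned, no extra sorting step is needed.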
- Clear, simple API for extracting text and metadata content.
- Support for many file formats.
- Strives to be efficient and fast.
- Comprehensive documentation and examples to help you get started quickly.
Name | Release |
---|---|
Rust Core | |
Python Binding | |
File Format | Rust Core | Python Binding |
---|---|---|
✅ | ✅ | |
csv | ✅ | ✅ |