chunk-my-docs

Lumina x Trieve cookup. SOTA PDF extraction.

woooo lfg

How to chunk a PDF??

Local Dev Guide

Our setup runs the Rust actix-web server locally on bare metal and everything else in Docker. pdla (pdf-document-layout-analysis) is meant to run on a GPU, so expect it to be slow when running locally on a CPU.

1. Set up ENVs

cp .env.docker-compose .env

cp .env.chunkmydocs ./chunkmydocs/.env

cp .env.pyscripts ./pyscripts/.env
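If you would rather script this step, the three copies above can be sketched in Python. This is a minimal sketch, not part of the repo: the function name copy_env_files and the skip-if-present behavior are assumptions (plain cp would overwrite), chosen so that re-running it does not clobber a .env you have already edited.

```python
import shutil
from pathlib import Path

# Template -> destination pairs, taken from the three cp commands above.
ENV_FILES = {
    ".env.docker-compose": ".env",
    ".env.chunkmydocs": "chunkmydocs/.env",
    ".env.pyscripts": "pyscripts/.env",
}


def copy_env_files(repo_root=".", overwrite=False):
    """Copy each env template into place and return the files written."""
    root = Path(repo_root)
    written = []
    for template, destination in ENV_FILES.items():
        dst = root / destination
        if dst.exists() and not overwrite:
            # Assumption: an existing .env may already hold edits, so keep it.
            continue
        shutil.copyfile(root / template, dst)
        written.append(dst)
    return written


if __name__ == "__main__":
    # Run from the repository root.
    for path in copy_env_files():
        print(f"wrote {path}")
```

Pass overwrite=True to get the same behavior as the cp commands.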

2. Run the things

docker compose up -d

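Then run the server and the task worker. The server stays in the foreground, so start the task worker from a second terminal in the same chunkmydocs directory: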

cd chunkmydocs
cargo run
cargo run --bin task-processor

3. Get local API key

Run the following curl command to get an API key:

curl -X POST http://localhost:8000/api_key \
  -H "Content-Type: application/json" \
  -d '{
    "user_id": "example_user_id",
    "email": "givme@apikey.com",
    "access_level": "OWNER",
    "expires_at": "2023-12-31T23:59:59Z",
    "initial_usage": 0,
    "usage_limit": 100000,
    "usage_type": "FREE",
    "service_type": "EXTRACTION"
  }'
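The same request can be sent from Python using only the standard library. This is a rough sketch under a few assumptions: the function names are made up for illustration, the payload is copied verbatim from the curl command, and the response body is printed as-is because its format is not described here. The expires_at value from the curl example is in the past; whether the server enforces it is not stated, so the sketch exposes it as a parameter in case you need a later date.

```python
import json
import urllib.request


def build_api_key_request(base_url="http://localhost:8000",
                          expires_at="2023-12-31T23:59:59Z"):
    """Build the POST /api_key request with the payload from the curl example."""
    payload = {
        "user_id": "example_user_id",
        "email": "givme@apikey.com",
        "access_level": "OWNER",
        "expires_at": expires_at,
        "initial_usage": 0,
        "usage_limit": 100000,
        "usage_type": "FREE",
        "service_type": "EXTRACTION",
    }
    return urllib.request.Request(
        f"{base_url}/api_key",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def request_api_key(**kwargs):
    """Send the request and return the raw response body as text."""
    with urllib.request.urlopen(build_api_key_request(**kwargs)) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Requires the server from step 2 to be running on localhost:8000.
    print(request_api_key())
```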

Copy the resulting key.

Paste the key into pyscripts/.env as the value for INGEST_SERVER__API_KEY.
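Pasting by hand works fine; if you want to automate it, here is a small sketch that sets INGEST_SERVER__API_KEY in pyscripts/.env. The helper name set_env_value is hypothetical, and it assumes the file uses plain KEY=value lines, which is the usual .env layout but is not confirmed here.

```python
import sys
from pathlib import Path


def set_env_value(env_path, key, value):
    """Replace the KEY=... line in a .env file, or append it if missing."""
    path = Path(env_path)
    lines = path.read_text(encoding="utf-8").splitlines() if path.exists() else []
    new_line = f"{key}={value}"
    replaced = False
    for i, line in enumerate(lines):
        # Match on the name left of the first "="; commented lines do not match.
        if line.split("=", 1)[0].strip() == key:
            lines[i] = new_line
            replaced = True
    if not replaced:
        lines.append(new_line)
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")


if __name__ == "__main__":
    # Run from the repository root, passing the API key as the only argument.
    if len(sys.argv) != 2:
        raise SystemExit("expected exactly one argument: the API key")
    set_env_value("pyscripts/.env", "INGEST_SERVER__API_KEY", sys.argv[1])
```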

4. Test that things are working

mkdir -p pyscripts/input pyscripts/output

Then put a PDF into the ./pyscripts/input folder. I recommend "Justice Department Sues Apple for Monopolizing Smartphone Markets".

cd pyscripts && python3 main.py

Once that finishes, you can view the resulting chunks in pyscripts/output/{file_name}-Fast/bounding_boxes.json.
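For a quick sanity check of the output, something like the following can list every bounding_boxes.json and report its top-level shape. This is only a sketch: the schema of bounding_boxes.json is not documented here, so the code deliberately inspects nothing beyond the top-level JSON type and its length, and both function names are made up for illustration.

```python
import json
from pathlib import Path


def find_outputs(output_dir="pyscripts/output"):
    """Return every {file_name}-Fast/bounding_boxes.json under output_dir."""
    return sorted(Path(output_dir).glob("*-Fast/bounding_boxes.json"))


def summarize_json(path):
    """Return (top-level JSON type name, number of entries or None)."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    size = len(data) if isinstance(data, (list, dict)) else None
    return type(data).__name__, size


if __name__ == "__main__":
    # Run from the repository root after main.py has finished.
    for path in find_outputs():
        kind, size = summarize_json(path)
        print(f"{path}: top-level {kind}, {size} entries")
```

If nothing is printed, no output folder matched, which usually means main.py did not finish or the input folder was empty.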

Roadmap

integrate with Trieve
add support for Grobid
make a diagram
explain how insanely awesome RRQ is
Kube deploy guide similar to trieve/self-hosting.md
