Rust LLM Tool with Local PDF Knowledge Base

A self-hosted LLM API powered by Rust, providing local PDF document indexing for Retrieval-Augmented Generation (RAG). Built with Axum, SQLite, and Ollama.

⚠️ Learning Project Notice: This is my learning project to explore Rust web development, async programming, and LLM integration. For production use, consider established RAG solutions instead.

🎯 Purpose & Learning Goals

This project demonstrates:

  • Building async web services with Rust and Axum
  • Implementing basic RAG (Retrieval-Augmented Generation)
  • Integrating with local LLMs via Ollama
  • Working with SQLite in Rust
  • PDF processing and text extraction
  • Error handling and logging in Rust

πŸ—οΈ Architecture

Key Components

  • Web Server: Axum for routing and request handling
  • Database: SQLite with async support via SQLx
  • LLM Integration: Ollama for local model inference
  • PDF Processing: text extraction via the system pdftotext utility (poppler-utils)
  • Error Handling: Custom error types with thiserror
  • Logging: env_logger for structured logging
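A rough sketch of how these components could be wired together, assuming Axum 0.7-style APIs; the handler below is a placeholder, not the project's actual code:

use axum::{extract::State, routing::post, Json, Router};
use serde::{Deserialize, Serialize};
use sqlx::SqlitePool;

#[derive(Deserialize)]
struct QueryRequest { prompt: String }

#[derive(Serialize)]
struct QueryResponse { response: String }

// Placeholder handler: the real one retrieves PDF context and calls Ollama.
async fn handle_query(
    State(_pool): State<SqlitePool>,
    Json(req): Json<QueryRequest>,
) -> Json<QueryResponse> {
    Json(QueryResponse { response: format!("echo: {}", req.prompt) })
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    env_logger::init();

    // Async SQLite connection pool, shared with every handler via Axum state.
    let pool = SqlitePool::connect("sqlite://./data/database.db").await?;

    let app = Router::new()
        .route("/query", post(handle_query))
        .with_state(pool);

    let listener = tokio::net::TcpListener::bind("127.0.0.1:8000").await?;
    axum::serve(listener, app).await?;
    Ok(())
}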

Data Flow

  1. PDFs are processed and stored in SQLite
  2. User submits query via REST API
  3. Relevant context is retrieved from stored PDF content
  4. Query + context sent to LLM via Ollama
  5. Response returned to user
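In code, that flow could look roughly like the sketch below. The documents table schema, the retrieval query, and the model name are assumptions for illustration, not this repository's actual implementation; the /api/generate call is Ollama's standard local generate endpoint.

use serde_json::json;
use sqlx::SqlitePool;

/// Retrieve stored PDF text related to the prompt, then ask Ollama.
async fn answer(pool: &SqlitePool, prompt: &str) -> Result<String, Box<dyn std::error::Error>> {
    // 1. Naive keyword retrieval from the indexed PDF content
    //    (hypothetical `documents(content)` table).
    let rows: Vec<(String,)> = sqlx::query_as(
        "SELECT content FROM documents WHERE content LIKE '%' || ?1 || '%' LIMIT 3",
    )
    .bind(prompt)
    .fetch_all(pool)
    .await?;
    let context: String = rows.into_iter().map(|r| r.0).collect::<Vec<_>>().join("\n---\n");

    // 2. Combine the retrieved context with the user's question.
    let full_prompt = format!("Context:\n{context}\n\nQuestion: {prompt}");

    // 3. Send it to the local Ollama instance (default port 11434).
    let body = json!({ "model": "gemma", "prompt": full_prompt, "stream": false });
    let resp: serde_json::Value = reqwest::Client::new()
        .post("http://127.0.0.1:11434/api/generate")
        .json(&body)
        .send()
        .await?
        .json()
        .await?;

    // 4. The generated text comes back in the `response` field.
    Ok(resp["response"].as_str().unwrap_or_default().to_string())
}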

🚀 Getting Started

Prerequisites

# Install Rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Install SQLx CLI
cargo install sqlx-cli

# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh

# Install pdftotext (poppler-utils)
# For Ubuntu/Debian:
sudo apt-get install poppler-utils
# For macOS:
brew install poppler
# For Fedora:
sudo dnf install poppler-utils

Setup

# Clone repository
git clone https://github.com/cmcconnell1/rust-llm-tool.git
cd rust-llm-tool

# Run setup script
bash setup.sh

# Start server
RUST_LOG=info cargo run --release

🔧 Configuration

Edit config.yaml to customize:

default_model: "gemma"  # Ollama model to use
database_url: "sqlite://./data/database.db"
api_port: 8000
pdf_directory: "./pdfs"
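These values can be deserialized straight into a typed struct; a minimal sketch assuming serde and the serde_yaml crate, with field names mirroring the file above (the project's actual loading code may differ):

use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct Config {
    default_model: String,
    database_url: String,
    api_port: u16,
    pdf_directory: String,
}

fn load_config(path: &str) -> Result<Config, Box<dyn std::error::Error>> {
    let text = std::fs::read_to_string(path)?;
    // serde_yaml parses the YAML file directly into the typed struct.
    Ok(serde_yaml::from_str(&text)?)
}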

📑 API Endpoints

Query LLM

curl -X POST "http://127.0.0.1:8000/query" \
     -H "Content-Type: application/json" \
     -d '{"prompt": "What is Rust?"}'

Batch Query

curl -X POST "http://127.0.0.1:8000/batch-query" \
     -H "Content-Type: application/json" \
     -d '[
         {"prompt": "What is Rust?"},
         {"prompt": "Explain async/await"}
     ]'
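Internally, a batch endpoint can fan the prompts out concurrently; a sketch using futures::future::join_all, where query_llm is a hypothetical stand-in for the single-prompt logic:

use futures::future::join_all;

// Hypothetical single-prompt helper; stands in for the real query logic.
async fn query_llm(prompt: String) -> String {
    format!("answer to: {prompt}")
}

/// Run all prompts concurrently and collect the answers in request order.
async fn batch_query(prompts: Vec<String>) -> Vec<String> {
    join_all(prompts.into_iter().map(query_llm)).await
}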

Validate PDF Knowledge

curl http://127.0.0.1:8000/validate-pdfs

📚 Code Structure

Key Files

  • src/main.rs: Server setup and routing
  • src/api.rs: API endpoint handlers
  • src/db.rs: Database initialization and queries
  • src/pdf.rs: PDF processing logic
  • src/ollama.rs: LLM integration
  • src/config.rs: Configuration management

Error Handling

  • Custom error types for each module
  • Proper error propagation using Result
  • Structured logging with different levels
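A typical shape for this pattern in Rust looks roughly like the following; the error variants are illustrative, not the project's actual types:

use thiserror::Error;

/// Module-level error type; `#[from]` gives automatic conversion so `?` works.
#[derive(Debug, Error)]
enum AppError {
    #[error("database error: {0}")]
    Db(#[from] sqlx::Error),
    #[error("I/O error: {0}")]
    Io(#[from] std::io::Error),
    #[error("PDF could not be processed: {0}")]
    Pdf(String),
}

fn read_pdf_bytes(path: &str) -> Result<Vec<u8>, AppError> {
    // `?` propagates the std::io::Error and converts it into AppError::Io.
    Ok(std::fs::read(path)?)
}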

Async/Await

  • Uses Tokio runtime for async operations
  • Proper connection pooling with SQLx
  • Async HTTP clients for Ollama integration
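For example, a bounded connection pool might be created like this sketch (the pool size is an arbitrary choice, not taken from the project):

use sqlx::sqlite::{SqlitePool, SqlitePoolOptions};

/// Create a bounded async connection pool that Axum handlers share via state.
async fn init_pool(database_url: &str) -> Result<SqlitePool, sqlx::Error> {
    SqlitePoolOptions::new()
        .max_connections(5) // small cap is plenty for a local single-user service
        .connect(database_url)
        .await
}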

🔜 Future Enhancements

  • Implement vector embeddings for better retrieval
  • Add response streaming support
  • Improve PDF chunking strategy
  • Add caching layer
  • Create web dashboard

📖 Learning Resources

📚 PDF Processing

Directory Structure

PDFs are stored in the pdfs/ directory with subdirectories for different categories:

pdfs/
├── cloud/
├── programming/
├── k8s/
├── SQL/
└── security/

Processing PDFs

The system recursively processes PDFs from all subdirectories:

# Process all PDFs (with detailed logging)
RUST_LOG=info cargo run --bin process_pdfs

# Or use the validation endpoint
curl http://127.0.0.1:8000/validate-pdfs
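Discovering the files recursively needs nothing beyond the standard library; a minimal sketch (the real binary may walk the tree differently):

use std::fs;
use std::path::{Path, PathBuf};

/// Walk a directory tree and collect every `.pdf` file found in it.
fn find_pdfs(dir: &Path, out: &mut Vec<PathBuf>) -> std::io::Result<()> {
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        if path.is_dir() {
            find_pdfs(&path, out)?; // recurse into subdirectories
        } else if path.extension().map_or(false, |e| e.eq_ignore_ascii_case("pdf")) {
            out.push(path);
        }
    }
    Ok(())
}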

PDF Processing Details

System Requirements

  • pdftotext utility (part of poppler-utils) must be installed on your system
  • Maximum PDF file size: 100MB
  • Minimum content length: 100 characters

Why pdftotext?

The project uses the system's pdftotext utility instead of pure Rust PDF processing because:

  1. PDF format complexity: The PDF specification is extensive and complex, making pure Rust implementations either incomplete or unstable for production use
  2. Reliability: Popular system utilities like pdftotext have been battle-tested for years and handle various PDF variants and edge cases
  3. Performance: Native system utilities often provide better performance than pure Rust alternatives
  4. Security: System PDF utilities typically include security patches and handle malformed PDFs safely

While pure Rust alternatives exist (like pdf-extract crate), they often:

  • Lack support for complex PDF features
  • Have issues with certain PDF encodings
  • May crash on malformed PDFs
  • Don't handle all PDF security features

The tradeoff is the external dependency, but the benefits in reliability and feature support outweigh this drawback for this learning project.
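Calling the system utility from Rust is then just a matter of spawning a process. A sketch of that idea (the exact invocation is an assumption; passing "-" as the output file makes pdftotext write the extracted text to stdout):

use std::path::Path;
use std::process::Command;

/// Extract plain text from a PDF by shelling out to poppler's pdftotext.
fn extract_text(pdf: &Path) -> std::io::Result<String> {
    // "-" as the output file sends the text to stdout instead of a .txt file.
    let output = Command::new("pdftotext").arg(pdf).arg("-").output()?;
    if !output.status.success() {
        return Err(std::io::Error::new(
            std::io::ErrorKind::Other,
            format!("pdftotext failed for {}", pdf.display()),
        ));
    }
    Ok(String::from_utf8_lossy(&output.stdout).into_owned())
}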

Known PDF Issues

  • Some PDFs may contain corrupt data streams
  • Unicode mismatches with certain ligatures (fi, fl, ff)
  • Large PDFs (>100MB) are skipped
  • PDFs with insufficient text content are skipped

PDF Processing Rules

  • Maximum file size: 100MB
  • Minimum content length: 100 characters
  • Supported formats: Standard PDF files
  • Unicode normalization is applied
  • Common ligatures are converted to standard characters
  • Relative paths are preserved in database
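The size, length, and ligature rules above could be enforced with a check along these lines; the thresholds are copied from the list, but the function itself is illustrative:

const MAX_PDF_BYTES: u64 = 100 * 1024 * 1024; // 100MB file-size cap
const MIN_CONTENT_CHARS: usize = 100;         // skip PDFs with too little text

/// Replace the common typographic ligatures with their ASCII equivalents.
fn normalize_ligatures(text: &str) -> String {
    text.replace('\u{FB01}', "fi") // ﬁ
        .replace('\u{FB02}', "fl") // ﬂ
        .replace('\u{FB00}', "ff") // ﬀ
}

/// Apply the file-size and minimum-content rules to extracted text.
fn accept_document(file_size: u64, text: &str) -> Option<String> {
    if file_size > MAX_PDF_BYTES {
        return None; // large PDFs are skipped
    }
    let cleaned = normalize_ligatures(text);
    if cleaned.chars().count() < MIN_CONTENT_CHARS {
        return None; // not enough text to index
    }
    Some(cleaned)
}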

Troubleshooting PDF Processing

If PDFs fail to process:

  1. Check file permissions
  2. Verify PDF is not corrupted
  3. Ensure file size is under 100MB
  4. Check logs for specific error messages
  5. Try processing individual files for debugging

πŸ“ License

MIT License
