“Compact brains, sharp answers.”
A unified, efficient RAG pipeline optimized for edge & local deployments.
xRAG (eXtreme Retrieval Augmented Generation) is a token-compressed, quantized, and fully local RAG system. It’s engineered for low-latency, on-device question answering and document retrieval—perfect for applications that demand speed, privacy, and portability.
This project draws from cutting-edge RAG and efficient model deployment research:
- xRAG (2024): 1-token RAG with 17x compression, using an MLP bridge to project each retrieved document into a single token.
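For intuition, here is a minimal sketch of that bridge idea, assuming a frozen retriever that emits 768-dimensional document embeddings and an LLM with 4096-dimensional token embeddings. The class name, dimensions, and two-layer MLP shape are illustrative assumptions, not this repo's actual code:

```python
import torch
import torch.nn as nn

class ProjectionBridge(nn.Module):
    """Projects one document embedding from the retriever's space into the
    LLM's token-embedding space, so a whole document costs a single token.
    Dimensions here are assumptions for illustration."""

    def __init__(self, d_retr: int = 768, d_model: int = 4096):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_retr, d_model),
            nn.GELU(),
            nn.Linear(d_model, d_model),
        )

    def forward(self, doc_embedding: torch.Tensor) -> torch.Tensor:
        # (batch, d_retr) -> (batch, 1, d_model): one soft "document token"
        return self.mlp(doc_embedding).unsqueeze(1)

# The soft token is prepended to the prompt's token embeddings, so the
# generator sees the retrieved document as one extra sequence position.
bridge = ProjectionBridge()
doc_vec = torch.randn(1, 768)   # placeholder retriever output
doc_token = bridge(doc_vec)     # shape: (1, 1, 4096)
```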
Most RAG pipelines:
- Are too bulky for local use
- Depend on cloud services
- Struggle with low-latency, high-efficiency deployment
xRAG addresses these problems by running fully locally, compressing retrieved context, and minimizing memory & compute overhead.
Key features:
- ✅ Fully Local: Works offline, on-device, no cloud required.
- 🔻 Token Compression: Fewer tokens, faster inference.
- ⚡ Quantized Models: Lightweight, edge-ready transformers (see the 4-bit loading sketch below).
- 🧩 Modular Components: Swap retrievers, bridges, and generators (see the pipeline sketch after this list).
- 📚 Document-Aware: Pulls relevant context before answering.
- 🧠 Context Memory: Maintains logical conversation threads.
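As a rough illustration of the quantized-models feature, this is one way a 4-bit generator could be loaded with Hugging Face transformers and bitsandbytes. The model name is a placeholder, and the actual loading path used by xRAG may differ:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit NF4 quantization keeps the generator small enough for edge hardware.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model_name = "mistralai/Mistral-7B-Instruct-v0.2"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=quant_config
)
```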
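And a minimal sketch of what the modular, document-aware flow could look like. The `Retriever`/`Generator` protocols, class names, and prompt format are assumptions for illustration, not this project's actual interfaces:

```python
from typing import Protocol

class Retriever(Protocol):
    def retrieve(self, query: str, k: int) -> list[str]: ...

class Generator(Protocol):
    def generate(self, prompt: str) -> str: ...

class RAGPipeline:
    """Document-aware QA: retrieve relevant context first, then answer.
    Retriever and generator are injected, so either can be swapped."""

    def __init__(self, retriever: Retriever, generator: Generator):
        self.retriever = retriever
        self.generator = generator
        self.history: list[str] = []  # lightweight context memory

    def ask(self, question: str, k: int = 3) -> str:
        docs = self.retriever.retrieve(question, k)
        prompt = "\n".join(self.history + docs + [f"Question: {question}"])
        answer = self.generator.generate(prompt)
        self.history.append(f"Q: {question}\nA: {answer}")
        return answer
```

Because both components sit behind small interfaces, swapping a dense retriever for BM25, or one quantized generator for another, leaves the pipeline logic untouched.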
Example use cases:
- 🧑‍🏫 AI tutors (offline, classroom-ready)
- 🔐 Private assistants (no cloud dependencies)
- 🧭 Search across personal documents
- 🧠 Lightweight research copilots