A Retrieval-Augmented Generation (RAG) system for semantic document search and embedding management.
- Document processing and embedding generation using BERT
- Support for multiple file formats (PDF, DOCX, TXT, Python files)
- Semantic search with filtering by file type and similarity threshold
- Web-based user interface for document management
- Document preview functionality
- Python 3.8+
- Flask
- PyTorch
- Transformers (Hugging Face)
- SQLite3
- python-docx
- PyPDF2
- Clone the repository:
git clone https://github.com/yourusername/RAGfus.git
cd RAGfus
- Install requirements:
pip install -r requirements.txt
- Run the application:
python app.py
The application will be available at http://localhost:5000
The web interface provides the following functionality:
- Upload individual documents
- Process entire directories
- Search documents semantically
- Preview document content
- Manage documents (delete, view)
- POST /upload: Upload and process a document
- POST /process_directory: Process all documents in a directory
- POST /search: Find documents semantically similar to a query
- GET /documents: List all documents in the database
- GET /documents/{id}/preview: Preview document content
- DELETE /documents/{id}: Delete a document
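As a rough illustration of calling the search endpoint listed above from a script, the sketch below builds a POST /search request with the standard library. The README does not document the request or response format, so the JSON field names (`query`, `file_type`, `threshold`), the JSON response, and the helper names `build_search_request` and `send` are all assumptions made for this example.

```python
import json
import urllib.request

BASE_URL = "http://localhost:5000"  # address given in this README


def build_search_request(query, file_type=None, threshold=None):
    """Build (but do not send) a POST /search request.

    ASSUMPTION: the endpoint accepts a JSON body with these field names.
    Check app.py for the actual parameter names before relying on this.
    """
    payload = {"query": query}
    if file_type is not None:
        payload["file_type"] = file_type
    if threshold is not None:
        payload["threshold"] = threshold
    return urllib.request.Request(
        f"{BASE_URL}/search",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send a prepared request and decode the body.

    ASSUMPTION: the server answers with JSON.
    """
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    req = build_search_request("vector databases", file_type="pdf", threshold=0.5)
    print(req.get_method(), req.full_url)
    # With the application running, the request could be sent with:
    # print(send(req))
```

Building the request separately from sending it makes it easy to inspect the URL and body before the server is running.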
RAGfus uses BERT embeddings to represent documents semantically. When a document is uploaded or processed, the system:
- Extracts text content from the document
- Generates embeddings using a BERT model
- Stores the document and its embedding in a SQLite database
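The storage step above can be sketched as follows. This is a minimal illustration, not the project's actual schema: the table layout, the column names, and the helpers `init_db`, `pack_embedding`, `unpack_embedding`, `store_document`, and `load_embedding` are hypothetical. The embedding is taken as a plain list of floats that a BERT model would have produced earlier, and it is serialized to a BLOB of 32-bit floats.

```python
import sqlite3
from array import array


def init_db(conn):
    """Create a documents table (hypothetical schema for this sketch)."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS documents (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               filename TEXT NOT NULL,
               file_type TEXT NOT NULL,
               content TEXT NOT NULL,
               embedding BLOB NOT NULL
           )"""
    )


def pack_embedding(vector):
    """Serialize a sequence of floats to bytes as 32-bit floats."""
    return array("f", vector).tobytes()


def unpack_embedding(blob):
    """Inverse of pack_embedding: bytes back to a list of floats."""
    values = array("f")
    values.frombytes(blob)
    return list(values)


def store_document(conn, filename, file_type, content, embedding):
    """Insert one document with its embedding and return the new row id."""
    cursor = conn.execute(
        "INSERT INTO documents (filename, file_type, content, embedding) "
        "VALUES (?, ?, ?, ?)",
        (filename, file_type, content, pack_embedding(embedding)),
    )
    conn.commit()
    return cursor.lastrowid


def load_embedding(conn, doc_id):
    """Fetch and decode the embedding for a document, or None if missing."""
    row = conn.execute(
        "SELECT embedding FROM documents WHERE id = ?", (doc_id,)
    ).fetchone()
    return None if row is None else unpack_embedding(row[0])


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    init_db(conn)
    # The vector here is a stand-in; a real one would come from the BERT model.
    doc_id = store_document(conn, "notes.txt", "txt", "hello world", [0.5, -1.0, 2.0])
    print(doc_id, load_embedding(conn, doc_id))
```

Storing the vector as a BLOB keeps each document in a single row. The trade-off is that similarity cannot be computed in SQL, so embeddings have to be loaded into Python for comparison.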
When searching, RAGfus:
- Generates an embedding for the search query
- Computes similarity between the query and all documents
- Returns the most similar documents based on cosine similarity
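The ranking step above can be sketched in plain Python. This is an illustrative version under stated assumptions, not the project's code: the function names, the dictionary layout of each document, and the `top_k` parameter are hypothetical. It also shows where the file type and similarity threshold filters mentioned in the feature list could plausibly be applied.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def search(query_embedding, documents, file_type=None, threshold=0.0, top_k=5):
    """Rank documents by cosine similarity to the query embedding.

    `documents` is a list of dicts with 'filename', 'file_type', and
    'embedding' keys (a hypothetical layout chosen for this sketch).
    """
    results = []
    for doc in documents:
        # Optional filter by file type, e.g. "pdf" or "txt".
        if file_type is not None and doc["file_type"] != file_type:
            continue
        score = cosine_similarity(query_embedding, doc["embedding"])
        # Keep only documents at or above the similarity threshold.
        if score >= threshold:
            results.append((score, doc["filename"]))
    # Highest similarity first.
    results.sort(key=lambda item: item[0], reverse=True)
    return results[:top_k]


if __name__ == "__main__":
    docs = [
        {"filename": "a.txt", "file_type": "txt", "embedding": [1.0, 0.0]},
        {"filename": "b.pdf", "file_type": "pdf", "embedding": [0.0, 1.0]},
        {"filename": "c.txt", "file_type": "txt", "embedding": [1.0, 1.0]},
    ]
    for score, name in search([1.0, 0.0], docs, threshold=0.5):
        print(f"{name}: {score:.3f}")
```

Comparing the query against every stored embedding is a linear scan. That is simple and adequate for small collections, while larger ones would typically call for a vectorized or indexed approach.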
License: MIT
- Agustín Conesa