A document processing and RAG library transforming text into knowledge graphs with user-driven entity extraction for natural document conversation.

🚀 docuGraphRAG.js

A document processing and RAG (Retrieval-Augmented Generation) library that transforms unstructured document text into knowledge graphs, extracting entities and relations according to an analysis scope that the user supplies. It uses a graph database for context retrieval and lets you converse with your documents through a chat interface.


📖 Project Evolution

docuGraphRAG.js is the successor to docuRAG.js. It marks an architectural shift in how document context and relationships are handled:

  • Hybrid search combining vector, text, and graph-based approaches
  • Entity extraction and relationship modeling based on the user's scope
  • Efficient storage and retrieval using Neo4j
  • Streaming responses for real-time chat interactions
  • Configurable search strategies and weights

Features 🌟

1. Document Processing

  • Splits documents into manageable chunks
  • Generates vector embeddings for each chunk
  • Stores content in Neo4j for efficient retrieval

2. Hybrid Search System

  • Multiple search strategies working together:
    • Vector similarity search (40% weight)
    • Full-text search (30% weight)
    • Graph-based search (30% weight)
  • Intelligent result merging and ranking
  • Configurable search options
  • Efficient storage and retrieval
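
The weighted merging described above can be sketched as follows. This is a minimal illustration, not the library's actual implementation: the function name `mergeResults`, the `{ chunkId, score }` result shape, and the assumption that every strategy returns scores already normalized to the 0 to 1 range are all hypothetical. Only the 0.4 / 0.3 / 0.3 default weights come from this README.

```javascript
// Hypothetical sketch: combine per-strategy scores into one ranking.
// Assumes each strategy returns [{ chunkId, score }] with score in 0..1.
const DEFAULT_WEIGHTS = { vector: 0.4, text: 0.3, graph: 0.3 };

function mergeResults(resultsByStrategy, weights = DEFAULT_WEIGHTS) {
  const combined = new Map();
  for (const [strategy, results] of Object.entries(resultsByStrategy)) {
    const weight = weights[strategy] ?? 0; // unknown strategies contribute nothing
    for (const { chunkId, score } of results) {
      // A chunk found by several strategies accumulates their weighted scores.
      combined.set(chunkId, (combined.get(chunkId) ?? 0) + weight * score);
    }
  }
  return [...combined.entries()]
    .map(([chunkId, score]) => ({ chunkId, score }))
    .sort((a, b) => b.score - a.score); // highest combined score first
}

const ranked = mergeResults({
  vector: [{ chunkId: "c1", score: 0.9 }, { chunkId: "c2", score: 0.5 }],
  text: [{ chunkId: "c2", score: 1.0 }],
  graph: [{ chunkId: "c3", score: 0.8 }],
});
console.log(ranked.map((r) => r.chunkId));
```

In this sketch a chunk matched by two strategies can outrank a chunk with a single strong match, which is one plausible reason to combine strategies rather than pick only one.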

3. Basic Graph Storage

  • Stores documents as connected chunks
  • Uses Neo4j for efficient storage
  • Basic document-chunk relationships
  • Entity relationship tracking
  • Graph traversal capabilities

4. Chat Interface

  • Natural language interaction with documents
  • Context-aware responses
  • Streaming response generation
  • Multiple search strategies:
    • Vector similarity search
    • Full-text search
    • Graph-based search
  • Configurable search options

🛠️ Prerequisites

  • Node.js 18+
  • Docker / Neo4j Database
  • OpenAI API key with access to gpt-4 and text-embedding-3-small

🚀 Quick Start

  1. Clone the repository:
git clone https://github.com/msroot/docuGraphRAG.js.git
cd docuGraphRAG.js
  2. Install dependencies:
npm install
  3. Start Neo4j (using Docker):
docker-compose up -d neo4j

Access the Neo4j Browser at http://localhost:7474/browser/ to explore your graph database.

Run the example app

  1. Create environment file:
cd app/
# Create a new .env file
touch .env

# Add the following configuration to your .env file:
NEO4J_URL=bolt://localhost:7687
NEO4J_USER=neo4j
NEO4J_PASSWORD=password
OPENAI_API_KEY=your-openai-key
  2. Run the example app:
npm install
npm start


The example app provides a complete web interface featuring:

  • Document upload and management
  • Interactive chat interface with streaming responses
  • Real-time Knowledge Graph visualization
    • Visual representation of document relationships
    • Entity connections and relationships
    • Expandable full-screen view
    • Interactive node exploration
  • Configurable search strategies:
    • Vector similarity search
    • Full-text search
    • Graph-based search
  • Direct access to Neo4j Browser for advanced queries

Visit http://localhost:3000 to explore the interface.

⚙️ Configuration

Required settings:

| Parameter | Type | Default | Description |
|---|---|---|---|
| neo4jUrl | string | - | Neo4j database connection URL |
| neo4jUser | string | - | Neo4j database username |
| neo4jPassword | string | - | Neo4j database password |
| openaiApiKey | string | - | Your OpenAI API key |

Optional settings:

| Parameter | Type | Default | Description |
|---|---|---|---|
| chunkSize | number | 1000 | Size of document chunks in characters |
| chunkOverlap | number | 200 | Overlap between consecutive chunks |
| similarityThreshold | number | 0.1 | Minimum similarity score for vector search |
| vectorSearchWeight | number | 0.4 | Weight for vector similarity search (0-1) |
| textSearchWeight | number | 0.3 | Weight for full-text search (0-1) |
| graphSearchWeight | number | 0.3 | Weight for graph-based search (0-1) |

Example configuration in code:

import { DocuGraphRAG } from 'docugraphrag';

const config = {
    neo4jUrl: process.env.NEO4J_URL,
    neo4jUser: process.env.NEO4J_USER,
    neo4jPassword: process.env.NEO4J_PASSWORD,
    openaiApiKey: process.env.OPENAI_API_KEY,
    // Optional settings
    chunkSize: 1000,
    chunkOverlap: 200,
    similarityThreshold: 0.1,
    vectorSearchWeight: 0.4,
    textSearchWeight: 0.3,
    graphSearchWeight: 0.3
};

const rag = new DocuGraphRAG(config);
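
If any of the required environment variables is unset, the config object above silently carries `undefined` values. A small helper can catch that before the library is constructed. The sketch below is an assumption-laden illustration: `buildConfig`, `REQUIRED_ENV`, and `OPTIONAL_DEFAULTS` are hypothetical names that are not part of docuGraphRAG.js, and the defaults are simply copied from the configuration table.

```javascript
// Hypothetical helper (not part of docuGraphRAG.js): build a config object
// from environment variables and fail fast when a required one is missing.
const REQUIRED_ENV = {
  neo4jUrl: "NEO4J_URL",
  neo4jUser: "NEO4J_USER",
  neo4jPassword: "NEO4J_PASSWORD",
  openaiApiKey: "OPENAI_API_KEY",
};

// Defaults copied from the configuration table in this README.
const OPTIONAL_DEFAULTS = {
  chunkSize: 1000,
  chunkOverlap: 200,
  similarityThreshold: 0.1,
  vectorSearchWeight: 0.4,
  textSearchWeight: 0.3,
  graphSearchWeight: 0.3,
};

function buildConfig(env, overrides = {}) {
  const missing = Object.values(REQUIRED_ENV).filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing required environment variables: ${missing.join(", ")}`);
  }
  const required = Object.fromEntries(
    Object.entries(REQUIRED_ENV).map(([key, name]) => [key, env[name]])
  );
  // Overrides replace defaults; required settings always come from env.
  return { ...OPTIONAL_DEFAULTS, ...overrides, ...required };
}
```

With this helper, `buildConfig(process.env, { chunkSize: 500 })` would return an object that can be passed to the `DocuGraphRAG` constructor in place of the hand-written `config`.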

💻 Usage Example

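import { DocuGraphRAG } from 'docugraphrag';

// Initialize (config is the object from the configuration example)
const rag = new DocuGraphRAG(config);
await rag.initialize();

// Process a document: text holds the document's plain text, and the
// second argument describes the analysis scope for entity extraction
const result = await rag.processDocument(text, "Analysis focus description");

// Chat with the document, choosing which search strategies to use
const answer = await rag.chat("Who is Dr. Sarah Jones?", {
    documentIds: ["doc123"], // IDs of the documents to search
    vectorSearch: true,
    textSearch: true,
    graphSearch: true
});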

System Architecture 🏗️

graph TD
    subgraph Document_Processing ["📄 Document Processing"]
        DocInput[/"PDF/Text Document"/] --> TextProc["Text Extraction & Cleaning"]
        TextProc --> Chunking["Smart Chunking"]
        Chunking -->|"Overlapping chunks"| Chunks[(Processed Chunks)]
    end

    subgraph Knowledge_Graph_Creation ["🧠 Knowledge Graph Creation"]
        Chunks --> |"Chunk text"| VectorGen["Vector Embedding Generation"]
        Chunks --> |"Content analysis"| EntityExt["Entity Extraction"]
        
        EntityExt --> |"Named entities"| RelEngine["Relationship Engine"]
        RelEngine --> |"Entity pairs"| RelCreation["Relationship Creation"]
        
        subgraph Neo4j_Storage ["📊 Neo4j Database"]
            GraphDB[("Neo4j Graph DB")]
            Indexes["Custom Indexes"]
            TextIndex["Full-Text Search Index"]
            VectorIndex["Vector Index"]
            GraphDB --> |"Indexed by"| Indexes
            GraphDB --> |"Text indexing"| TextIndex
            GraphDB --> |"Vector indexing"| VectorIndex
        end
        
        VectorGen --> |"Store embeddings"| GraphDB
        RelCreation --> |"Store relationships"| GraphDB
        Chunks --> |"Store chunks"| GraphDB
    end

    subgraph Query_Processing ["🔍 Query Processing"]
        UserQuery[/"User Question"/] --> QueryEmbed["Query Embedding"]
        UserQuery --> QueryEntity["Query Entity Recognition"]
        UserQuery --> TextSearch["Text Search Processing"]
        
        subgraph Search_System ["Hybrid Search System"]
            QueryEmbed --> |"Vector similarity (40%)"| HybridSearch
            QueryEntity --> |"Entity matching (30%)"| HybridSearch
            TextSearch --> |"Text matching (30%)"| HybridSearch
            
            TextIndex --> |"Full-text results"| TextSearch
            VectorIndex --> |"Vector results"| HybridSearch
            GraphDB --> |"Graph traversal"| HybridSearch
            
            HybridSearch --> |"Ranked & merged results"| ContextFusion
        end
        
        ContextFusion --> |"Combined context"| RespGen["Response Generation"]
        RespGen --> Answer[/"Final Answer"/]
    end

    %% Styling
    classDef input fill:#e3f2fd,stroke:#1565c0,stroke-width:2px;
    classDef process fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px;
    classDef database fill:#fff3e0,stroke:#ef6c00,stroke-width:2px;
    classDef search fill:#f3e5f5,stroke:#6a1b9a,stroke-width:2px;
    classDef output fill:#fce4ec,stroke:#c2185b,stroke-width:2px;
    
    class DocInput,UserQuery input;
    class TextProc,Chunking,VectorGen,EntityExt,RelEngine,RelCreation,QueryEmbed,QueryEntity,TextSearch process;
    class GraphDB,Indexes,TextIndex,VectorIndex database;
    class HybridSearch,ContextFusion search;
    class Answer output;

    %% Subgraph styling
    style Document_Processing fill:#f8f9fa,stroke:#343a40,stroke-width:2px;
    style Knowledge_Graph_Creation fill:#f8f9fa,stroke:#343a40,stroke-width:2px;
    style Query_Processing fill:#f8f9fa,stroke:#343a40,stroke-width:2px;
    style Neo4j_Storage fill:#fff3e0,stroke:#ef6c00,stroke-width:2px;
    style Search_System fill:#f3e5f5,stroke:#6a1b9a,stroke-width:2px;

    %% Relationships
    linkStyle default stroke:#666,stroke-width:2px;

How It Works 🔍

1. Document Processing

  • Splits documents into manageable chunks
  • Generates vector embeddings for each chunk
  • Stores content in Neo4j database
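
The chunking step above can be illustrated with a fixed-size sliding window. This is only a sketch under stated assumptions: the README's diagram mentions "Smart Chunking", whose actual rules are not documented here, so the plain character-based split below and the name `chunkText` are hypothetical. The defaults mirror the documented `chunkSize` (1000) and `chunkOverlap` (200).

```javascript
// Hypothetical sketch: split text into fixed-size character chunks where
// each chunk repeats the last `chunkOverlap` characters of the previous one.
function chunkText(text, chunkSize = 1000, chunkOverlap = 200) {
  if (chunkSize <= 0 || chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new RangeError("require chunkSize > 0 and 0 <= chunkOverlap < chunkSize");
  }
  const step = chunkSize - chunkOverlap; // distance between chunk starts
  const chunks = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push({
      index: chunks.length,
      start,
      text: text.slice(start, start + chunkSize),
    });
    // Stop once a chunk has reached the end of the text.
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}

const chunks = chunkText("a".repeat(2500));
console.log(chunks.map((c) => [c.start, c.text.length]));
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of embedding some text twice.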

2. Search Process

When you ask a question:

  1. Converts question to vector embedding
  2. Finds relevant document chunks using multiple strategies:
    • Vector similarity search
    • Full-text search
    • Graph-based search
  3. Merges and ranks results
  4. Generates comprehensive answer
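
To make the vector similarity step concrete, the sketch below scores chunks against a query embedding with cosine similarity and drops anything below `similarityThreshold`. It runs in memory purely for illustration; the architecture diagram suggests the library queries a Neo4j vector index instead, and the similarity measure it uses is not stated in this README, so cosine similarity is an assumption. The names `cosineSimilarity` and `vectorSearch` and the `{ id, embedding }` chunk shape are hypothetical.

```javascript
// Hypothetical in-memory sketch of vector similarity search.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) throw new Error("Embeddings must have the same length");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // avoid dividing by zero
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Keep chunks whose similarity meets the threshold, best match first.
function vectorSearch(queryEmbedding, chunks, similarityThreshold = 0.1) {
  return chunks
    .map((chunk) => ({ id: chunk.id, score: cosineSimilarity(queryEmbedding, chunk.embedding) }))
    .filter((hit) => hit.score >= similarityThreshold)
    .sort((a, b) => b.score - a.score);
}

// Toy 2-dimensional embeddings; real embeddings have far more dimensions.
const hits = vectorSearch([1, 0], [
  { id: "a", embedding: [1, 0] },
  { id: "b", embedding: [0, 1] },
  { id: "c", embedding: [1, 1] },
]);
console.log(hits.map((h) => h.id));
```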

3. Data Structure

(Document)-[:HAS_CHUNK]->(DocumentChunk)
(DocumentChunk)-[:HAS_ENTITY]->(Entity)
(Entity)-[:RELATES_TO]->(Entity)
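
As an illustration of how this structure could be traversed, the sketch below builds a parameterized Cypher query that starts from a named entity, follows `RELATES_TO` to neighboring entities, and returns the chunks and documents that mention them. The node labels and relationship types come from the structure above; the property names (`name`, `text`, `id`) and the function `relatedChunksQuery` are assumptions, since this README does not document the node properties.

```javascript
// Hypothetical sketch: build a Cypher query over the documented graph shape.
// Property names (name, text, id) are assumptions, not documented fields.
function relatedChunksQuery(entityName) {
  const query = [
    "MATCH (e:Entity {name: $name})-[:RELATES_TO]-(other:Entity)",
    "MATCH (chunk:DocumentChunk)-[:HAS_ENTITY]->(other)",
    "MATCH (doc:Document)-[:HAS_CHUNK]->(chunk)",
    "RETURN doc.id AS documentId, chunk.text AS text, other.name AS relatedEntity",
  ].join("\n");
  // Passing the name as a parameter avoids string interpolation into Cypher.
  return { query, params: { name: entityName } };
}

const { query, params } = relatedChunksQuery("Dr. Sarah Jones");
console.log(query);
console.log(params);
```

The resulting `query` and `params` could be run in the Neo4j Browser or passed to `session.run(query, params)` in the official Neo4j JavaScript driver, provided the property names match what is actually stored.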

Contributing 🤝

We welcome contributions! Please check our contributing guidelines for more information.

Disclaimer ⚠️

RESEARCH PURPOSES ONLY. This project, docuGraphRAG.js, is strictly intended for research and educational exploration. It has not been designed or tested for production environments and may contain limitations, errors, or security vulnerabilities.

License 📄

MIT License

Support 💬

  • Create an issue for bug reports
  • Start a discussion for feature requests
  • Check our documentation for guides

Built with ❤️ by Yannis Kolovos
