zamalali · zamalali · Mar 31, 2025 · Mar 22, 2025 · Mar 22, 2025 · Mar 23, 2025
diff --git a/.env b/.env
@@ -0,0 +1,5 @@
+GITHUB_API_KEY=your_github_api_key
+HUGGINGFACE_TOKEN=your_huggingface_token
+GROQ_API_KEY=your_groq_api_key
+export PYDANTIC_V1_COMPATIBLE_MODE="true"
+LANGSMITH_API_KEY=your_langsmith_api_key
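
Note on how environment values reach the agent settings: `AgentConfiguration.from_runnable_config` in `agent.py` looks up each field under its upper-cased name in `os.environ` before falling back to the `configurable` dict, so a line such as `MIN_STARS=200` in this file would presumably override the default. The sketch below illustrates that precedence with a plain dataclass stand-in. `SketchConfig` and `resolve` are hypothetical names, only two of the fields are mirrored, and the explicit `int(...)` cast is an assumption standing in for whatever coercion pydantic performs in the original.

```python
from dataclasses import dataclass, fields


@dataclass
class SketchConfig:
    # Two fields copied from AgentConfiguration's defaults; the rest are omitted.
    max_results: int = 100
    min_stars: int = 50


def resolve(configurable, environ):
    """Environment (upper-cased field name) wins, then `configurable`, then the default."""
    values = {}
    for f in fields(SketchConfig):
        raw = environ.get(f.name.upper(), configurable.get(f.name))
        if raw is not None:
            # Environment values always arrive as strings, so cast explicitly here.
            values[f.name] = int(raw)
    return SketchConfig(**values)


cfg = resolve({"max_results": 40, "min_stars": 10}, {"MIN_STARS": "200"})
print(cfg)  # min_stars comes from the environment, max_results from configurable
```

Passing the environment in as an argument, rather than reading `os.environ` directly, keeps the precedence rule easy to test in isolation.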
diff --git a/.github/workflows/docker-image.yml b/.github/workflows/docker-image.yml
@@ -0,0 +1,32 @@
+name: Build Docker Image (DeepGit)
+
+on:
+  workflow_dispatch:  # allows manual trigger
+  push:
+    branches:
+      - dev  # triggers on dev branch
+
+jobs:
+  build:
+    runs-on: ubuntu-latest
+
+    steps:
+      - name: Checkout code
+        uses: actions/checkout@v3
+
+      - name: Set up Docker Buildx
+        uses: docker/setup-buildx-action@v3
+
+      - name: Build Docker image
+        run: docker build -t deepgit-app .
+
+      - name: Log in to Docker Hub
+        uses: docker/login-action@v2
+        with:
+          username: ${{ secrets.DOCKER_USERNAME }}
+          password: ${{ secrets.DOCKER_PASSWORD }}
+
+      - name: Tag and Push Docker image
+        run: |
+          docker tag deepgit-app at384/deepgit-app:latest
+          docker push at384/deepgit-app:latest
diff --git a/Dockerfile b/Dockerfile
@@ -0,0 +1,24 @@
+# Use a slim Python 3.10 image as the base
+FROM python:3.10-slim
+
+# Install system dependencies (if needed for building some Python packages)
+RUN apt-get update && \
+    apt-get install -y --no-install-recommends \
+    build-essential \
+    && rm -rf /var/lib/apt/lists/*
+
+# Set the working directory in the container
+WORKDIR /app
+
+# Copy requirements.txt and install Python dependencies
+COPY requirements.txt .
+RUN pip install --upgrade pip && pip install -r requirements.txt
+
+# Copy the rest of the code into the container
+COPY . .
+
+# Expose the default port for Gradio (if you want to access the app externally)
+EXPOSE 7860
+
+# Set the command to run your app
+CMD ["python", "app.py"]
diff --git a/README.md b/README.md
@@ -3,45 +3,90 @@
 </h1>
 
 <p align="center">
-  <img src="assets/workflow.png" width="120%" alt="Workflow Diagram"/>
+  <img src="assets/flow.png" alt="Langgraph Workflow Diagram" style="max-width: 800px; width: 100%; height: auto;" />
 </p>
 
 ## DeepGit
 
-**DeepGit** is an autonomous agent designed to perform deep semantic research across GitHub repositories. It intelligently searches, analyzes, and ranks repositories based on user intent — even for less-known but highly relevant tools.
+**DeepGit** is an advanced, LangGraph-based agentic workflow designed to perform deep research across GitHub repositories. It searches, analyzes, and ranks repositories based on user intent, surfacing less-known but highly relevant tools. DeepGit combines hybrid dense retrieval, cross-encoder re-ranking, and repository activity analysis in a unified, open-source platform for intelligent repository discovery.
+
+---
 
 ## ⚙️ How It Works — Agentic Workflow
 
-When a user submits a query, **DeepGit Orchestrator Agent** takes over. Here's the breakdown of the pipeline:
+When a user submits a query, the **DeepGit Orchestrator Agent** takes over, passing the query through a series of specialized tools:
+
+1. **Query Expansion Tool**  
+   Enhances vague user queries using language models to add specificity and context, enabling more accurate downstream retrieval.
 
-### 🔹 1. Query Expansion Tool
-Enhances vague user queries using language models to add specificity and context — enabling more accurate downstream retrieval.
+2. **Semantic Retrieval Tool**  
+   Leverages cutting-edge embedding models to semantically match the enhanced query against a wide array of GitHub repositories.
 
-### 🔹 2. Semantic Retrieval Tool
-Uses state-of-the-art embedding models to semantically match the enhanced query against a broad set of GitHub repositories.
+3. **Documentation Intelligence Tool**  
+   Scrapes and interprets repository documentation (e.g., README files and additional markdowns) to understand the purpose, setup, and key features.
 
-### 🔹 3. Documentation Intelligence Tool
-Summarizes and interprets README files to understand the purpose, setup, and key features of each repository.
+4. **Codebase Mapping Tool**  
+   Analyzes the project’s file structure and technology stack to assess complexity, modularity, and suitability for the user’s needs.
 
-### 🔹 4. Codebase Mapping Tool
-Analyzes the project’s file structure and technology stack to assess complexity, modularity, and suitability for the user’s needs.
+5. **Community Insight Tool**  
+   Aggregates social signals such as stars, forks, issues, and pull request activity to gauge real-world engagement and maturity.
 
-### 🔹 5. Community Insight Tool
-Gathers social signals like stars, forks, issues, and pull request activity to gauge real-world engagement and maturity.
+6. **Relevance Synthesis Tool**  
+   Combines insights from all modules to compute a final relevance score tailored to the user query.
 
-### 🔹 6. Relevance Synthesis Tool
-Combines insights from all modules to compute a final relevance score tailored to the user query.
+7. **Insight Delivery Module**  
+   Presents a ranked list of repositories with concise summaries and justifications, enabling smart discovery.
 
-### 🔹 7. Insight Delivery Module
-Presents ranked repositories to the user with concise summaries and justifications — enabling smart discovery.
+---
 
 ## 🚀 Goals
 
-- Surface powerful but under-the-radar open-source tools.
-- Build an intelligent layer over GitHub for research-focused developers.
-- Open-source the entire workflow to promote transparent research.
+- **Uncover Hidden Gems:**  
+  Surface powerful but under-the-radar open-source tools.
+
+- **Empower Research:**  
+  Build an intelligent discovery layer over GitHub tailored for research-focused developers.
+
+- **Promote Open Innovation:**  
+  Open-source the entire workflow to foster transparency and collaboration in research.
+
+---
+
+## 🖥️ User Interface
+
+DeepGit provides an intuitive interface for exploring repository recommendations. On the main page, users enter a raw natural-language query; this is the primary interaction point for initiating deep semantic searches.
+
+<p align="center">
+  <img src="assets/dashboard.png" alt="DeepGit Dashboard" style="max-width: 800px; width: 100%; height: auto;" />
+</p>
+
+*Output:* Showcases the tabulated results with clickable links and different threshold scores, making it easy to compare and understand the ranking criteria.
+
+
+<p align="center">
+  <img src="assets/output.png" alt="DeepGit App UI" style="max-width: 800px; width: 100%; height: auto;" />
+</p>
 
 ---
 
-Want to contribute or give feedback? Reach out or open an issue!
+### 🛠️ Running DeepGit
+
+For detailed documentation on using DeepGit, see the [docs](docs).
+
+DeepGit uses LangGraph for orchestration. To launch the LangSmith dashboard and start the workflow, run:
+
+```bash
+langgraph dev
+```
+This command opens the LangSmith dashboard, where you can enter your raw queries in a JSON snippet and monitor the entire agentic workflow.
+
+### 🚀 Running DeepGit via App
+
+To run DeepGit locally, simply execute:
+
+```bash
+python app.py
+```
 
+### DeepGit on Docker
+For instructions on using Docker with DeepGit, please refer to our [Docker Documentation](docs/docker.md).
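
Note on the retrieval stages the README describes (dense retrieval, then cross-encoder re-ranking, then threshold filtering): the sketch below shows the shape of that three-stage funnel in plain Python. It is a toy under stated assumptions, not DeepGit's implementation. The two-dimensional vectors stand in for sentence embeddings, the `overlap` field and the `rerank_score` callable stand in for a cross-encoder score, and `retrieve_rerank_filter` and the repository names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_rerank_filter(query_vec, repos, rerank_score, top_k, threshold):
    # Stage 1: cheap dense retrieval, keep the top_k nearest by cosine similarity.
    nearest = sorted(repos, key=lambda r: cosine(query_vec, r["vec"]), reverse=True)[:top_k]
    # Stage 2: re-score only the survivors with the more expensive scorer.
    scored = [(rerank_score(r), r["name"]) for r in nearest]
    # Stage 3: drop candidates below the threshold and return the rest, best first.
    return [name for score, name in sorted(scored, reverse=True) if score >= threshold]


repos = [
    {"name": "cot-toolkit", "vec": [0.9, 0.1], "overlap": 3},
    {"name": "prompt-lab", "vec": [0.8, 0.3], "overlap": 5},
    {"name": "image-resizer", "vec": [0.0, 1.0], "overlap": 4},
    {"name": "llm-notes", "vec": [0.7, 0.6], "overlap": 1},
]

ranked = retrieve_rerank_filter(
    query_vec=[1.0, 0.0],
    repos=repos,
    rerank_score=lambda r: float(r["overlap"]),  # stand-in for a cross-encoder score
    top_k=3,
    threshold=2.0,
)
print(ranked)  # ['prompt-lab', 'cot-toolkit']
```

`image-resizer` never reaches the re-ranker even though its stand-in score is high, because it is eliminated in stage 1. That is the trade-off of a funnel: the expensive scorer only sees what the cheap retriever lets through, so `top_k` has to be generous enough not to discard good candidates early.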
diff --git a/agent.py b/agent.py
@@ -0,0 +1,141 @@
+import os
+import base64
+import requests
+import numpy as np
+import datetime
+import math
+import logging
+import getpass
+import faiss
+from pathlib import Path
+from dotenv import load_dotenv
+from sentence_transformers import SentenceTransformer, CrossEncoder
+from langgraph.graph import START, END, StateGraph
+from pydantic import BaseModel, Field
+from dataclasses import dataclass, field
+from typing import List, Any
+import subprocess
+import tempfile
+import shutil
+import stat
+
+# Import node functions from the tools directory.
+from tools.convert_query import convert_searchable_query
+from tools.github import ingest_github_repos
+from tools.dense_retrieval import hybrid_dense_retrieval
+from tools.cross_encoder_reranking import cross_encoder_reranking
+from tools.filtering import threshold_filtering
+from tools.activity_analysis import repository_activity_analysis
+from tools.decision_maker import decision_maker
+from tools.code_quality import code_quality_analysis
+from tools.merge_analysis import merge_analysis
+from tools.ranking import multi_factor_ranking
+from tools.output_presentation import output_presentation
+
+# ---------------------------
+# Logging and Environment Setup
+# ---------------------------
+logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
+logger = logging.getLogger(__name__)
+
+dotenv_path = Path(__file__).resolve().parent / ".env"  # .env sits next to agent.py at the repo root
+if dotenv_path.exists():
+    load_dotenv(dotenv_path)
+
+if "GITHUB_API_KEY" not in os.environ:
+    os.environ["GITHUB_API_KEY"] = getpass.getpass("Enter your GitHub API key: ")
+
+# ---------------------------
+# State and Configuration
+# ---------------------------
+@dataclass(kw_only=True)
+class AgentState:
+    user_query: str = field(default="I am researching the application of Chain of Thought prompting for improving reasoning in large language models within a Python environment.")
+    searchable_query: str = field(default="")
+    repositories: List[Any] = field(default_factory=list)
+    semantic_ranked: List[Any] = field(default_factory=list)
+    reranked_candidates: List[Any] = field(default_factory=list)
+    filtered_candidates: List[Any] = field(default_factory=list)
+    activity_candidates: List[Any] = field(default_factory=list)
+    quality_candidates: List[Any] = field(default_factory=list)
+    final_ranked: List[Any] = field(default_factory=list)
+    run_code_analysis: bool = field(default=False)
+
+@dataclass(kw_only=True)
+class AgentStateInput:
+    user_query: str = field(default="I am researching the application of Chain of Thought prompting for improving reasoning in large language models within a Python environment.")
+
+"""
+@dataclass(kw_only=True)
+class AgentStateOutput:
+    final_ranked: List[Any] = field(default_factory=list)
+"""
+
+@dataclass(kw_only=True)
+class AgentStateOutput:
+    final_results: str = ""
+
+class AgentConfiguration(BaseModel):
+    max_results: int = Field(default=100, title="Max Results", description="Maximum results to fetch from GitHub")
+    per_page: int = Field(default=25, title="Per Page", description="Results per page for GitHub API")
+    dense_retrieval_k: int = Field(default=100, title="Dense Retrieval Top K", description="Top K candidates to retrieve from FAISS")
+    cross_encoder_top_n: int = Field(default=50, title="Cross Encoder Top N", description="Top N candidates after re-ranking")
+    min_stars: int = Field(default=50, title="Minimum Stars", description="Minimum star count threshold for filtering")
+    cross_encoder_threshold: float = Field(default=5.5, title="Cross Encoder Threshold", description="Threshold for cross encoder score filtering")
+
+    sem_model_name: str = Field(default="all-mpnet-base-v2", title="Sentence Transformer Model", description="Model for dense retrieval")
+    cross_encoder_model_name: str = Field(default="cross-encoder/ms-marco-MiniLM-L-6-v2", title="Cross Encoder Model", description="Model for re-ranking")
+
+    @classmethod
+    def from_runnable_config(cls, config: Any = None) -> "AgentConfiguration":
+        configurable = config["configurable"] if config and "configurable" in config else {}
+        raw_values = {name: os.environ.get(name.upper(), configurable.get(name)) for name in cls.__fields__.keys()}
+        values = {k: v for k, v in raw_values.items() if v is not None}
+        return cls(**values)
+
+# -------------------------------------------------------
+# Build and Compile the Graph
+# -------------------------------------------------------
+builder = StateGraph(
+    AgentState,
+    input=AgentStateInput,
+    output=AgentStateOutput,
+    config_schema=AgentConfiguration
+)
+
+builder.add_node("convert_searchable_query", convert_searchable_query)
+builder.add_node("ingest_github_repos", ingest_github_repos)
+builder.add_node("neural_dense_retrieval", hybrid_dense_retrieval)
+builder.add_node("cross_encoder_reranking", cross_encoder_reranking)
+builder.add_node("threshold_filtering", threshold_filtering)
+builder.add_node("repository_activity_analysis", repository_activity_analysis)
+builder.add_node("decision_maker", decision_maker)
+builder.add_node("code_quality_analysis", code_quality_analysis)
+builder.add_node("merge_analysis", merge_analysis)
+builder.add_node("multi_factor_ranking", multi_factor_ranking)
+builder.add_node("output_presentation", output_presentation)
+
+builder.add_edge(START, "convert_searchable_query")
+builder.add_edge("convert_searchable_query", "ingest_github_repos")
+builder.add_edge("ingest_github_repos", "neural_dense_retrieval")
+builder.add_edge("neural_dense_retrieval", "cross_encoder_reranking")
+builder.add_edge("cross_encoder_reranking", "threshold_filtering")
+builder.add_edge("threshold_filtering", "repository_activity_analysis")
+builder.add_edge("threshold_filtering", "decision_maker")
+builder.add_edge("decision_maker", "code_quality_analysis")
+builder.add_edge("repository_activity_analysis", "merge_analysis")
+builder.add_edge("code_quality_analysis", "merge_analysis")
+builder.add_edge("merge_analysis", "multi_factor_ranking")
+builder.add_edge("multi_factor_ranking", "output_presentation")
+builder.add_edge("output_presentation", END)
+
+graph = builder.compile()
+
+if __name__ == "__main__":
+    initial_state = AgentStateInput(
+        user_query="I am researching the application of Chain of Thought prompting for improving reasoning in large language models within a Python environment. No need for code analysis."
+    )
+    result = graph.invoke(initial_state)
+    print(result["final_results"])
+
+# -------------------------------------------------------
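
Note on the graph shape in `agent.py`: the edges fan out after `threshold_filtering` into two branches of different lengths and join again at `merge_analysis`. The sketch below copies the edge list from the diff and groups the nodes into dependency "waves" using the standard library's `graphlib`. It models dependency order only and makes no claim about how LangGraph actually schedules these nodes; `edges`, `preds`, and `waves` are names introduced here for illustration.

```python
from graphlib import TopologicalSorter

# Edge list copied from the builder.add_edge calls in agent.py.
edges = [
    ("START", "convert_searchable_query"),
    ("convert_searchable_query", "ingest_github_repos"),
    ("ingest_github_repos", "neural_dense_retrieval"),
    ("neural_dense_retrieval", "cross_encoder_reranking"),
    ("cross_encoder_reranking", "threshold_filtering"),
    ("threshold_filtering", "repository_activity_analysis"),
    ("threshold_filtering", "decision_maker"),
    ("decision_maker", "code_quality_analysis"),
    ("repository_activity_analysis", "merge_analysis"),
    ("code_quality_analysis", "merge_analysis"),
    ("merge_analysis", "multi_factor_ranking"),
    ("multi_factor_ranking", "output_presentation"),
    ("output_presentation", "END"),
]

# TopologicalSorter expects a mapping of node -> set of predecessors.
preds = {}
for src, dst in edges:
    preds.setdefault(src, set())
    preds.setdefault(dst, set()).add(src)

sorter = TopologicalSorter(preds)
sorter.prepare()

# Each wave holds the nodes whose predecessors have all completed.
waves = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())
    waves.append(ready)
    sorter.done(*ready)

for step, wave in enumerate(waves):
    print(step, wave)
```

The activity branch is one node long while the code-quality branch is two (`decision_maker`, then `code_quality_analysis`), so in pure dependency terms `merge_analysis` cannot start until the longer branch finishes. Whether the compiled graph waits for both branches is worth checking against LangGraph's documentation, which describes passing a list of source nodes to `add_edge` for joins that must wait on every branch.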