Merge dev into main – Implement DeepGit Enhancements, Workflow Optimizations, and Bug Fixes by zamalali · Pull Request #1 · zamalali/DeepGit · GitHub
Merge dev into main – Implement DeepGit Enhancements, Workflow Optimizations, and Bug Fixes #1

Merged
merged 38 commits into from
Mar 31, 2025
38 commits
6bcbc5a
Deep Research Integrated
Mar 22, 2025
ae9abc6
Requirements added
Mar 22, 2025
4179568
Add new binary files and update requirements and environment configur…
Mar 23, 2025
b628c87
Add new binary files, update environment variables, and implement cha…
Mar 24, 2025
c4a3059
Add new binary files, update README image, and fix typo in chat.py
Mar 24, 2025
52fcc32
Remove obsolete compiled Python bytecode file
Mar 24, 2025
656ce59
Ignore checkpoint files
Mar 24, 2025
b9a78ba
Add new binary files and update README image for improved documentation
Mar 24, 2025
81f7577
Add decision-making functionality for code analysis and update README…
Mar 26, 2025
8ab4fb0
Enhance README styling and update logo size for better visibility
Mar 26, 2025
360e483
Update README styling for improved visibility and aesthetics
Mar 26, 2025
2069403
Update logo image in README for improved visibility and add new logo …
Mar 27, 2025
3eef75c
Update README styling with new font and image adjustments for enhance…
Mar 27, 2025
4ee3900
Refactor README header styling and update logo for improved layout an…
Mar 27, 2025
bb4769d
Remove unused binary files and add new analysis and retrieval tools f…
Mar 28, 2025
50412d7
Add new DeepGit agent functionality with async code quality analysis …
Mar 28, 2025
52331ea
Remove .env and update .gitignore
Mar 28, 2025
946ae20
Track logo using Git LFS
Mar 28, 2025
1cfbaaf
initial commit
Mar 29, 2025
b656ada
Initial push to Hugging Face DeepGit Space 🚀
Mar 29, 2025
744410d
🚀 Push all DeepGit files and folders
Mar 29, 2025
532e4d3
Track binary assets with Git LFS
Mar 29, 2025
a98a4c6
Fix .gitattributes merge conflict
Mar 29, 2025
8acf1d3
Ignore .env file going forward
Mar 29, 2025
6f47260
Add environment variables, update requirements, and include binary as…
Mar 29, 2025
0807c05
Update dependencies, refactor retrieval methods, and enhance activity…
Mar 29, 2025
56e86f3
Add Dockerfile, update Gradio theme, and remove unused certificate
Mar 29, 2025
4add0b1
Remove merge conflict markers and clean up README.md content
Mar 29, 2025
edd481e
Update workflow to trigger on dev branch
Mar 29, 2025
bea08f0
Add full Docker build & push workflow for dev
Mar 29, 2025
d53054e
Fix DockerHub username for push
Mar 29, 2025
780a6ef
Add new certificate and update requirements for rank_bm25
Mar 29, 2025
125b0c3
Refactor app.py to remove redundant launch call in main block
Mar 29, 2025
a8a4f2f
Update app.py to use default theme and remove share option from launch
Mar 29, 2025
3060ab8
Add test files and update app.py theme configuration
Mar 29, 2025
1d15d2c
Remove unused files and assets; add testing documentation
Mar 31, 2025
7fb1da3
Update environment variables for security and remove deprecated deepg…
Mar 31, 2025
a67c542
Add Docker documentation and enhance README with usage instructions
Mar 31, 2025
5 changes: 5 additions & 0 deletions .env
@@ -0,0 +1,5 @@
GITHUB_API_KEY=your_github_api_key
HUGGINGFACE_TOKEN=your_huggingface_token
GROQ_API_KEY=your_groq_api_key
export PYDANTIC_V1_COMPATIBLE_MODE="true"
LANGSMITH_API_KEY=your_langsmith_api_key
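
Review note: the `.env` template above mixes plain `KEY=value` lines with one `export`-prefixed, quoted line. The project loads this file with `python-dotenv`; the stdlib-only sketch below is not that loader, just a rough illustration of how both line forms can be read and how missing keys could be reported before the agent starts. `parse_env_text` and `missing_keys` are hypothetical helper names that do not appear in this PR.

```python
from typing import Dict, Iterable, List


def parse_env_text(text: str) -> Dict[str, str]:
    """Parse KEY=value lines, tolerating an optional leading `export`."""
    values: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        # Accept shell-style `export KEY=value` lines as well.
        if line.startswith("export "):
            line = line[len("export "):].lstrip()
        if "=" not in line:
            continue
        key, _, value = line.partition("=")
        value = value.strip()
        # Remove one pair of matching surrounding quotes, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        values[key.strip()] = value
    return values


def missing_keys(values: Dict[str, str], required: Iterable[str]) -> List[str]:
    """Return the required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]


# Two lines in the same shape as the template above.
SAMPLE = '''GITHUB_API_KEY=your_github_api_key
export PYDANTIC_V1_COMPATIBLE_MODE="true"
'''

parsed = parse_env_text(SAMPLE)
print(parsed)
print(missing_keys(parsed, ["GITHUB_API_KEY", "GROQ_API_KEY"]))
```

A check like `missing_keys` could run at startup so that an absent token fails fast instead of surfacing later as an API error.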
32 changes: 32 additions & 0 deletions .github/workflows/docker-image.yml
@@ -0,0 +1,32 @@
name: Build Docker Image (DeepGit)

on:
  workflow_dispatch:  # allows manual trigger
  push:
    branches:
      - dev  # triggers on dev branch

jobs:
  build:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout code
        uses: actions/checkout@v3

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3

      - name: Build Docker image
        run: docker build -t deepgit-app .

      - name: Log in to Docker Hub
        uses: docker/login-action@v2
        with:
          username: ${{ secrets.DOCKER_USERNAME }}
          password: ${{ secrets.DOCKER_PASSWORD }}

      - name: Tag and Push Docker image
        run: |
          docker tag deepgit-app at384/deepgit-app:latest
          docker push at384/deepgit-app:latest
24 changes: 24 additions & 0 deletions Dockerfile
@@ -0,0 +1,24 @@
# Use a slim Python 3.10 image as the base
FROM python:3.10-slim

# Install system dependencies (if needed for building some Python packages)
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
        build-essential \
    && rm -rf /var/lib/apt/lists/*

# Set the working directory in the container
WORKDIR /app

# Copy requirements.txt and install Python dependencies
COPY requirements.txt .
RUN pip install --upgrade pip && pip install -r requirements.txt

# Copy the rest of the code into the container
COPY . .

# Expose the default port for Gradio (if you want to access the app externally)
EXPOSE 7860

# Set the command to run your app
CMD ["python", "app.py"]
87 changes: 66 additions & 21 deletions README.md
@@ -3,45 +3,90 @@
</h1>

<p align="center">
<img src="assets/workflow.png" width="120%" alt="Workflow Diagram"/>
<img src="assets/flow.png" alt="Langgraph Workflow Diagram" style="max-width: 800px; width: 100%; height: auto;" />
</p>

## DeepGit

**DeepGit** is an autonomous agent designed to perform deep semantic research across GitHub repositories. It intelligently searches, analyzes, and ranks repositories based on user intent — even for less-known but highly relevant tools.
**DeepGit** is an advanced, Langgraph-based agentic workflow designed to perform deep research across GitHub repositories. It intelligently searches, analyzes, and ranks repositories based on user intent, even uncovering less-known but highly relevant tools. DeepGit combines hybrid dense retrieval, cross-encoder re-ranking, and comprehensive activity analysis in a unified, open-source platform for intelligent repository discovery.

---

## ⚙️ How It Works — Agentic Workflow

When a user submits a query, **DeepGit Orchestrator Agent** takes over. Here's the breakdown of the pipeline:
When a user submits a query, the **DeepGit Orchestrator Agent** takes over, passing the query through a series of specialized tools:

1. **Query Expansion Tool**
Enhances vague user queries using language models to add specificity and context, enabling more accurate downstream retrieval.

### 🔹 1. Query Expansion Tool
Enhances vague user queries using language models to add specificity and context — enabling more accurate downstream retrieval.
2. **Semantic Retrieval Tool**
Leverages cutting-edge embedding models to semantically match the enhanced query against a wide array of GitHub repositories.

### 🔹 2. Semantic Retrieval Tool
Uses state-of-the-art embedding models to semantically match the enhanced query against a broad set of GitHub repositories.
3. **Documentation Intelligence Tool**
Scrapes and interprets repository documentation (e.g., README files and additional markdowns) to understand the purpose, setup, and key features.

### 🔹 3. Documentation Intelligence Tool
Summarizes and interprets README files to understand the purpose, setup, and key features of each repository.
4. **Codebase Mapping Tool**
Analyzes the project’s file structure and technology stack to assess complexity, modularity, and suitability for the user’s needs.

### 🔹 4. Codebase Mapping Tool
Analyzes the project’s file structure and technology stack to assess complexity, modularity, and suitability for the user’s needs.
5. **Community Insight Tool**
Aggregates social signals such as stars, forks, issues, and pull request activity to gauge real-world engagement and maturity.

### 🔹 5. Community Insight Tool
Gathers social signals like stars, forks, issues, and pull request activity to gauge real-world engagement and maturity.
6. **Relevance Synthesis Tool**
Combines insights from all modules to compute a final relevance score tailored to the user query.

### 🔹 6. Relevance Synthesis Tool
Combines insights from all modules to compute a final relevance score tailored to the user query.
7. **Insight Delivery Module**
Presents a ranked list of repositories with concise summaries and justifications, enabling smart discovery.

### 🔹 7. Insight Delivery Module
Presents ranked repositories to the user with concise summaries and justifications — enabling smart discovery.
---

## 🚀 Goals

- Surface powerful but under-the-radar open-source tools.
- Build an intelligent layer over GitHub for research-focused developers.
- Open-source the entire workflow to promote transparent research.
- **Uncover Hidden Gems:**
Surface powerful but under-the-radar open-source tools.

- **Empower Research:**
Build an intelligent discovery layer over GitHub tailored for research-focused developers.

- **Promote Open Innovation:**
Open-source the entire workflow to foster transparency and collaboration in research.

---

## 🖥️ User Interface

DeepGit provides an intuitive interface for exploring repository recommendations. On the main page, users enter a raw natural-language query. This is the primary interaction point for initiating deep semantic searches.

<p align="center">
<img src="assets/dashboard.png" alt="DeepGit Dashboard" style="max-width: 800px; width: 100%; height: auto;" />
</p>

*Output:* Showcases the tabulated results with clickable links and different threshold scores, making it easy to compare and understand the ranking criteria.


<p align="center">
<img src="assets/output.png" alt="DeepGit App UI" style="max-width: 800px; width: 100%; height: auto;" />
</p>

---

Want to contribute or give feedback? Reach out or open an issue!
### 🛠️ Running DeepGit

For detailed documentation on using DeepGit, see the [docs](docs).

DeepGit leverages Langgraph for orchestration. To launch the Langsmith dashboard and start the workflow, simply run:

```bash
langgraph dev
```
This command opens the Langsmith dashboard where you can enter your raw queries in a JSON snippet and monitor the entire agentic workflow.

### 🚀 Running DeepGit via App

To run DeepGit locally, simply execute:

```bash
python app.py
```

### DeepGit on Docker
For instructions on using Docker with DeepGit, please refer to our [Docker Documentation](docs/docker.md).
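
Review note: the README says the Relevance Synthesis Tool "combines insights from all modules to compute a final relevance score" but does not say how. One common way to do this is a weighted sum of per-signal scores, sketched below. Everything here is an assumption for illustration: the `Candidate` fields, the `WEIGHTS` values, and the premise that each signal is already scaled to the range 0 to 1. The PR's actual `multi_factor_ranking` tool may work differently.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    name: str
    semantic: float  # assumed to be pre-scaled to [0, 1]
    activity: float  # assumed to be pre-scaled to [0, 1]
    quality: float   # assumed to be pre-scaled to [0, 1]


# Illustrative weights only; the PR does not state how signals are weighted.
WEIGHTS = {"semantic": 0.5, "activity": 0.3, "quality": 0.2}


def final_score(candidate: Candidate) -> float:
    """Weighted sum of the three per-repository signals."""
    return (
        WEIGHTS["semantic"] * candidate.semantic
        + WEIGHTS["activity"] * candidate.activity
        + WEIGHTS["quality"] * candidate.quality
    )


def rank(candidates: List[Candidate]) -> List[Candidate]:
    """Return candidates ordered from highest to lowest final score."""
    return sorted(candidates, key=final_score, reverse=True)


repos = [
    Candidate("repo-a", semantic=0.9, activity=0.2, quality=0.5),
    Candidate("repo-b", semantic=0.6, activity=0.9, quality=0.8),
]
for candidate in rank(repos):
    print(candidate.name, round(final_score(candidate), 3))
```

With these made-up numbers, a repository with a weaker semantic match can still rank first when its activity and quality signals are strong, which is the trade-off a multi-factor ranking is meant to expose.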
141 changes: 141 additions & 0 deletions agent.py
@@ -0,0 +1,141 @@
import os
import base64
import requests
import numpy as np
import datetime
import math
import logging
import getpass
import faiss
from pathlib import Path
from dotenv import load_dotenv
from sentence_transformers import SentenceTransformer, CrossEncoder
from langgraph.graph import START, END, StateGraph
from pydantic import BaseModel, Field
from dataclasses import dataclass, field
from typing import List, Any
import subprocess
import tempfile
import shutil
import stat

# Import node functions from the tools directory.
from tools.convert_query import convert_searchable_query
from tools.github import ingest_github_repos
from tools.dense_retrieval import hybrid_dense_retrieval
from tools.cross_encoder_reranking import cross_encoder_reranking
from tools.filtering import threshold_filtering
from tools.activity_analysis import repository_activity_analysis
from tools.decision_maker import decision_maker
from tools.code_quality import code_quality_analysis
from tools.merge_analysis import merge_analysis
from tools.ranking import multi_factor_ranking
from tools.output_presentation import output_presentation

# ---------------------------
# Logging and Environment Setup
# ---------------------------
logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
logger = logging.getLogger(__name__)

dotenv_path = Path(__file__).resolve().parent.parent / ".env"
if dotenv_path.exists():
    load_dotenv(dotenv_path)

if "GITHUB_API_KEY" not in os.environ:
    os.environ["GITHUB_API_KEY"] = getpass.getpass("Enter your GitHub API key: ")

# ---------------------------
# State and Configuration
# ---------------------------
@dataclass(kw_only=True)
class AgentState:
    user_query: str = field(default="I am researching the application of Chain of Thought prompting for improving reasoning in large language models within a Python environment.")
    searchable_query: str = field(default="")
    repositories: List[Any] = field(default_factory=list)
    semantic_ranked: List[Any] = field(default_factory=list)
    reranked_candidates: List[Any] = field(default_factory=list)
    filtered_candidates: List[Any] = field(default_factory=list)
    activity_candidates: List[Any] = field(default_factory=list)
    quality_candidates: List[Any] = field(default_factory=list)
    final_ranked: List[Any] = field(default_factory=list)
    run_code_analysis: bool = field(default=False)

@dataclass(kw_only=True)
class AgentStateInput:
    user_query: str = field(default="I am researching the application of Chain of Thought prompting for improving reasoning in large language models within a Python environment.")

"""
@dataclass(kw_only=True)
class AgentStateOutput:
    final_ranked: List[Any] = field(default_factory=list)
"""

@dataclass(kw_only=True)
class AgentStateOutput:
    final_results: str = ""

class AgentConfiguration(BaseModel):
    max_results: int = Field(default=100, title="Max Results", description="Maximum results to fetch from GitHub")
    per_page: int = Field(default=25, title="Per Page", description="Results per page for GitHub API")
    dense_retrieval_k: int = Field(default=100, title="Dense Retrieval Top K", description="Top K candidates to retrieve from FAISS")
    cross_encoder_top_n: int = Field(default=50, title="Cross Encoder Top N", description="Top N candidates after re-ranking")
    min_stars: int = Field(default=50, title="Minimum Stars", description="Minimum star count threshold for filtering")
    cross_encoder_threshold: float = Field(default=5.5, title="Cross Encoder Threshold", description="Threshold for cross encoder score filtering")

    sem_model_name: str = Field(default="all-mpnet-base-v2", title="Sentence Transformer Model", description="Model for dense retrieval")
    cross_encoder_model_name: str = Field(default="cross-encoder/ms-marco-MiniLM-L-6-v2", title="Cross Encoder Model", description="Model for re-ranking")

    @classmethod
    def from_runnable_config(cls, config: Any = None) -> "AgentConfiguration":
        configurable = config["configurable"] if config and "configurable" in config else {}
        raw_values = {name: os.environ.get(name.upper(), configurable.get(name)) for name in cls.__fields__.keys()}
        values = {k: v for k, v in raw_values.items() if v is not None}
        return cls(**values)

# -------------------------------------------------------
# Build and Compile the Graph
# -------------------------------------------------------
builder = StateGraph(
    AgentState,
    input=AgentStateInput,
    output=AgentStateOutput,
    config_schema=AgentConfiguration
)

builder.add_node("convert_searchable_query", convert_searchable_query)
builder.add_node("ingest_github_repos", ingest_github_repos)
builder.add_node("neural_dense_retrieval", hybrid_dense_retrieval)
builder.add_node("cross_encoder_reranking", cross_encoder_reranking)
builder.add_node("threshold_filtering", threshold_filtering)
builder.add_node("repository_activity_analysis", repository_activity_analysis)
builder.add_node("decision_maker", decision_maker)
builder.add_node("code_quality_analysis", code_quality_analysis)
builder.add_node("merge_analysis", merge_analysis)
builder.add_node("multi_factor_ranking", multi_factor_ranking)
builder.add_node("output_presentation", output_presentation)

builder.add_edge(START, "convert_searchable_query")
builder.add_edge("convert_searchable_query", "ingest_github_repos")
builder.add_edge("ingest_github_repos", "neural_dense_retrieval")
builder.add_edge("neural_dense_retrieval", "cross_encoder_reranking")
builder.add_edge("cross_encoder_reranking", "threshold_filtering")
builder.add_edge("threshold_filtering", "repository_activity_analysis")
builder.add_edge("threshold_filtering", "decision_maker")
builder.add_edge("decision_maker", "code_quality_analysis")
builder.add_edge("repository_activity_analysis", "merge_analysis")
builder.add_edge("code_quality_analysis", "merge_analysis")
builder.add_edge("merge_analysis", "multi_factor_ranking")
builder.add_edge("multi_factor_ranking", "output_presentation")
builder.add_edge("output_presentation", END)

graph = builder.compile()

if __name__ == "__main__":
    initial_state = AgentStateInput(
        user_query="I am researching the application of Chain of Thought prompting for improving reasoning in large language models within a Python environment. No need for code analysis."
    )
    result = graph.invoke(initial_state)
    print(result["final_results"])
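
Review note: `AgentConfiguration.from_runnable_config` resolves each field in a fixed order: an environment variable named after the upper-cased field wins, then the `configurable` entry, then the field default. The stdlib-only sketch below mirrors that precedence so it can be checked in isolation. `DEFAULTS` and `resolve_settings` are hypothetical stand-ins, and the explicit type cast is an assumption that replaces the coercion Pydantic performs in the real class.

```python
import os
from typing import Any, Dict, Mapping, Optional

# Two fields copied from AgentConfiguration, with the same defaults.
DEFAULTS: Dict[str, Any] = {"max_results": 100, "min_stars": 50}


def resolve_settings(
    config: Optional[dict] = None,
    environ: Optional[Mapping[str, str]] = None,
) -> Dict[str, Any]:
    """Resolve settings as: environment variable > configurable > default."""
    environ = os.environ if environ is None else environ
    configurable = (config or {}).get("configurable", {})
    resolved = dict(DEFAULTS)
    for name, default in DEFAULTS.items():
        raw = environ.get(name.upper(), configurable.get(name))
        if raw is not None:
            # Environment values arrive as strings, so cast to the default's type.
            resolved[name] = type(default)(raw)
    return resolved


run_config = {"configurable": {"max_results": 20, "min_stars": 10}}
print(resolve_settings(run_config, environ={"MAX_RESULTS": "5"}))
print(resolve_settings(None, environ={}))
```

One consequence of this ordering is that a stray `MAX_RESULTS` in the shell silently overrides a per-run `configurable` value, which may or may not be the intended behaviour.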

# -------------------------------------------------------
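
Review note: the edge list in `agent.py` is linear until `threshold_filtering`, which fans out to `repository_activity_analysis` and `decision_maker`; the two branches rejoin at `merge_analysis`. The sketch below copies the edges as plain strings (with `"START"` and `"END"` standing in for the LangGraph constants) and computes each node's longest-path depth with a Kahn-style traversal. It only describes the shape of the graph and makes no claim about how LangGraph schedules the nodes at runtime; `depth_by_node` is a hypothetical helper.

```python
from collections import defaultdict, deque
from typing import Dict, List, Tuple

EDGES: List[Tuple[str, str]] = [
    ("START", "convert_searchable_query"),
    ("convert_searchable_query", "ingest_github_repos"),
    ("ingest_github_repos", "neural_dense_retrieval"),
    ("neural_dense_retrieval", "cross_encoder_reranking"),
    ("cross_encoder_reranking", "threshold_filtering"),
    ("threshold_filtering", "repository_activity_analysis"),
    ("threshold_filtering", "decision_maker"),
    ("decision_maker", "code_quality_analysis"),
    ("repository_activity_analysis", "merge_analysis"),
    ("code_quality_analysis", "merge_analysis"),
    ("merge_analysis", "multi_factor_ranking"),
    ("multi_factor_ranking", "output_presentation"),
    ("output_presentation", "END"),
]


def depth_by_node(edges: List[Tuple[str, str]]) -> Dict[str, int]:
    """Longest distance from any source node, via a topological traversal."""
    successors = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for src, dst in edges:
        successors[src].append(dst)
        indegree[dst] += 1
        nodes.update((src, dst))

    # Nodes with no incoming edges start at depth 0.
    depth = {node: 0 for node in nodes if indegree[node] == 0}
    queue = deque(depth)
    while queue:
        node = queue.popleft()
        for nxt in successors[node]:
            depth[nxt] = max(depth.get(nxt, 0), depth[node] + 1)
            indegree[nxt] -= 1
            # A node is released only after all its predecessors are done.
            if indegree[nxt] == 0:
                queue.append(nxt)
    return depth


depths = depth_by_node(EDGES)
for name in sorted(depths, key=depths.get):
    print(depths[name], name)
```

The two branches have different lengths (one node versus two), so `merge_analysis` sits one level below the longer `decision_maker` / `code_quality_analysis` branch.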