Codebase indexing (Clean history) by daniel-lxs · Pull Request #3137 · RooCodeInc/Roo-Code · GitHub

Codebase indexing (Clean history) #3137


Merged
merged 71 commits into from
May 24, 2025

Conversation

daniel-lxs
Collaborator
@daniel-lxs daniel-lxs commented May 3, 2025

Context

This PR implements codebase indexing using OpenAI embedding models and Qdrant as the vector store.
It also adds a new tool, codebase_search, which lets Roo search the indexed code using natural language.

The implementation reuses the tree sitter queries already present in the project.

Implementation

The implementation uses tree-sitter to parse code files. The list of compatible files is already defined in the project and is based on the queries already implemented. A code segment has to meet a certain size constraint to be indexed; if a segment is too big, the parser processes each of its children individually instead.

The quality of the code segments depends on the tree sitter queries already defined.
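
The size rule described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual parser: `SyntaxNodeLike`, `CodeBlock`, `collectBlocks`, and both size limits are hypothetical names and values chosen for the example, and the node shape is a simplified stand-in for a real tree-sitter node.

```typescript
// Simplified, hypothetical stand-in for a tree-sitter node.
interface SyntaxNodeLike {
  text: string
  startLine: number
  endLine: number
  children: SyntaxNodeLike[]
}

interface CodeBlock {
  content: string
  startLine: number
  endLine: number
}

// Assumed limits for illustration only; the real values are not stated here.
const MIN_BLOCK_CHARS = 20
const MAX_BLOCK_CHARS = 200

// Walk a captured node and collect the segments that are worth indexing.
function collectBlocks(node: SyntaxNodeLike, out: CodeBlock[] = []): CodeBlock[] {
  const size = node.text.length
  if (size < MIN_BLOCK_CHARS) {
    return out // too small to be indexed
  }
  if (size > MAX_BLOCK_CHARS && node.children.length > 0) {
    // Too big: process each child individually instead of the whole node.
    for (const child of node.children) {
      collectBlocks(child, out)
    }
    return out
  }
  // Within bounds (or oversized with no children to descend into): keep as one block.
  out.push({ content: node.text, startLine: node.startLine, endLine: node.endLine })
  return out
}
```

With these assumed limits, a 300-character node with one 100-character child and one 10-character child yields a single block for the 100-character child.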

After the initial indexing, the codebase indexing service starts a file watcher that watches for changes to the files it has already indexed, including deletions.
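
As a rough sketch of how watcher events might be applied to an index: changed files have their old entries replaced, and deleted files have their entries removed. Everything here (`FileEvent`, `InMemoryIndex`, `applyFileEvent`) is a hypothetical in-memory stand-in; the real service talks to Qdrant and receives events from the editor, neither of which is modelled.

```typescript
// Hypothetical event shape; how events are produced is out of scope here.
type FileEvent = { kind: "created" | "changed" | "deleted"; path: string }

// In-memory stand-in for the vector store, keyed by file path.
class InMemoryIndex {
  private blocksByPath = new Map<string, string[]>()

  // Replace everything stored for a file so stale blocks do not linger.
  replaceFile(path: string, blocks: string[]): void {
    this.blocksByPath.set(path, blocks)
  }

  removeFile(path: string): void {
    this.blocksByPath.delete(path)
  }

  blocksFor(path: string): string[] {
    return this.blocksByPath.get(path) ?? []
  }
}

// Apply one watcher event. `parse` is an assumed callback that returns
// the indexable blocks for a file path.
function applyFileEvent(
  index: InMemoryIndex,
  event: FileEvent,
  parse: (path: string) => string[],
): void {
  if (event.kind === "deleted") {
    index.removeFile(event.path)
    return
  }
  index.replaceFile(event.path, parse(event.path))
}
```

Replacing a file's entries wholesale on every change keeps the sketch simple; a real implementation would also need to embed the new blocks before storing them.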

Screenshots

image
image

How to Test

  1. Go to Settings and scroll down, set up your OpenAI key and Qdrant URL, and click Save.
  2. After the service is fully initialized, the codebase_search tool will be made available to new tasks.

Note: The tool is conditionally added to the system prompt; if the service is disabled, the tool will be completely missing from the system prompt.
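
A minimal sketch of what conditionally adding the tool could look like. Only the tool name codebase_search comes from this PR; `ToolDescription`, `buildToolSection`, `example_tool`, and the prompt layout are hypothetical and not taken from the actual system prompt code.

```typescript
interface ToolDescription {
  name: string
  description: string
}

// Placeholder for the tools that are always present (hypothetical entry).
const ALWAYS_AVAILABLE: ToolDescription[] = [
  { name: "example_tool", description: "Placeholder for an always-available tool." },
]

const CODEBASE_SEARCH: ToolDescription = {
  name: "codebase_search",
  description: "Search the indexed codebase using natural language.",
}

// Build the tools section of the prompt. When indexing is not ready,
// codebase_search is left out entirely rather than listed as disabled.
function buildToolSection(indexingReady: boolean): string {
  const tools = indexingReady ? [...ALWAYS_AVAILABLE, CODEBASE_SEARCH] : ALWAYS_AVAILABLE
  return tools.map((tool) => `## ${tool.name}\n${tool.description}`).join("\n\n")
}
```

Omitting the tool instead of marking it unavailable means the model never sees a tool it cannot call.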

Get in Touch

You know where to find me.


Important

Implements codebase indexing with OpenAI embeddings and Qdrant, adding a codebase_search tool for natural language code search, with comprehensive configuration and UI support.

  • Behavior:
    • Implements codebase indexing using OpenAI embeddings and Qdrant for vector storage.
    • Adds codebase_search tool for natural language code search.
    • Uses tree sitter for parsing code files, handling large segments by processing children.
    • Watches for file changes post-indexing, including deletions.
  • Configuration:
    • Adds CodeIndexConfigManager for managing configuration in config-manager.ts.
    • Supports configuration via CodeIndexSettings in CodeIndexSettings.tsx.
    • Validates configuration using zod schemas.
  • Components:
    • CodeIndexManager orchestrates indexing and searching.
    • QdrantVectorStore handles vector operations with Qdrant.
    • CodeParser and DirectoryScanner parse and scan code files.
    • FileWatcher monitors file changes.
  • UI:
    • CodebaseSearchResultsDisplay and CodebaseSearchResult components display search results.
    • Updates SettingsView to include codebase indexing settings.
  • Misc:
    • Updates ExtensionMessage and WebviewMessage to handle new message types.
    • Adds new tool descriptions and parameters in tools.ts.

This description was created by Ellipsis for 9f96628.

changeset-bot bot commented May 3, 2025

⚠️ No Changeset found

Latest commit: d2ee9fa

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

@daniel-lxs daniel-lxs changed the title Codebase indexing Codebase indexing (Clean history) May 3, 2025
@bitnom
bitnom commented May 6, 2025

Just saw this and haven't looked at the code yet but first thought:

  • Don't want to use openai embeddings.
  • Don't want to use qdrant.

I have a lot of past experience with both. They're fine but not what I would expect to be integrated in an OSS project such as this.

@daniel-lxs
Collaborator Author

Thank you for your feedback, @bitnom. Let me respond to your points:

Don't want to use openai embeddings.

It also supports Ollama

Don't want to use qdrant.

You didn't offer any alternatives. At this point changing the vector store shouldn't be too much work, but as you might understand, I would need some good reasons to change the vector store when I'm about finished with this.

@hannesrudolph hannesrudolph moved this from New to PR [Pre Approval Review] in Roo Code Roadmap May 7, 2025
@rapus95
rapus95 commented May 8, 2025

You didn't offer any alternatives. At this point changing the vector store shouldn't be too much work, but as you might understand, I would need some good reasons to change the vector store when I'm about finished with this.

While I'm totally new to this topic and can't offer any personal experience with it, and thus might be falling for some marketing claims, I recently searched for a graph DB for a RAG system. There I stumbled over FalkorDB, an extended RedisGraph that is based on Redis, as far as I understood it.

From there: https://www.falkordb.com/blog/vector-database-vs-graph-database/

FalkorDB is a low-latency graph database graph with select vector capabilities. It offers high-speed performance for both graph traversals and vector similarity searches.

And from what I read elsewhere, Graph databases hold huge potential as well. So having an integrated solution that can support both might have even more benefits in the long run if we ever decide to put the tasks and their interaction into some graph structure, augmented with code references etc. It's open source and has even a free hosted tier for those who don't care about why it's free as long as it's free :D

Edit: And, again without personal experience and totally unrelated, a framework to throw into the ring: https://ts.llamaindex.ai (there's also a Python variant, but given the codebase here is TypeScript, I guess the TS variant is more relevant)

@v0idRift

What about adding gemini-embedding-exp-03-07?

@daniel-lxs
Collaborator Author

@v0idRift
Other embedding providers can be added later; for this PR I'm just trying to add the base functionality.

@Shanmukh-C

@daniel-lxs I really like this work and tried it hands-on (even though I know it is still in the development phase).
I'm getting the error below: ERR CodeIndexManager not initialized. Call initialize() first.: Error: CodeIndexManager not initialized. Call initialize() first.

Any hint as to why this is happening and how it can be resolved?

@hannesrudolph hannesrudolph moved this from PR [Pre Approval Review] to PR [Draft/WIP] in Roo Code Roadmap May 10, 2025
@daniel-lxs
Collaborator Author

@Shanmukh-C you can go ahead and try again, let me know if the issue is fixed

@daniel-lxs daniel-lxs force-pushed the codebase-indexing-clean branch from cca718a to 8c23329 Compare May 11, 2025 18:40
@hannesrudolph
Collaborator
  1. Pasted_Image_2025-05-13__1_42 PM

  2. Set default Qdrant URL to localhost:6333

  3. Disable "Start Indexing" button until a model is selected

  4. "Clear Index Data" should show whether any indexing data exists and how much space it's using

@hannesrudolph
Collaborator

@mrubens I have reviewed this and @daniel-lxs has some bugs to squash. Once he is done, he is going to add the tool to the workflow and see how well it operates automatically. Right now it seemed to work quite well for exploring the codebase and was quite smooth once I was able to get it working.

@daniel-lxs
Collaborator Author

I took note of the bugs that @hannesrudolph was getting, but overall I think when he actually got the tool working it was pretty smooth.

I'll try to get some time to work on it this week.

@v0idRift

Please take a look at Gemini Embedding Experimental 03-07 if you have some time to play with it. This model looks promising. I know more models will be added in the future, TY <3

@daniel-lxs daniel-lxs force-pushed the codebase-indexing-clean branch from 446178e to 9f96628 Compare May 15, 2025 23:04
@daniel-lxs daniel-lxs marked this pull request as ready for review May 15, 2025 23:04
@daniel-lxs daniel-lxs requested review from mrubens and cte as code owners May 15, 2025 23:04
@dosubot dosubot bot added the size:XXL (This PR changes 1000+ lines, ignoring generated files.) and enhancement (New feature or request) labels May 15, 2025
@hannesrudolph
Collaborator
hannesrudolph commented May 16, 2025

image
image

ITS ALIVE!!

@hannesrudolph hannesrudolph moved this from PR [Draft/WIP] to PR [Pre Approval Review] in Roo Code Roadmap May 16, 2025
@hannesrudolph
Collaborator
Clipboard-20250514-004450-720.1.mp4
indexingWOOOOOOO.mp4

Here are two examples showing this working. You might need to pause them to see results

@daniel-lxs daniel-lxs force-pushed the codebase-indexing-clean branch from 269f7b2 to 2331cf3 Compare May 23, 2025 22:33
export const MAX_SEARCH_RESULTS = 50 // Maximum number of search results to return

/**File Watcher */
export const QDRANT_CODE_BLOCK_NAMESPACE = "f47ac10b-58cc-4372-a567-0e02b2c3d479"
Collaborator

Just curious, where does this come from?

Collaborator Author
@daniel-lxs daniel-lxs May 24, 2025

Think of this as a salt: it basically lets me generate deterministic UUIDs from the file paths, so I can use the file path to overwrite changed files and prevent their entries from being orphaned.
The reason I need deterministic UUIDs is that Qdrant doesn't allow me to save an arbitrary string as an ID, only a UUID.
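
One common way to get deterministic UUIDs from a namespace plus a name is a version 5 (name-based, SHA-1) UUID. The sketch below assumes that scheme; the comment above does not say which UUID version or library the PR uses. The namespace constant is the one from the diff, while `uuidV5`, `pointIdFor`, and the `path:startLine` naming are hypothetical.

```typescript
import { createHash } from "node:crypto"

// Namespace constant from the diff above.
const QDRANT_CODE_BLOCK_NAMESPACE = "f47ac10b-58cc-4372-a567-0e02b2c3d479"

// Name-based UUID (version 5): SHA-1 over namespace bytes followed by the name.
function uuidV5(name: string, namespace: string): string {
  const namespaceBytes = Buffer.from(namespace.replace(/-/g, ""), "hex")
  const digest = createHash("sha1").update(namespaceBytes).update(name, "utf8").digest()
  const bytes = digest.subarray(0, 16)
  // Set the version nibble to 5 and the variant bits to the RFC 4122 variant.
  bytes.writeUInt8((bytes.readUInt8(6) & 0x0f) | 0x50, 6)
  bytes.writeUInt8((bytes.readUInt8(8) & 0x3f) | 0x80, 8)
  const hex = bytes.toString("hex")
  return [
    hex.slice(0, 8),
    hex.slice(8, 12),
    hex.slice(12, 16),
    hex.slice(16, 20),
    hex.slice(20, 32),
  ].join("-")
}

// Hypothetical helper: the same file path and start line always map to the
// same point ID, so re-indexing a file overwrites its earlier entries.
function pointIdFor(filePath: string, startLine: number): string {
  return uuidV5(`${filePath}:${startLine}`, QDRANT_CODE_BLOCK_NAMESPACE)
}
```

Because the ID depends only on the namespace and the name, upserting a block for an unchanged path and line replaces the old point instead of adding a duplicate.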

@@ -0,0 +1,4 @@
import { extensions as allExtensions } from "../../tree-sitter"

// Filter out markdown extensions for the scanner
Collaborator

Is this just because markdown results aren’t that useful?

Collaborator Author
@daniel-lxs daniel-lxs May 23, 2025

This was an issue I had at the beginning: our markdown parser wasn't really parsing markdown in a meaningful way. However, I now have a simple parser that works as a fallback, so I should be able to re-enable markdown parsing.

Collaborator
@mrubens mrubens left a comment

💥

@mrubens
Collaborator
mrubens commented May 23, 2025

Can you fix the test and then we can merge? 🙏

@hannesrudolph
Collaborator

we have a winner!!

@hannesrudolph hannesrudolph merged commit 61e122d into RooCodeInc:main May 24, 2025
12 checks passed
@github-project-automation github-project-automation bot moved this from PR [Pre Approval Review] to Done in Roo Code Roadmap May 24, 2025
@seedlord

I'm having problems making Ollama work on Windows 10.
The Ollama log shows: decode: cannot decode batches with this context (use llama_encode() instead)
Changing line 38 in src/service/code-index/ollama.ts to input: text makes it start, but no code gets indexed; the index stays blank.
Any ideas?

Labels
enhancement (New feature or request), size:XXL (This PR changes 1000+ lines, ignoring generated files.)
10 participants