- To run this project, install the libraries listed in code-chunker-req.txt (optionally in a new conda environment).
- To create the FAISS vector index and the documents, refer to the code in code-chunker/pg.ipynb.
- To test the newly created FAISS vector index and documents, refer to the code in pg.ipynb.
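The notebook is not reproduced here, so the following is only a rough sketch of what the "creating documents" step could look like. It assumes one document per top-level function or class, extracted with Python's standard `ast` module. The function name `chunk_python_source` and the dictionary fields are hypothetical, and embedding the chunks and adding them to a FAISS index are not shown.

```python
import ast


def chunk_python_source(source: str, path: str = "<memory>") -> list:
    """Split Python source into one chunk per top-level function or class.

    Hypothetical helper: the real notebook may chunk differently.
    """
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append(
                {
                    "name": node.name,           # function or class name
                    "path": path,                # where the chunk came from
                    "start_line": node.lineno,   # 1-based line of the definition
                    "text": ast.get_source_segment(source, node),
                }
            )
    return chunks


# Toy input standing in for a source file from the knowledge base.
example = '''
def num_trees(ts):
    return ts.num_trees

class Helper:
    def run(self):
        return 1
'''

documents = chunk_python_source(example, path="example.py")
for doc in documents:
    print(doc["name"], doc["start_line"])
```

Chunking at definition boundaries keeps each document self-contained, which tends to make retrieved context easier for the model to use than fixed-size text windows.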
Query-Based Code Generation and Analysis of Tree Sequences Using LLMs
The goal is to leverage large language models (LLMs) to generate code and analyze tree sequences with tskit from questions asked in plain English. Using Retrieval-Augmented Generation (RAG), the system takes a user's question and generates executable tskit code that answers it.
In this initial proof of concept, the tskit source code serves as the knowledge base for the LLM. When a user enters a query in natural language, the LLM generates code grounded in that knowledge base and returns a Python function as the response.
The current version uses a naive prompt-to-answer approach that does not evaluate the accuracy of the generated code.
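The naive retrieve-then-prompt flow described above can be sketched as follows. This is a simplified illustration, not the project's code: a token-overlap score stands in for the FAISS similarity search, the knowledge-base strings are made-up placeholders rather than actual tskit source, and `retrieve` and `build_prompt` are hypothetical names. The call to the LLM itself is omitted.

```python
import re


def tokenize(text: str) -> set:
    """Lowercase the text and return its set of word-like tokens."""
    return set(re.findall(r"[a-z_]+", text.lower()))


def retrieve(query: str, documents: list, k: int = 2) -> list:
    """Return the k documents sharing the most tokens with the query.

    Stand-in for a vector search; a real setup would query the FAISS index.
    """
    query_tokens = tokenize(query)
    ranked = sorted(
        documents,
        key=lambda doc: len(query_tokens & tokenize(doc)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query: str, context_chunks: list) -> str:
    """Assemble a single prompt from retrieved context and the user question."""
    context = "\n\n".join(context_chunks)
    return (
        "You are given excerpts from the tskit source code.\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
        "Answer with a single Python function."
    )


# Placeholder snippets (assumed for illustration, not real tskit source).
knowledge_base = [
    "def num_trees(self): return the number of trees in the tree sequence",
    "def simplify(self, samples): return a simplified tree sequence",
    "def dump(self, path): write the tree sequence to a file",
]

query = "How many trees are in this tree sequence?"
top = retrieve(query, knowledge_base, k=1)
prompt = build_prompt(query, top)
print(prompt)
```

Because the prompt is sent once and the answer is returned as-is, nothing in this flow checks whether the generated function actually runs or is correct.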
Possible improvements:
- Improve code generation with a flow-engineering approach, using LangGraph and OpenAI function calling to set up the workflow.
- Execute the generated code with error checking.
- Support multiple iterations.
- Provide a terminal chat interface or a UI (Flask and React.js).
- Add a human in the loop, so that a person can review or correct the generated code.
- Add a node (tool) for general tree-sequence questions that are not related to code generation.
- Measure the accuracy and reliability of the generated answers.
- Enhance tree-sequence analysis. One option is MemoRAG, a memory-based knowledge discovery approach for long contexts.
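As a rough illustration of the "code execution with error checking" and "multiple iterations" items, the loop below runs generated code, captures any traceback, and feeds it back to the generator for another attempt. It is a minimal sketch under stated assumptions: `fake_generate` is a hypothetical stand-in for an LLM call, the generated code is assumed to define a function named `answer`, and no LangGraph or OpenAI API is used.

```python
import traceback


def fake_generate(question, feedback=None):
    """Hypothetical stand-in for an LLM call.

    Returns buggy code on the first try and corrected code once it
    receives error feedback.
    """
    if feedback is None:
        return "def answer(values):\n    return sum(values) / len(value)\n"
    return "def answer(values):\n    return sum(values) / len(values)\n"


def generate_with_retries(question, generate, test_input, max_iterations=3):
    """Generate code, execute it, and retry with the traceback on failure."""
    feedback = None
    for attempt in range(1, max_iterations + 1):
        code = generate(question, feedback)
        namespace = {}
        try:
            # exec runs arbitrary code; a real system should sandbox this step.
            exec(code, namespace)
            result = namespace["answer"](test_input)
            return {"code": code, "result": result, "attempts": attempt}
        except Exception:
            # Keep the full traceback so the next generation can use it.
            feedback = traceback.format_exc()
    raise RuntimeError(
        f"No working code after {max_iterations} attempts:\n{feedback}"
    )


outcome = generate_with_retries("What is the mean?", fake_generate, [1, 2, 3])
print(outcome["attempts"], outcome["result"])
```

A human-in-the-loop step could slot in before the `return`, pausing for review of `code` instead of accepting the first version that runs without errors.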