This project implements a contextual chatbot that can answer questions based on uploaded documents. It uses FastAPI for the backend, Milvus for vector storage, and GPT-Neo for text generation.
- Upload a document:
- Click on the "Choose File" button in the "Upload Document" section.
- Select a PDF or DOCX file from your computer.
- Click the "Upload" button to process and store the document.
- Ask questions:
- Enter your question in the text box in the "Ask a Question" section.
- Click the "Submit" button to get a response from the chatbot.
- View chat history:
- The chat history will be displayed in the section below the query form.
docker-compose up
python -m venv .venv
source .venv/bin/activate
pip install -r requirement.txt
sh startup.sh
- PDF -> Extract Text -> Chunking -> Embedding -> Save to DB
- Query -> Embedding -> Search in DB -> Retrieve Relevant Chunks -> LLM -> Response
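A minimal, self-contained sketch of both flows, assuming sentence-transformers for embeddings and an in-memory store in place of Milvus; all names here are illustrative, not the project's actual code.

```python
from pypdf import PdfReader
from sentence_transformers import SentenceTransformer
import numpy as np

embedder = SentenceTransformer("all-MiniLM-L6-v2")   # embedding model from the config
store: list[tuple[str, np.ndarray]] = []             # (chunk text, embedding)

def ingest(pdf_path: str, chunk_size: int = 1000) -> None:
    # PDF -> Extract Text -> Chunking -> Embedding -> Save to (in-memory) DB
    text = "".join(page.extract_text() or "" for page in PdfReader(pdf_path).pages)
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    for chunk, vec in zip(chunks, embedder.encode(chunks, normalize_embeddings=True)):
        store.append((chunk, vec))

def retrieve(query: str, top_k: int = 3) -> list[str]:
    # Query -> Embedding -> Search in DB -> Retrieve Relevant Chunks (context for the LLM)
    q = embedder.encode([query], normalize_embeddings=True)[0]
    ranked = sorted(store, key=lambda item: -float(np.dot(item[1], q)))
    return [chunk for chunk, _ in ranked[:top_k]]
```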
- FastAPI - used for the backend with two endpoints (sketched below):
  - POST /upload: upload and process a document (PDF or DOCX)
  - POST /query: get a response from the LLM for a given query
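A minimal sketch of the two endpoints, assuming FastAPI with pydantic; field names and the omitted processing steps are illustrative, not the project's actual schemas.

```python
from fastapi import FastAPI, File, UploadFile
from pydantic import BaseModel

app = FastAPI()

class QueryRequest(BaseModel):
    question: str

@app.post("/upload")
async def upload(file: UploadFile = File(...)):
    content = await file.read()  # raw PDF/DOCX bytes
    # ... extract text, chunk, embed, and store in the vector DB ...
    return {"status": "ok", "filename": file.filename}

@app.post("/query")
async def query(req: QueryRequest):
    # ... embed the question, retrieve top-k chunks, call the LLM ...
    return {"answer": "..."}
```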
- Front-End
- A very simple index.html file is kept, which can be used to upload documents and submit queries. The conversation history is also visible.
- Performance Evaluation
- Used ragas for synthetic data generation and performance evaluation
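A hedged sketch of a ragas evaluation call; the ragas API changes between versions, and the metrics and column names below are assumptions, not necessarily what eval.py uses.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

data = Dataset.from_dict({
    "question":     ["What is the notice period?"],
    "answer":       ["The notice period is 30 days."],                   # service response
    "contexts":     [["Either party may terminate with 30 days notice."]],
    "ground_truth": ["30 days."],                                        # from the synthetic testset
})

result = evaluate(data, metrics=[faithfulness, answer_relevancy])
print(result.to_pandas())
```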
- Vector DB
- Used Milvus as the vector DB
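A minimal pymilvus sketch of a collection for chunk storage and a similarity search; the collection name, field names, and index parameters are illustrative assumptions.

```python
from pymilvus import (
    Collection, CollectionSchema, DataType, FieldSchema, connections,
)

connections.connect(alias="default", host="localhost", port="19530")

schema = CollectionSchema([
    FieldSchema("id", DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema("text", DataType.VARCHAR, max_length=65535),
    FieldSchema("embedding", DataType.FLOAT_VECTOR, dim=384),  # matches embedding_dim
])
collection = Collection("doc_chunks", schema)
collection.create_index(
    "embedding",
    {"index_type": "IVF_FLAT", "metric_type": "L2", "params": {"nlist": 128}},
)
collection.load()

# Insert chunks (text column, embedding column), then search with a query embedding:
collection.insert([["chunk text ..."], [[0.0] * 384]])
hits = collection.search(
    data=[[0.0] * 384], anns_field="embedding",
    param={"metric_type": "L2", "params": {"nprobe": 10}},
    limit=3, output_fields=["text"],
)
```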
- Document Processing
- Used LangChain for parsing and processing the documents.
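A hedged sketch of document parsing with LangChain loaders and a text splitter; import paths differ between LangChain versions, and the loaders, splitter, and chunk_overlap value below are assumptions rather than the project's exact setup.

```python
from langchain_community.document_loaders import Docx2txtLoader, PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

def load_and_chunk(path: str) -> list[str]:
    # Pick a loader by file type, then split into chunks sized per the config.
    loader = PyPDFLoader(path) if path.endswith(".pdf") else Docx2txtLoader(path)
    docs = loader.load()
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    return [chunk.page_content for chunk in splitter.split_documents(docs)]
```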
- LLM
- OpenAI is used to generate synthetic data (as the ragas backend), since this is a one-time task.
- As the RAG retriever you can currently use Hugging Face, Ollama, or OpenAI models. However, the module can easily be modified to support other models.
- Three models were tested: (1) Ollama Llama 3.2, (2) OpenAI models, (3) HF GPT-Neo.
- Ollama latency is poor locally, but deploying Ollama on a server would make it faster. Accuracy-wise it is good.
- The OpenAI model is accurate with good latency.
- To use Ollama, set up Ollama locally.
- To use OpenAI, set the API key in the environment.
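A hedged sketch of how the three generation backends could be switched; the function and the model names are illustrative, not the project's actual module.

```python
def generate(prompt: str, backend: str = "openai") -> str:
    if backend == "openai":      # requires OPENAI_API_KEY in the environment
        from openai import OpenAI
        client = OpenAI()
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content
    if backend == "ollama":      # requires a local Ollama server with the model pulled
        import ollama
        resp = ollama.chat(model="llama3.2", messages=[{"role": "user", "content": prompt}])
        return resp["message"]["content"]
    if backend == "gpt-neo":     # Hugging Face model, runs locally
        from transformers import pipeline
        gen = pipeline("text-generation", model="EleutherAI/gpt-neo-1.3B")
        return gen(prompt, max_new_tokens=200)[0]["generated_text"]
    raise ValueError(f"unknown backend: {backend}")
```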
- Keep all the PDFs in the /data/pdf folder.
- python evaluation/systhetic_data_generation.py
- python evaluation/eval.py
python evaluation/synthetic_data_generate.py --pdf_directory="data/pdfs" --num_questions=12
- Run python evaluation/systhetic_data_generation.py to generate the synthetic test question-answer pairs.
- Run the service and then python evaluation/eval.py to evaluate the service accuracy.
New PDF -> Generate QA pairs -> Evaluate -> Update the config -> Evaluate
1. Generate QA pairs (evaluation/systhetic_data_generation.py)
2. Evaluate (evaluation/eval.py)
3. Check the score (score.csv) and update the config (config.py)
4. Repeat steps 2 and 3
- testset.csv -> evaluation/systhetic_data_generation.py -> generates the question-answer pairs (GT)
- question_answer.csv -> intermediate file
- score.csv -> evaluation/eval.py -> predicts an answer and evaluates the model by comparing it with the GT
- config.json -> config.py -> stores the configuration
{
  "chunk_size": 1000,                 # size of each text chunk
  "model_name": "all-MiniLM-L6-v2",   # embedding model
  "embedding_dim": 384,
  "top_k": 3,                         # number of relevant chunks used as context for the retriever
  "retrival_model": "openai"          # openai/ollama/gpt-neo
}
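A minimal sketch of how config.py could expose config.json, assuming a flat JSON file with the keys shown above (the # comments above are explanatory and would not appear in actual JSON); the real config.py may differ.

```python
import json
from pathlib import Path

CONFIG = json.loads(Path("config.json").read_text())

CHUNK_SIZE      = CONFIG["chunk_size"]       # 1000
EMBEDDING_MODEL = CONFIG["model_name"]       # "all-MiniLM-L6-v2"
EMBEDDING_DIM   = CONFIG["embedding_dim"]    # 384
TOP_K           = CONFIG["top_k"]            # 3
RETRIEVAL_MODEL = CONFIG["retrival_model"]   # "openai" / "ollama" / "gpt-neo"
```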
- Flow: PDF -> Extract Text -> Chunking -> Embedding -> Save to DB -> Query -> Embedding -> Search in DB -> Retrieve Chunks -> Generate Response
- Build a boilerplate application
- Refactor the code to separate the DB and model
- Add logger and timer
- Setup evaluation pipeline
- Prepare question answer pairs for evaluation
- Setup mlops pipeline, versioning:
- Things to track (Store in DB)
- Chunk Size
- Embedding Model name
- Embedding dimension
- Things to track (Retrieve):
- LLM model name
- Latency
- Accuracy
- Cost
- Streaming response setup
- Unit Testing Response
- gpt x - accuracy not good, response time ~30 s, tokens/s = 24.27
- ollama (llama 2.1) - accuracy good, response time ~30 s, tokens/s = 1.9
- gpt-4o-mini - accuracy good, response time 4.6 s, tokens/s = 10.6