NewsBot is a question-answering tool for news articles and documents. Paste one or more URLs or upload a PDF, and NewsBot parses the content, indexes it, and answers your questions with an OpenAI language model. Built with Streamlit, LangChain, and OpenAI, it combines on-demand document parsing, semantic search over a FAISS vector index, and LLM-generated answers that cite the source they were drawn from.
📎 Accepts URLs or PDF uploads as content sources
🔍 Splits text into chunks with LangChain and embeds them using OpenAI Embeddings
⚡ Answers questions by retrieving the most relevant chunks with FAISS vector search
📚 Displays answers along with the original source
🧠 Powered by LangChain and OpenAI GPT models
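
To make the splitting step listed above concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a simplified stand-in, not LangChain's `RecursiveCharacterTextSplitter`, which additionally tries to break on separators such as paragraph and sentence boundaries. The function name `chunk_text` and the default sizes are illustrative assumptions, not values taken from NewsBot.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share `overlap` characters, so a sentence cut at
    one boundary still appears whole in a neighbouring chunk.
    Sizes are measured in characters (assumed defaults, for illustration).
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


if __name__ == "__main__":
    sample = "abcdefghij" * 3  # 30 characters
    for piece in chunk_text(sample, chunk_size=12, overlap=4):
        print(repr(piece))
```

The overlap trades a little extra embedding cost for better recall: a fact that straddles a chunk boundary is still retrievable from at least one chunk.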
        ┌─────────────────────────────┐
        │ Streamlit UI                │
        │ - Sidebar: URL/PDF inputs   │
        │ - Main: Question + Answer   │
        └────────────┬────────────────┘
                     │
                     ▼ Process Trigger
     (when user clicks "Load Source")
                     │
                     ▼
┌─────────────────────────────────────────────────────┐
│ Content Loading & Preprocessing                     │
│─────────────────────────────────────────────────────│
│ If URL(s):                                          │
│   → UnstructuredURLLoader fetches & parses text     │
│                                                     │
│ If PDF:                                             │
│   → PyPDF2 reads pages                              │
│   → LangChain `Document` created from text          │
└────────────────────┬────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────┐
│ Recursive Text Chunking (LangChain)                 │
│ - Uses RecursiveCharacterTextSplitter               │
│ - Breaks content into 1000-char overlapping chunks  │
└────────────────────┬────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────┐
│ OpenAI Embeddings + FAISS Indexing                  │
│ - Embeds chunks using OpenAIEmbeddings              │
│ - Stores vectors in FAISS index                     │
│ - Saves (index, docstore, id_map) in pickle file    │
└────────────────────┬────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────┐
│ Question-Answer Inference Pipeline                  │
│ - User asks question                                │
│ - FAISS index loaded                                │
│ - Vector similarity search retrieves top docs       │
│ - LangChain `RetrievalQAWithSourcesChain` runs LLM  │
│ - GPT generates final answer using retrieved docs   │
└────────────────────┬────────────────────────────────┘
                     │
                     ▼
             ┌───────────────┐
             │ UI Output     │
             │ - Answer      │
             │ - Source(s)   │
             └───────────────┘
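
The retrieval stage of the pipeline above (embed the question, find the most similar chunks, hand them to the LLM together with their sources) can be illustrated with a toy, dependency-free sketch. Everything here is an assumption made for illustration: word-count vectors stand in for `OpenAIEmbeddings`, a brute-force `TinyIndex` stands in for the FAISS index, and `build_prompt` is a hypothetical helper rather than the prompt `RetrievalQAWithSourcesChain` actually uses.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    source: str  # URL or file name the chunk came from


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for OpenAIEmbeddings)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class TinyIndex:
    """In-memory stand-in for the FAISS index: brute-force similarity search."""

    def __init__(self, chunks: list[Chunk]):
        self.entries = [(embed(c.text), c) for c in chunks]

    def search(self, question: str, k: int = 2) -> list[Chunk]:
        q = embed(question)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[0]), reverse=True)
        return [chunk for _, chunk in ranked[:k]]


def build_prompt(question: str, retrieved: list[Chunk]) -> str:
    """Assemble the context an LLM would receive, each chunk tagged with its source."""
    context = "\n".join(f"[{c.source}] {c.text}" for c in retrieved)
    return (
        "Answer using only the context below and cite sources.\n"
        f"{context}\nQuestion: {question}"
    )


if __name__ == "__main__":
    index = TinyIndex([
        Chunk("The central bank raised interest rates by a quarter point.",
              "https://example.com/rates"),
        Chunk("The city marathon drew a record number of runners.",
              "https://example.com/marathon"),
        Chunk("Analysts expect interest rates to stay high this year.", "report.pdf"),
    ])
    question = "What happened to interest rates?"
    print(build_prompt(question, index.search(question, k=2)))
```

Keeping the source next to each chunk is what lets the final answer be shown alongside the article or file it came from.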
- 🖼️ Image-Based Text Extraction
  - Integrate Optical Character Recognition (OCR) to extract and analyze text from images embedded within documents or uploaded directly.
- 📝 Handwritten & Scanned Document Support
  - Extend compatibility to scanned PDFs and handwritten content using OCR tools, enabling processing of a broader range of document types.
- 📚 Multi-Document Cross Analysis
  - Allow users to submit multiple documents or articles simultaneously for comparison, aggregation, and context-aware question answering.
- 🧠 Conversational Memory (Chain Memory)
  - Introduce memory components that preserve the context of previous interactions, enabling multi-turn dialogue and a more natural Q&A experience.
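
As a rough idea of what the planned conversational memory could look like, the sketch below keeps a rolling window of recent turns and prepends it to each new question. The class `ConversationMemory` and its methods are hypothetical names chosen for illustration; they are not LangChain's memory classes and not part of NewsBot today.

```python
from collections import deque


class ConversationMemory:
    """Keeps the last `max_turns` question/answer pairs (hypothetical helper)."""

    def __init__(self, max_turns: int = 3):
        # deque with maxlen silently drops the oldest turn once full
        self.turns = deque(maxlen=max_turns)

    def add(self, question: str, answer: str) -> None:
        self.turns.append((question, answer))

    def as_context(self) -> str:
        """Render the remembered turns as a plain-text transcript."""
        return "\n".join(f"User: {q}\nBot: {a}" for q, a in self.turns)

    def contextualize(self, question: str) -> str:
        """Prefix a new question with the remembered history, if any."""
        history = self.as_context()
        return f"{history}\nUser: {question}" if history else f"User: {question}"


if __name__ == "__main__":
    memory = ConversationMemory(max_turns=2)
    memory.add("Who wrote the article?", "The byline names a staff reporter.")
    print(memory.contextualize("When was it published?"))
```

Bounding the window keeps the prompt size predictable as a conversation grows; a summary of older turns would be another option.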
© Created by Vetrivel Maheswaran