🚀 ArQiv Search Engine

High-performance semantic search for ArXiv research papers


Features • Getting Started • Usage • Architecture • Benchmarks • Live Demo



📖 Overview

ArQiv is a state-of-the-art search engine designed specifically for the ArXiv research corpus. It combines multiple indexing strategies and ranking algorithms to deliver lightning-fast, relevant results that help researchers discover papers more efficiently.

Note: only the first 1,000 documents are loaded by default; change `sample_size` in `loader.py` to load more documents or the full dataset.

✨ Key Features

πŸ” Optimized Data Structures

  • Inverted Index: Rapid term lookups with positional data
  • Trie (Prefix Tree): Instant autocomplete and fuzzy matching
  • Bitmap Index: Ultra-fast Boolean operations with vectorization
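The positional inverted index described above can be sketched in a few lines. This is an illustrative toy (term → document → positions), not the project's actual `index/inverted_index.py`:

```python
from collections import defaultdict

class InvertedIndex:
    """Minimal positional inverted index: term -> {doc_id: [positions]}."""

    def __init__(self):
        self.postings = defaultdict(lambda: defaultdict(list))

    def add(self, doc_id, text):
        # Record every position at which each term occurs in the document.
        for pos, term in enumerate(text.lower().split()):
            self.postings[term][doc_id].append(pos)

    def lookup(self, term):
        # Return a plain dict of {doc_id: [positions]} for one term.
        return dict(self.postings.get(term.lower(), {}))

index = InvertedIndex()
index.add("doc1", "bitmap indexes speed up boolean search")
index.add("doc2", "boolean search with tries")
```

The positional data is what makes phrase queries possible: two terms form a phrase in a document when their position lists contain consecutive offsets.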

📊 Advanced Ranking Algorithms

  • BM25 Ranking: Precise probabilistic relevance scoring
  • TF-IDF Ranking: Robust vectorized similarity computation
  • Fast Vector Ranking: Real-time ranking via NearestNeighbors
  • Optional BERT Ranking: Deep semantic ranking with transformers
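BM25 scores a document d for a query q as a sum over query terms of IDF(t) · f(t,d)·(k1+1) / (f(t,d) + k1·(1 − b + b·|d|/avgdl)). A self-contained sketch of that formula follows; the parameter defaults (k1=1.5, b=0.75) are common textbook choices, not necessarily the ones used in `ranking/bm25.py`:

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document in `docs` against the query with Okapi BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N          # average document length
    tfs = [Counter(d) for d in docs]               # per-document term frequencies
    # Document frequency of each query term (how many docs contain it).
    df = Counter(t for tf in tfs for t in set(tf) if t in query_terms)
    scores = []
    for tf, d in zip(tfs, docs):
        s = 0.0
        for t in query_terms:
            if tf[t] == 0:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores
```

The length-normalization term (1 − b + b·|d|/avgdl) is what keeps long documents from dominating purely by repeating query terms.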

🖥️ Multiple Interfaces

  • Rich CLI: Colorful terminal interface with detailed results
  • Streamlit Web App: Interactive web UI with visualizations
  • In-memory Query Caching: Near-instant response on repeated queries
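The in-memory query cache can be as simple as memoizing the search entry point. A sketch using `functools.lru_cache`; the `cached_search` function is a stand-in, not the project's API:

```python
from functools import lru_cache

@lru_cache(maxsize=256)
def cached_search(query: str):
    # Stand-in for the real ranking pipeline; repeated queries hit
    # the cache instead of re-scoring the corpus. Must return a
    # hashable/immutable value so the cache can store it safely.
    return tuple(sorted(query.lower().split()))

cached_search("deep learning")     # computed on the first call
cached_search("deep learning")     # served from cache on the second
info = cached_search.cache_info()  # exposes hits/misses for inspection
```

One caveat of this pattern: the cache must be invalidated (via `cache_clear()`) whenever the underlying index changes.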

⚡ Performance & Scalability

  • Parallel Processing: Multi-core indexing for large datasets
  • Modular Design: Extensible architecture for new features
  • Memory Efficient: Optimized for performance on standard hardware
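The multi-core indexing pattern amounts to: split the corpus into chunks, index each chunk independently, then merge the partial indexes. A sketch of that pattern (shown with a thread pool so it runs anywhere; a real multi-core build would swap in `multiprocessing.Pool` with the same `map` call):

```python
from multiprocessing.dummy import Pool  # thread-backed Pool; same API as multiprocessing.Pool
from collections import defaultdict

def index_chunk(chunk):
    """Build a partial term -> doc_ids index for one slice of the corpus."""
    partial = defaultdict(set)
    for doc_id, text in chunk:
        for term in text.lower().split():
            partial[term].add(doc_id)
    return partial

def parallel_index(corpus, workers=4):
    # Round-robin split so chunks are roughly equal in size.
    chunks = [corpus[i::workers] for i in range(workers)]
    with Pool(workers) as pool:
        partials = pool.map(index_chunk, chunks)
    # Merge the partial indexes into one.
    merged = defaultdict(set)
    for partial in partials:
        for term, ids in partial.items():
            merged[term] |= ids
    return merged
```

Because each chunk is indexed independently, the merge step is the only serial portion of the build.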

🚀 Getting Started

Prerequisites

  • Python 3.7+
  • 4GB+ RAM (8GB recommended for full dataset)
  • Internet connection for initial dataset download

Installation

  1. Clone the repository:

    git clone https://github.com/tejas242/ArQiv.git
    cd ArQiv
  2. Create a virtual environment:

    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies:

    pip install -r requirements.txt
  4. Download NLTK resources (one-time):

    import nltk
    nltk.download('stopwords')

📋 Usage

Command Line Interface

Launch the rich CLI interface:

python cli.py

Streamlit Web Interface

Start the interactive web application:

cd streamlit
streamlit run streamlit_app.py

Or use the hosted version:

https://arqiv-search.streamlit.app/

CLI Demo

πŸ—οΈ Architecture

ArQiv employs a layered architecture with the following components:

┌─────────────────────────────────┐
│         User Interfaces         │
│   CLI Interface  │  Streamlit   │
├─────────────────────────────────┤
│        Ranking Algorithms       │
│  BM25 │ TF-IDF │ Vector │ BERT  │
├─────────────────────────────────┤
│         Index Structures        │
│ Inverted Index │ Trie │ Bitmap  │
├─────────────────────────────────┤
│            Data Layer           │
│ Document Model │ Preprocessing  │
└─────────────────────────────────┘

📊 Benchmarks

| Task                   | Performance |
| ---------------------- | ----------- |
| Index 1,000 documents  | 0.8 seconds |
| Boolean search         | < 5 ms      |
| BM25 ranking           | ~50-100 ms  |
| TF-IDF ranking         | < 5 ms      |
| Fast Vector ranking    | < 5 ms      |
| BERT ranking           | ~200 ms     |

Measured on a Ryzen 3 CPU with 8 GB RAM.
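Numbers like these can be reproduced with a small timing harness; this is a generic sketch, not a benchmark script shipped with the repository:

```python
import time

def benchmark(fn, *args, repeats=5):
    """Return the best wall-clock time in seconds over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best

# Example: time a stand-in workload (sorting 100k integers).
elapsed = benchmark(sorted, list(range(100_000)))
```

Taking the best of several runs rather than the mean reduces noise from caches warming up and background processes.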

⚠️ Limitations & Challenges

While ArQiv is designed to be powerful and efficient, users should be aware of these limitations:

  • Memory Usage: The in-memory index requires approximately 50MB per 1,000 documents
  • Scalability Ceiling: Performance may degrade with datasets beyond 100,000 documents without distributed architecture
  • Language Support: Currently optimized for English text only
  • Neural Features: BERT-based ranking requires substantial CPU resources (GPU recommended)
  • Cold Start: First-time initialization has higher latency while building indices
  • Preprocessing Effects: Stemming may occasionally lead to unexpected term matches

For large-scale deployment scenarios, consider implementing a sharded index architecture or using a database backend for the inverted index.
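One way to sketch the sharded-index idea suggested above is to route each document to a shard by hashing its id, then fan queries out to every shard and merge. The `ShardedIndex` class here is hypothetical, not part of the current codebase:

```python
import hashlib
from collections import defaultdict

class ShardedIndex:
    """Route each document to one of N shards by hashing its id."""

    def __init__(self, num_shards=4):
        self.shards = [defaultdict(set) for _ in range(num_shards)]

    def _shard_for(self, doc_id):
        # Stable hash so a document always lands on the same shard.
        digest = hashlib.md5(doc_id.encode()).hexdigest()
        return int(digest, 16) % len(self.shards)

    def add(self, doc_id, text):
        shard = self.shards[self._shard_for(doc_id)]
        for term in text.lower().split():
            shard[term].add(doc_id)

    def search(self, term):
        # Fan out to every shard and merge the results; in a real
        # deployment the shards could live in separate processes or
        # on separate machines.
        results = set()
        for shard in self.shards:
            results |= shard.get(term.lower(), set())
        return results
```

Sharding by document id keeps writes independent per shard, at the cost of querying all shards; sharding by term would invert that trade-off.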

πŸ“ Project Structure

Click to expand directory structure
arqiv/
├── data/                     # Data handling components
│   ├── document.py           # Document model
│   ├── preprocessing.py      # Text processing utilities
│   └── loader.py             # Dataset loading
├── index/                    # Indexing structures
│   ├── inverted_index.py     # Main indexing engine
│   ├── trie.py               # Prefix tree for autocomplete
│   └── bitmap_index.py       # Bitmap for fast boolean ops
├── ranking/                  # Ranking algorithms
│   ├── bm25.py               # BM25 implementation
│   ├── tfidf.py              # TF-IDF with scikit-learn
│   ├── fast_vector_ranker.py # Vector-based ranking
│   └── bert_ranker.py        # Neural ranking with BERT
├── search/                   # Search functionality
│   └── fuzzy_search.py       # Approximate string matching
├── streamlit/                # Web interface
│   └── streamlit_app.py      # Streamlit application
├── docs/                     # Documentation
├── tests/                    # Test suite
├── cli.py                    # Command-line interface
└── README.md                 # This file

🔧 Troubleshooting

Common issues and solutions

Dataset Download Issues

  • Ensure Kaggle API credentials are set up correctly
  • For manual download: Place the arxiv-metadata-oai-snapshot.json in the data/ directory

Performance Problems

  • Slow searches: prefer BM25 or TF-IDF ranking over BERT for faster results
  • Slow indexing: increase the number of worker processes in the parallel indexing option

Memory Usage

  • If experiencing memory issues: Reduce sample_size parameter in load_arxiv_dataset()

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add some amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

πŸ“ License

This project is licensed under the MIT License - see the LICENSE file for details.


Built with ⚡ by Tejas Mahajan
