🚀 ArQiv Search Engine

High-performance semantic search for ArXiv research papers


Features • Getting Started • Usage • Architecture • Benchmarks • Live Demo



📖 Overview

ArQiv is a state-of-the-art search engine designed specifically for the ArXiv research corpus. It combines multiple indexing strategies and ranking algorithms to deliver lightning-fast, relevant results that help researchers discover papers more efficiently.

Note: only the first 1,000 documents are loaded by default; change `sample_size` in `loader.py` to load more documents or the full dataset.

✨ Key Features

πŸ” Optimized Data Structures

  • Inverted Index: Rapid term lookups with positional data
  • Trie (Prefix Tree): Instant autocomplete and fuzzy matching
  • Bitmap Index: Ultra-fast Boolean operations with vectorization
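The positional inverted index described above can be sketched in a few lines. This is an illustrative toy (term → document → positions), not the project's actual `index/inverted_index.py`:

```python
from collections import defaultdict

class InvertedIndex:
    """Minimal positional inverted index: term -> {doc_id: [positions]}."""

    def __init__(self):
        self.postings = defaultdict(lambda: defaultdict(list))

    def add(self, doc_id, text):
        # Record every position at which each term occurs in the document.
        for pos, term in enumerate(text.lower().split()):
            self.postings[term][doc_id].append(pos)

    def lookup(self, term):
        # Return a plain dict of {doc_id: [positions]} for one term.
        return dict(self.postings.get(term.lower(), {}))

index = InvertedIndex()
index.add("doc1", "bitmap indexes speed up boolean search")
index.add("doc2", "boolean search with tries")
```

The positional data is what makes phrase queries possible: two terms form a phrase in a document when their position lists contain consecutive offsets.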

📊 Advanced Ranking Algorithms

  • BM25 Ranking: Precise probabilistic relevance scoring
  • TF-IDF Ranking: Robust vectorized similarity computation
  • Fast Vector Ranking: Real-time ranking via NearestNeighbors
  • Optional BERT Ranking: Deep semantic ranking with transformers
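BM25 scores a document d for a query q as a sum over query terms of IDF(t) · f(t,d)·(k1+1) / (f(t,d) + k1·(1 − b + b·|d|/avgdl)). A self-contained sketch of that formula follows; the parameter defaults (k1=1.5, b=0.75) are common textbook choices, not necessarily the ones used in `ranking/bm25.py`:

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document in `docs` against the query with Okapi BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N          # average document length
    tfs = [Counter(d) for d in docs]               # per-document term frequencies
    # Document frequency of each query term (how many docs contain it).
    df = Counter(t for tf in tfs for t in set(tf) if t in query_terms)
    scores = []
    for tf, d in zip(tfs, docs):
        s = 0.0
        for t in query_terms:
            if tf[t] == 0:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores
```

The length-normalization term (1 − b + b·|d|/avgdl) is what keeps long documents from dominating purely by repeating query terms.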

🖥️ Multiple Interfaces

  • Rich CLI: Colorful terminal interface with detailed results
  • Streamlit Web App: Interactive web UI with visualizations
  • In-memory Query Caching: Near-instant response on repeated queries
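The in-memory query cache can be as simple as memoizing the search entry point. A sketch using `functools.lru_cache`; the `cached_search` function is a stand-in, not the project's API:

```python
from functools import lru_cache

@lru_cache(maxsize=256)
def cached_search(query: str):
    # Stand-in for the real ranking pipeline; repeated queries hit
    # the cache instead of re-scoring the corpus. Must return a
    # hashable/immutable value so the cache can store it safely.
    return tuple(sorted(query.lower().split()))

cached_search("deep learning")     # computed on the first call
cached_search("deep learning")     # served from cache on the second
info = cached_search.cache_info()  # exposes hits/misses for inspection
```

One caveat of this pattern: the cache must be invalidated (via `cache_clear()`) whenever the underlying index changes.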

⚡ Performance & Scalability

  • Parallel Processing: Multi-core indexing for large datasets
  • Modular Design: Extensible architecture for new features
  • Memory Efficient: Optimized for performance on standard hardware
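The multi-core indexing pattern amounts to: split the corpus into chunks, index each chunk independently, then merge the partial indexes. A sketch of that pattern (shown with a thread pool so it runs anywhere; a real multi-core build would swap in `multiprocessing.Pool` with the same `map` call):

```python
from multiprocessing.dummy import Pool  # thread-backed Pool; same API as multiprocessing.Pool
from collections import defaultdict

def index_chunk(chunk):
    """Build a partial term -> doc_ids index for one slice of the corpus."""
    partial = defaultdict(set)
    for doc_id, text in chunk:
        for term in text.lower().split():
            partial[term].add(doc_id)
    return partial

def parallel_index(corpus, workers=4):
    # Round-robin split so chunks are roughly equal in size.
    chunks = [corpus[i::workers] for i in range(workers)]
    with Pool(workers) as pool:
        partials = pool.map(index_chunk, chunks)
    # Merge the partial indexes into one.
    merged = defaultdict(set)
    for partial in partials:
        for term, ids in partial.items():
            merged[term] |= ids
    return merged
```

Because each chunk is indexed independently, the merge step is the only serial portion of the build.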

🚀 Getting Started

Prerequisites

  • Python 3.7+
  • 4GB+ RAM (8GB recommended for full dataset)
  • Internet connection for initial dataset download

Installation

  1. Clone the repository:

    git clone https://github.com/tejas242/ArQiv.git
    cd ArQiv
  2. Create a virtual environment:

    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies:

    pip install -r requirements.txt
  4. Download NLTK resources (one-time):

    import nltk
    nltk.download('stopwords')

📋 Usage

Command Line Interface

Launch the rich CLI interface:

python cli.py

Streamlit Web Interface

Start the interactive web application:

cd streamlit
streamlit run streamlit_app.py

Or use the hosted version:

https://arqiv-search.streamlit.app/

CLI Demo

πŸ—οΈ Architecture

ArQiv employs a layered architecture with the following components:

┌─────────────────────────────────┐
│         User Interfaces         │
│   CLI Interface  │  Streamlit   │
├─────────────────────────────────┤
│        Ranking Algorithms       │
│  BM25 │ TF-IDF │ Vector │ BERT  │
├─────────────────────────────────┤
│         Index Structures        │
│ Inverted Index │ Trie │ Bitmap  │
├─────────────────────────────────┤
│            Data Layer           │
│ Document Model │ Preprocessing  │
└─────────────────────────────────┘

📊 Benchmarks

| Task                   | Performance |
| ---------------------- | ----------- |
| Index 1,000 documents  | 0.8 seconds |
| Boolean search         | < 5 ms      |
| BM25 ranking           | ~50-100 ms  |
| TF-IDF ranking         | < 5 ms      |
| Fast Vector ranking    | < 5 ms      |
| BERT ranking           | ~200 ms     |

Measured on a Ryzen 3 CPU with 8 GB RAM.
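Numbers like these can be reproduced with a small timing harness; this is a generic sketch, not a benchmark script shipped with the repository:

```python
import time

def benchmark(fn, *args, repeats=5):
    """Return the best wall-clock time in seconds over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best

# Example: time a stand-in workload (sorting 100k integers).
elapsed = benchmark(sorted, list(range(100_000)))
```

Taking the best of several runs rather than the mean reduces noise from caches warming up and background processes.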

⚠️ Limitations & Challenges

While ArQiv is designed to be powerful and efficient, users should be aware of these limitations:

  • Memory Usage: The in-memory index requires approximately 50MB per 1,000 documents
  • Scalability Ceiling: Performance may degrade with datasets beyond 100,000 documents without distributed architecture
  • Language Support: Currently optimized for English text only
  • Neural Features: BERT-based ranking requires substantial CPU resources (GPU recommended)
  • Cold Start: First-time initialization has higher latency while building indices
  • Preprocessing Effects: Stemming may occasionally lead to unexpected term matches

For large-scale deployment scenarios, consider implementing a sharded index architecture or using a database backend for the inverted index.
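One way to sketch the sharded-index idea suggested above is to route each document to a shard by hashing its id, then fan queries out to every shard and merge. The `ShardedIndex` class here is hypothetical, not part of the current codebase:

```python
import hashlib
from collections import defaultdict

class ShardedIndex:
    """Route each document to one of N shards by hashing its id."""

    def __init__(self, num_shards=4):
        self.shards = [defaultdict(set) for _ in range(num_shards)]

    def _shard_for(self, doc_id):
        # Stable hash so a document always lands on the same shard.
        digest = hashlib.md5(doc_id.encode()).hexdigest()
        return int(digest, 16) % len(self.shards)

    def add(self, doc_id, text):
        shard = self.shards[self._shard_for(doc_id)]
        for term in text.lower().split():
            shard[term].add(doc_id)

    def search(self, term):
        # Fan out to every shard and merge the results; in a real
        # deployment the shards could live in separate processes or
        # on separate machines.
        results = set()
        for shard in self.shards:
            results |= shard.get(term.lower(), set())
        return results
```

Sharding by document id keeps writes independent per shard, at the cost of querying all shards; sharding by term would invert that trade-off.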

πŸ“ Project Structure

Click to expand directory structure
arqiv/
├── data/                     # Data handling components
│   ├── document.py           # Document model
│   ├── preprocessing.py      # Text processing utilities
│   └── loader.py             # Dataset loading
├── index/                    # Indexing structures
│   ├── inverted_index.py     # Main indexing engine
│   ├── trie.py               # Prefix tree for autocomplete
│   └── bitmap_index.py       # Bitmap for fast boolean ops
├── ranking/                  # Ranking algorithms
│   ├── bm25.py               # BM25 implementation
│   ├── tfidf.py              # TF-IDF with scikit-learn
│   ├── fast_vector_ranker.py # Vector-based ranking
│   └── bert_ranker.py        # Neural ranking with BERT
├── search/                   # Search functionality
│   └── fuzzy_search.py       # Approximate string matching
├── streamlit/                # Web interface
│   └── streamlit_app.py      # Streamlit application
├── docs/                     # Documentation
├── tests/                    # Test suite
├── cli.py                    # Command-line interface
└── README.md                 # This file

🔧 Troubleshooting

Common issues and solutions

Dataset Download Issues

  • Ensure Kaggle API credentials are set up correctly
  • For manual download: Place the arxiv-metadata-oai-snapshot.json in the data/ directory

Performance Problems

  • Slow searches: prefer BM25 or TF-IDF ranking over BERT for faster results
  • Slow indexing: increase the number of worker processes in the parallel indexing option

Memory Usage

  • If experiencing memory issues: Reduce sample_size parameter in load_arxiv_dataset()

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add some amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

πŸ“ License

This project is licensed under the MIT License - see the LICENSE file for details.


Built with ⚡ by Tejas Mahajan
