High-performance semantic search for ArXiv research papers
Features • Getting Started • Usage • Architecture • Benchmarks • Live Demo
ArQiv is a state-of-the-art search engine designed specifically for the ArXiv research corpus. It combines multiple indexing strategies and ranking algorithms to deliver lightning-fast, relevant results that help researchers discover papers more efficiently.
By default, only the first 1,000 documents are loaded. Change the `sample_size` in `loader.py` to load more documents or the full dataset.
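The sampling behavior can be pictured with a minimal sketch. The real `loader.py` reads the Kaggle ArXiv metadata file; the `load_sample` function below is a hypothetical stand-in for illustration only, not ArQiv's actual code:

```python
import json
from itertools import islice

def load_sample(lines, sample_size=1000):
    """Parse at most sample_size JSON-lines records; None loads everything."""
    take = islice(lines, sample_size) if sample_size else lines
    return [json.loads(line) for line in take]

# Toy stand-in for the JSON-lines metadata file (one record per line)
raw = [json.dumps({"id": str(i), "title": f"Paper {i}"}) for i in range(5)]
docs = load_sample(raw, sample_size=3)
print(len(docs))  # 3
```

Passing `sample_size=None` in this sketch would parse every line, which corresponds to loading the full dataset.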
- Python 3.7+
- 4GB+ RAM (8GB recommended for full dataset)
- Internet connection for initial dataset download
1. Clone the repository:

   ```bash
   git clone https://github.com/tejas242/ArQiv.git
   cd arqiv
   ```

2. Create a virtual environment:

   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

3. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

4. Download NLTK resources (one-time):

   ```python
   import nltk
   nltk.download('stopwords')
   ```
Launch the rich CLI interface:

```bash
python cli.py
```

Start the interactive web application:

```bash
cd streamlit
streamlit run streamlit_app.py
```

Or use the hosted version: https://arqiv-search.streamlit.app/
ArQiv employs a layered architecture with the following components:
```
┌─────────────────────────────────┐
│         User Interfaces         │
│    CLI Interface │ Streamlit    │
├─────────────────────────────────┤
│       Ranking Algorithms        │
│  BM25 │ TF-IDF │ Vector │ BERT  │
├─────────────────────────────────┤
│        Index Structures         │
│ Inverted Index │ Trie │ Bitmap  │
├─────────────────────────────────┤
│           Data Layer            │
│ Document Model │ Preprocessing  │
└─────────────────────────────────┘
```
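To make the layers concrete, here is a toy end-to-end pass through the stack: trivial preprocessing (data layer), an inverted index (index layer), and AND-style boolean search on top of it. This is an illustrative sketch of the idea, not ArQiv's actual classes:

```python
from collections import defaultdict

corpus = {
    1: "fast neural search for arxiv papers",
    2: "bm25 ranking for document search",
    3: "neural ranking with bert",
}

# Data layer: minimal preprocessing
def tokenize(text):
    return text.lower().split()

# Index layer: term -> set of doc ids (an inverted index)
index = defaultdict(set)
for doc_id, text in corpus.items():
    for term in tokenize(text):
        index[term].add(doc_id)

# Search layer: boolean AND over the postings of each query term
def boolean_and(query):
    postings = [index[t] for t in tokenize(query)]
    return set.intersection(*postings) if postings else set()

print(sorted(boolean_and("neural search")))  # [1]
```

Because every postings list is a set, AND queries reduce to set intersections, which is why boolean search is the fastest row in the benchmarks below.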
| Task | Performance |
|------|-------------|
| Index 1,000 documents | 0.8 seconds |
| Boolean search | < 5 ms |
| BM25 ranking | ~50–100 ms |
| TF-IDF ranking | < 5 ms |
| Fast Vector ranking | < 5 ms |
| BERT ranking | ~200 ms |

*Measurements on a Ryzen 3 CPU with 8 GB RAM.*
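For reference, the quantity the BM25 row is measuring is the classic Okapi BM25 score. The following is a textbook-style, self-contained sketch with the common `k1`/`b` defaults, not the code in ArQiv's `bm25.py`:

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """docs: list of token lists. Returns one BM25 score per document."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter(t for d in docs for t in set(d))  # document frequencies
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            # Term-frequency saturation with length normalization
            score += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(score)
    return scores

toy_docs = [["neural", "search"], ["bm25", "ranking"], ["neural", "ranking", "bert"]]
scores = bm25_scores(["neural"], toy_docs)
# Documents 0 and 2 contain "neural"; document 1 scores 0.
```

BM25 iterates over every candidate document's term frequencies, which is why it costs more per query than the precomputed boolean and TF-IDF paths.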
While ArQiv is designed to be powerful and efficient, users should be aware of these limitations:
- Memory Usage: The in-memory index requires approximately 50 MB per 1,000 documents
- Scalability Ceiling: Performance may degrade with datasets beyond 100,000 documents without distributed architecture
- Language Support: Currently optimized for English text only
- Neural Features: BERT-based ranking requires substantial CPU resources (GPU recommended)
- Cold Start: First-time initialization has higher latency while building indices
- Preprocessing Effects: Stemming may occasionally lead to unexpected term matches
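The last point is easy to reproduce with any suffix-stripping stemmer. A toy rule (not ArQiv's NLTK-based pipeline) already shows the kind of collision involved:

```python
def toy_stem(word):
    """Naive suffix stripping in the spirit of Porter-style stemmers."""
    for suffix in ("ity", "e", "s"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

# "universe" and "university" collapse to the same stem, so a query
# for one can match documents about the other.
print(toy_stem("universe"), toy_stem("university"))  # univers univers
```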
For large-scale deployment scenarios, consider implementing a sharded index architecture or using a database backend for the inverted index.
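A sharded design can be as simple as hashing document ids into per-shard inverted indexes and merging per-shard postings at query time. The single-process sketch below only illustrates the partitioning idea; a real deployment would place shards on separate machines or behind a database backend:

```python
from collections import defaultdict

N_SHARDS = 4

def shard_of(doc_id):
    # Hash-partition documents across shards
    return hash(doc_id) % N_SHARDS

shards = [defaultdict(set) for _ in range(N_SHARDS)]

def index_doc(doc_id, text):
    for term in text.lower().split():
        shards[shard_of(doc_id)][term].add(doc_id)

def search(term):
    # Fan out to every shard, then merge the postings
    return set().union(*(s[term] for s in shards))

index_doc("paper-1", "neural search")
index_doc("paper-2", "bm25 search")
print(sorted(search("search")))  # ['paper-1', 'paper-2']
```

Sharding by document id keeps each shard's index small enough to fit in memory, at the cost of querying every shard; sharding by term is the usual alternative when query fan-out becomes the bottleneck.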
Click to expand directory structure
```
arqiv/
├── data/                     # Data handling components
│   ├── document.py           # Document model
│   ├── preprocessing.py      # Text processing utilities
│   └── loader.py             # Dataset loading
├── index/                    # Indexing structures
│   ├── inverted_index.py     # Main indexing engine
│   ├── trie.py               # Prefix tree for autocomplete
│   └── bitmap_index.py       # Bitmap for fast boolean ops
├── ranking/                  # Ranking algorithms
│   ├── bm25.py               # BM25 implementation
│   ├── tfidf.py              # TF-IDF with scikit-learn
│   ├── fast_vector_ranker.py # Vector-based ranking
│   └── bert_ranker.py        # Neural ranking with BERT
├── search/                   # Search functionalities
│   └── fuzzy_search.py       # Approximate string matching
├── streamlit/                # Web interface
│   └── streamlit_app.py      # Streamlit application
├── docs/                     # Documentation
├── tests/                    # Test suite
├── cli.py                    # Command-line interface
└── README.md                 # This file
```
Common issues and solutions
- Ensure Kaggle API credentials are set up correctly
- For manual download: place `arxiv-metadata-oai-snapshot.json` in the `data/` directory
- If search is slow: try BM25 or TF-IDF ranking for faster results
- If indexing is slow: increase the number of worker processes in the parallel option
- If experiencing memory issues: reduce the `sample_size` parameter in `load_arxiv_dataset()`
Contributions are welcome! Please feel free to submit a Pull Request.
1. Fork the repository
2. Create your feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add some amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
Built with ⚡ by Tejas Mahajan