GitHub - duzx16/SearchTHU: Course project for Fundamentals of Search Engine Technology in Tsinghua University

duzx16/SearchTHU

SearchTHU

Source code for the course project of Fundamentals of Search Engine Technology at Tsinghua University, by Zhengxiao Du and Zhouxing Shi.

Requirements

  • Python 3.5+
  • Java 1.8
  • Node 10+
  • npm 6+
  • Tomcat 9.0.21
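
As a rough aid for checking the requirements above, the following sketch parses a tool's version banner and compares it against a minimum. The helper names `parse_version` and `meets_minimum` and the `MINIMUMS` table are illustrative only and are not part of this repository. Java and Tomcat are left out of the table because the list names a specific version for them rather than a lower bound.

```python
import re

# Lower bounds taken from the requirements list; the dict itself is illustrative.
MINIMUMS = {"python": (3, 5), "node": (10,), "npm": (6,)}


def parse_version(text):
    """Return the first dotted version number found in `text` as a tuple of ints."""
    match = re.search(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError("no version number found in %r" % (text,))
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(text, minimum):
    """True if the version found in `text` is at least `minimum`."""
    return parse_version(text) >= tuple(minimum)
```

The input would typically be the banner printed by commands such as `node --version` or `python --version`. Note that `java -version` writes its banner to standard error rather than standard output.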

Deployment Guide

Data

The data consist of .html, .pdf, .docx and .doc files crawled with Heritrix. The crawl is too large (>20 GB) to include here, but it can be provided upon request.

Preprocessing

To preprocess files under the <dir> directory, run:

cd preprocess
pip install -r requirements.txt
python DocParser.py <dir>

For each input file at <path>, the script writes a processed JSON file to <path>.json.
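
Because every input maps to `<path>.json`, a small script can report which files still lack their output, which may be useful after an interrupted run on a large crawl. This is a minimal sketch under two assumptions that the README does not confirm: that the parser descends into subdirectories, and that it handles exactly the four extensions listed under Data. The function names are hypothetical and do not come from `DocParser.py`.

```python
import os

# Extensions listed in the Data section; assumed to match what the parser handles.
SUPPORTED_EXTENSIONS = (".html", ".pdf", ".docx", ".doc")


def expected_output(path):
    """Output location described above: the input path with '.json' appended."""
    return path + ".json"


def unprocessed_files(root):
    """Return supported files under `root` that have no '<path>.json' next to them."""
    pending = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(SUPPORTED_EXTENSIONS):
                path = os.path.join(dirpath, name)
                if not os.path.exists(expected_output(path)):
                    pending.append(path)
    return sorted(pending)
```

Calling `unprocessed_files("<dir>")` after preprocessing should return an empty list if every supported file was converted.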

Backend

First, build the index by running:

java -jar backend/out/artifacts/Indexer_jar/SearchTHU.jar <data_dir> <index_dir>

Then copy backend/out/artifacts/SearchTHU_war/SearchTHU_war.war into the webapps/ directory of the Tomcat installation, and start the Tomcat service.

The API is served at http://hostname:port/SearchTHU_war/. See doc/api.md for the API details.
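
To confirm that the deployed WAR responds, something like the sketch below could build the base URL and probe it. The helpers `api_base_url` and `probe` are hypothetical. Only the `/SearchTHU_war/` context path comes from this README, and no specific endpoint is assumed (those are documented in doc/api.md).

```python
import urllib.error
import urllib.request
from urllib.parse import urlunsplit


def api_base_url(hostname, port, scheme="http"):
    """Build the base URL of the API from the Tomcat host and port."""
    netloc = "%s:%d" % (hostname, port)
    return urlunsplit((scheme, netloc, "/SearchTHU_war/", "", ""))


def probe(url, timeout=5.0):
    """Return the HTTP status code for `url`, or None if no connection could be made."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.getcode()
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (for example 404).
        return err.code
    except OSError:
        # Connection refused, DNS failure, timeout, and similar failures.
        return None
```

A return value of `None` suggests that Tomcat is not listening at that address, while a 404 suggests that Tomcat is running but the WAR was not deployed under the expected name.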

Frontend

To start the frontend service:

cd frontend
npm install
npm start

Then open http://localhost:8080 in a browser to start using the application.
