The project is based on docker. So in order to run it, one must install it. Here are descriptions for installing Docker on Mac, Linux and Windows.
- Install docker and docker-compose
- Download the data files from here and un-tar it.
- Unzip the "gaming" zip-file and place it in resources so that the path is
like
resources/gaming/<all the json-files>
- Run Elasticsearch by issuing
docker-compose up -d elasticsearch
- Build all the packages by issuing
docker-compose build
- Index the questions by issuing
docker-compose run --rm indexer gaming answers
- Index the questions by issuing
docker-compose run --rm indexer gaming questions
- Run the server by issuing
docker-compose up -d server
- Open the html-client (client/index.html) in a web-browser and search for something.
From project description:
There are so many forums out there where people answer each other’s questions. Do we really still need people to do that? Aren't most questions answered already? For some topics it sure feels like all questions must have been answered by now. By eliminating the need for human interaction, your answer should be available way quicker! The idea behind this project is to create a page where you state your question and get an automatic answer (within a specific area) automatically, by indexing questions and answers already written by others in a forum.
The assignment:
- Fetch questions and answers ( I have a large batch of familjeliv.se data collected that can be used).
- Use elasticsearch ( https://github.com/elastic/elasticsearch), or another search engine of your choice, to index the questions and answers.
- Build a query analyzer that can help you create a query to your search engine by, for example, selecting the most important words, or comparing the entir e question to questions in the index.
- Query your search engine and select the best answer available.
- Present answer in a web interface.
An interesting extension is to try to devise a scoring function that determines how sure the system is of the answer.
indexer/ contains the part that reads raw data and puts it in Elasticsearch.
server/ contains the actual searcher, it also contains some form of http-server in order to be able to communicate with the client.
client/ contains the front-end that connects to the server and displays the answer to a given question.
report/ contains the report (for now written in LaTeX)
Docker should be able to help us run things without getting weird dependency bugs. Setting up docker and running the hello world takes 30 minutes at most. This page was very helpful in explaining how to write Dockerfiles and how to run containers.
To compile with docker-compose one can issue the command docker-compose build <packagename>
. In order to build everything just leave out the package name.
To get Elasticsearch up and running install docker-compose and issue the command:
docker-compose start elasticsearch
in the root of the project (where the docker-compose.yml file is).
Note: The start
command might not work first time running the docker-compose,
in that case try docker-compose up -d elasticsearch
.
To run indexer, make sure that Elasticsearch container is running. This can be
done by issuing docker ps
.
To index, check docker-compose run --rm indexer --help
Note the --rm
this flag is present so that docker does not create one container per execution but removes it after indexer has shut down.
Please download CQADupStack, put the uncompressed to resources (see the indexer.py, and you'll know what I mean)
-
Use topic model (trained offline) to decide the probability over all forums that the query might belong in, and search those forums which probability > threshold (but need to index all 14 forums first, which might take roughly 3 hr)
-
Well-design the query in
find_similar_query()
-
Use link_analysis like
PageRank
to give global score of each user(id), consider this to our question ranking -
Discard comment at the moment, maybe they are useful?
- Scripts for querying CQADupStack data
- Supervised Learning of Universal Sentence Representations