Collects New York Times article data and finds the topics most written about, based on the keywords attached to each article, using the Hadoop MapReduce framework.
Before you begin, you will need an API key for the NYT Article Search API. You can sign up for one at https://developer.nytimes.com/. For further documentation on the API results, see the NYT Developer site or the Article Search API reference. Once you have your API key, edit nyt_api_key.py and store the key as a string in place of Your_API_Key.
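After editing, nyt_api_key.py should contain little more than the key itself. A minimal sketch (the exact variable name in the repo may differ):

```python
# nyt_api_key.py
# Replace the placeholder with the key from your NYT developer account.
API_KEY = "Your_API_Key"
```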
You can now run the Python script to retrieve the dataset for a given start and end date. This works well for time ranges of approximately a month. Example:
$ python getNYTarticleData.py -b 20180301 -e 20180331 -o nytDataMar2018.txt
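Under the hood, a script like this pages through the NYT Article Search API between the begin and end dates. A minimal sketch of one such request, assuming the standard Article Search endpoint, the requests library, and the API_KEY variable from above (not necessarily how getNYTarticleData.py is implemented):

```python
import requests

from nyt_api_key import API_KEY  # assumed variable name, see above

# Fetch one page of results (the API returns 10 articles per page).
resp = requests.get(
    "https://api.nytimes.com/svc/search/v2/articlesearch.json",
    params={
        "begin_date": "20180301",  # same YYYYMMDD format as the -b flag
        "end_date": "20180331",    # same YYYYMMDD format as the -e flag
        "api-key": API_KEY,
        "page": 0,                 # increment to page through more results
    },
)
resp.raise_for_status()
for doc in resp.json()["response"]["docs"]:
    # Each document carries its URL and a list of keyword objects.
    print(doc["web_url"], [kw["value"] for kw in doc["keywords"]])
```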
Once the dataset has been created, you will need to put it in HDFS. This can be done via the hdfs dfs -put command. Make sure the Hadoop bin directory is in your $PATH. You can create a directory in HDFS for this:
$ hdfs dfs -mkdir /Hadoop_NYT
$ hdfs dfs -put nytDataMar2018.txt /Hadoop_NYT
After you have created the dataset of NY Times articles in your desired time range and put it in HDFS, you can run the KeywordNytCountDriver program to find the most popular keyword(s) in your dataset.
Make sure the Hadoop bin directory is in your $PATH. To set the HADOOP_CLASSPATH variable, run the following:
$ export HADOOP_CLASSPATH=$(hadoop classpath)
Now you can compile the Java source and package it into a jar:
$ javac -classpath ${HADOOP_CLASSPATH} KeywordNytCountDriver.java
$ jar cf nyt.jar Keyword*.class
$ hadoop jar nyt.jar KeywordNytCountDriver /Hadoop_NYT/nytDataMar2018.txt output/
Replace /Hadoop_NYT/nytDataMar2018.txt with your input file path.
To see the result, run the following command:
$ hadoop fs -cat output/part-r-00000
The result is the keyword with the highest number of articles written about it. If multiple keywords tie for the highest count, all of them are returned. Each result line has the following format:
<keyword> <url-1> <url-2> <url-3> <url-4> <url-5> <number-of-articles-with-keyword>
The five URLs returned with the keyword are the first five article URLs the program encountered during the keyword-counting phase. Fewer than five URLs are returned if the keyword appears in fewer than five articles.
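If you want to post-process the output programmatically, each line in this format can be split apart. A minimal Python sketch, assuming Hadoop's default tab separator between the key and the rest of the line and space-separated URLs and count within it (these delimiters are an assumption, not confirmed by the program):

```python
def parse_result_line(line):
    # Key and value are assumed tab-separated (Hadoop TextOutputFormat default);
    # the URLs and the trailing article count are assumed space-separated.
    keyword, _, rest = line.rstrip("\n").partition("\t")
    parts = rest.split()
    return keyword, parts[:-1], int(parts[-1])

# Example: read a copy of the result fetched with `hadoop fs -cat ... > result.txt`
with open("result.txt") as f:
    for line in f:
        keyword, urls, count = parse_result_line(line)
        print(f"{keyword}: {count} articles, e.g. {urls[0] if urls else 'n/a'}")
```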