NYT_most_written_topics

Collects NYT data and finds the topics most written about using the keywords in the article through Hadoop Map Reduce framework.

Creating the dataset

Before you begin, you will need an API key for the NYT Article Search API. You can sign up for one here. For further documentation on the API results, see the NYT developer site or the Article Search API reference. Once you have the API key, edit nyt_api_key.py and store the key as a string in place of Your_API_Key.
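For orientation, here is a minimal sketch of how a request to the Article Search API can be assembled and how keywords can be read from one result. It is not taken from getNYTarticleData.py. The helper names `build_search_url` and `extract_keywords` are hypothetical, and the endpoint, query parameters, and the `web_url` / `keywords` fields are assumed to match the public Article Search v2 documentation.

```python
from urllib.parse import urlencode

# Article Search v2 endpoint, as listed in the public NYT developer docs.
ARTICLE_SEARCH_URL = "https://api.nytimes.com/svc/search/v2/articlesearch.json"


def build_search_url(api_key, begin_date, end_date, page=0):
    """Return the request URL for one page of results between two dates.

    Dates use the YYYYMMDD form shown in this README (e.g. "20180301").
    """
    params = {
        "begin_date": begin_date,
        "end_date": end_date,
        "page": page,
        "api-key": api_key,
    }
    return ARTICLE_SEARCH_URL + "?" + urlencode(params)


def extract_keywords(doc):
    """Return (url, [keyword values]) for one entry of response.docs."""
    url = doc.get("web_url", "")
    keywords = [kw["value"] for kw in doc.get("keywords", []) if "value" in kw]
    return url, keywords


if __name__ == "__main__":
    # "Your_API_Key" is the placeholder named in this README.
    print(build_search_url("Your_API_Key", "20180301", "20180331"))
```

The API returns results one page at a time, so a full download would loop over `page` until no more documents come back.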

You can now run getNYTarticleData.py to retrieve the dataset for a given start date (-b) and end date (-e), both in YYYYMMDD format, writing the result to the file named with -o. This works well for time ranges of approximately a month. Example:

$ python getNYTarticleData.py -b 20180301 -e 20180331 -o nytDataMar2018.txt
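Since the script works best on ranges of about a month, a longer period can be split into month-sized pieces and fetched one piece at a time. The sketch below is one way to generate those pieces. The function `month_ranges` and the output file name pattern are hypothetical and not part of this repository.

```python
import calendar
from datetime import date


def month_ranges(start, end):
    """Yield (begin, end) YYYYMMDD strings, one pair per calendar month."""
    current = start
    while current <= end:
        last_day = calendar.monthrange(current.year, current.month)[1]
        chunk_end = min(date(current.year, current.month, last_day), end)
        yield current.strftime("%Y%m%d"), chunk_end.strftime("%Y%m%d")
        # Move to the first day of the following month.
        if current.month == 12:
            current = date(current.year + 1, 1, 1)
        else:
            current = date(current.year, current.month + 1, 1)


if __name__ == "__main__":
    for begin, finish in month_ranges(date(2018, 1, 15), date(2018, 3, 31)):
        # The output file name pattern is illustrative only.
        print(f"python getNYTarticleData.py -b {begin} -e {finish} "
              f"-o nytData_{begin}_{finish}.txt")
```

Each printed command can then be run separately, and the resulting files uploaded to HDFS together.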

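Once the dataset has been created, put it in HDFS with the hdfs put command. Make sure the Hadoop bin directory is in your $PATH. First create a directory in HDFS for the dataset. The leading slash matters: without it, the directory is created under your HDFS home directory rather than at /Hadoop_NYT, which is the path used in the commands that follow.

$ hdfs dfs -mkdir /Hadoop_NYT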

$ hdfs dfs -put nytDataMar2018.txt /Hadoop_NYT

Running the MapReduce application to find most popular keyword(s)

After you have created the dataset of NY Times articles in your desired time range and put it in HDFS, you can run the KeywordNytCountDriver program to find the most popular keyword(s) in your dataset.

How to use

Make sure the Hadoop bin directory is in your $PATH. To set the HADOOP_CLASSPATH variable, run the following:

$ export HADOOP_CLASSPATH=$(hadoop classpath)

Now compile the driver, package the classes into a jar file, and run the job:

$ javac -classpath ${HADOOP_CLASSPATH} KeywordNytCountDriver.java

$ jar cf nyt.jar Keyword*.class

$ hadoop jar nyt.jar KeywordNytCountDriver /Hadoop_NYT/nytDataMar2018.txt output/

Replace /Hadoop_NYT/nytDataMar2018.txt with your input file path.

To see the result, run the following command:

$ hadoop fs -cat output/part-r-00000

The result will be the keyword with the highest number of articles written about it. If there are multiple keywords with the same highest number of articles, all of them will be returned. The result is in the following format:

<keyword> <url-1> <url-2> <url-3> <url-4> <url-5> <number-of-articles-with-keyword>

The five URLs returned with the keyword are the first five article URLs the program found during the keyword counting phase. It will return fewer than five URLs if the keyword doesn't have five articles.
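The behaviour described above can be sketched in plain Python as an aid to reading the output. This is not the Java MapReduce code in KeywordNytCountDriver. The names `top_keywords` and `format_result` are hypothetical, and the sketch assumes each article is available as a `(url, [keywords])` pair and is counted at most once per keyword.

```python
from collections import defaultdict

MAX_URLS = 5  # the README states that up to five URLs accompany a keyword


def top_keywords(articles):
    """Return (keyword, urls, count) for every keyword tied for the top count.

    articles: iterable of (url, [keywords]) pairs.
    """
    counts = defaultdict(int)
    urls = defaultdict(list)
    for url, keywords in articles:
        # Assumption: an article counts once per keyword, even if repeated.
        for keyword in set(keywords):
            counts[keyword] += 1
            if len(urls[keyword]) < MAX_URLS:
                urls[keyword].append(url)  # keep only the first few URLs seen
    if not counts:
        return []
    best = max(counts.values())
    return [(k, urls[k], n) for k, n in sorted(counts.items()) if n == best]


def format_result(keyword, keyword_urls, count):
    """Lay out one result as: keyword, URLs, article count."""
    return " ".join([keyword, *keyword_urls, str(count)])


if __name__ == "__main__":
    sample = [("u1", ["A", "B"]), ("u2", ["A"]), ("u3", ["B", "C"])]
    for row in top_keywords(sample):
        print(format_result(*row))
```

In the sample, keywords A and B are tied with two articles each, so both are reported, each with fewer than five URLs.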
