Collects New York Times article data and finds the topics most written about, based on the keywords attached to each article, using the Hadoop MapReduce framework.
Before you begin, you will need an API key for the NYT Article Search API. You can sign up for one at https://developer.nytimes.com/. For further documentation on the API results, see the NYT Developer site or the Article Search API reference. Once you have your API key, edit nyt_api_key.py and store the key as a string in place of Your_API_Key.
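After editing, nyt_api_key.py should contain little more than the key itself. A minimal sketch (the exact variable name in the repo may differ):

```python
# nyt_api_key.py
# Replace the placeholder with the key from your NYT developer account.
API_KEY = "Your_API_Key"
```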
You can now run the Python script to retrieve the dataset for a given start and end date. This works well for time ranges of approximately a month. Example:
$ python getNYTarticleData.py -b 20180301 -e 20180331 -o nytDataMar2018.txt
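Under the hood, a script like this pages through the NYT Article Search API between the begin and end dates. A minimal sketch of one such request, assuming the standard Article Search endpoint, the requests library, and the API_KEY variable from above (not necessarily how getNYTarticleData.py is implemented):

```python
import requests

from nyt_api_key import API_KEY  # assumed variable name, see above

# Fetch one page of results (the API returns 10 articles per page).
resp = requests.get(
    "https://api.nytimes.com/svc/search/v2/articlesearch.json",
    params={
        "begin_date": "20180301",  # same YYYYMMDD format as the -b flag
        "end_date": "20180331",    # same YYYYMMDD format as the -e flag
        "api-key": API_KEY,
        "page": 0,                 # increment to page through more results
    },
)
resp.raise_for_status()
for doc in resp.json()["response"]["docs"]:
    # Each document carries its URL and a list of keyword objects.
    print(doc["web_url"], [kw["value"] for kw in doc["keywords"]])
```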
Once the dataset has been created, you will need to put it in HDFS. This can be done via the hdfs dfs -put command. Make sure the Hadoop bin directory is in your $PATH. You can create a directory in HDFS for this:
$ hdfs dfs -mkdir /Hadoop_NYT
$ hdfs dfs -put nytDataMar2018.txt /Hadoop_NYT
After you have created the dataset of NY Times articles in your desired time range and put it in HDFS, you can run the KeywordNytCountDriver program to find the most popular keyword(s) in your dataset.
Make sure the Hadoop bin directory is in your $PATH. To set the HADOOP_CLASSPATH variable, run the following:
$ export HADOOP_CLASSPATH=$(hadoop classpath)
Now you can compile the Java source and package it into a jar:
$ javac -classpath ${HADOOP_CLASSPATH} KeywordNytCountDriver.java
$ jar cf nyt.jar Keyword*.class
$ hadoop jar nyt.jar KeywordNytCountDriver /Hadoop_NYT/nytDataMar2018.txt output/
Replace /Hadoop_NYT/nytDataMar2018.txt with your input file path.
To see the result, run the following command:
$ hadoop fs -cat output/part-r-00000
The result is the keyword with the highest number of articles written about it. If multiple keywords tie for the highest count, all of them are returned. Each result line has the following format:
<keyword> <url-1> <url-2> <url-3> <url-4> <url-5> <number-of-articles-with-keyword>
The five URLs returned with the keyword are the first five article URLs the program encountered during the keyword-counting phase. Fewer than five URLs are returned if the keyword appears in fewer than five articles.
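If you want to post-process the output programmatically, each line in this format can be split apart. A minimal Python sketch, assuming Hadoop's default tab separator between the key and the rest of the line and space-separated URLs and count within it (these delimiters are an assumption, not confirmed by the program):

```python
def parse_result_line(line):
    # Key and value are assumed tab-separated (Hadoop TextOutputFormat default);
    # the URLs and the trailing article count are assumed space-separated.
    keyword, _, rest = line.rstrip("\n").partition("\t")
    parts = rest.split()
    return keyword, parts[:-1], int(parts[-1])

# Example: read a copy of the result fetched with `hadoop fs -cat ... > result.txt`
with open("result.txt") as f:
    for line in f:
        keyword, urls, count = parse_result_line(line)
        print(f"{keyword}: {count} articles, e.g. {urls[0] if urls else 'n/a'}")
```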