The goal of this project is to explore Reddit dog communities using natural language processing and unsupervised learning. Using one year of data from the seven most-subscribed Reddit dog breed communities, I applied NLP and topic modeling techniques to identify the topics most commonly discussed across these subreddits, and derived "doggolingo" terms from the corpus. Finally, I built an app that lets users explore the meaning of different doggolingo terms.
- Pulled all 2019 post and comment data for the seven most-subscribed dog breed subreddits from Google's BigQuery, covering roughly 80K Reddit posts in total.
- Used spaCy pipelines to preprocess the text corpus.
- Ran topic modeling on the corpus using count and TF-IDF vectorization with LSA, NMF, LDA, and CorEx.
- Used a different spaCy pipeline to preprocess the corpus and derive "doggolingo" words.
- Built a doggolingo exploration app using Streamlit.
- all_breeds_topic_modeling notebook: data preprocessing and topic modeling.
- doggolingo notebook: data processing to derive "doggolingo" and wordcloud generation, as well as data cleaning and preparation for streamlit app.
- presentation pdf: final presentation for the Metis program.
- See the doggolingo explained repo for code and final data for the streamlit app.