Table of Contents
- 1. Overview
- 2. Design & Development
- 3. Challenges
- 4. Future Enhancements
- 5. Project Structure
- 6. References
The average retail investor interested in stock market investing often faces an overwhelming amount of information, making it challenging to gain quick, actionable insights into a company's stock performance and public sentiment. This information overload can lead to decision paralysis or uninformed investment choices.
The Market Pulse dashboard addresses this issue with a streamlined, user-friendly interface that offers immediate insight into a company's stock performance and public sentiment from online forums such as Reddit. It gives investors a general sense of a company's performance relative to its industry and a guide for focusing their more in-depth research.
The target user is the average retail investor who is interested in stock investing and is beginning to gather information about a company they may want to invest in.
- Storage: AWS S3, Delta Lake, Parquet
- Data Processing: Databricks, Hugging Face, Apache Spark, dbt Core
- Data Visualization: Databricks Dashboards
- Orchestration: Astronomer/Airflow
- DevOps: GitHub, GitHub Actions
dashboard-demo.mp4
- API Rate Limits: The Reddit API has a rate limit of 100 queries per minute (QPM) per OAuth client ID on its free tier, so defined wait periods had to be introduced between requests to avoid reaching the limit.
- Infrastructure Integrations: Integrating Databricks with Astronomer and dbt was difficult because the Databricks instance was managed externally by the data bootcamp, so additional research and prototyping were required to ensure proper connectivity.
- Data Visualization Limitations: Databricks Dashboards was a simple and appealing choice for building the dashboard because it is well integrated with the Databricks platform, but its limited visualization options made it difficult to present the data, particularly the sentiment analysis data, in a compelling way. For example, displaying the sentiment data on the same visual as the stock price data would have made the relationship between the two easier to observe.
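The wait periods mentioned under API Rate Limits can be sketched as a small throttle that spaces requests evenly. This is an illustrative example only, not the project's actual extract code: the class name `MinIntervalThrottle`, the helper `throttled_map`, and the `fetch` callable are all hypothetical, and the only figure taken from the text above is the 100 QPM limit.

```python
import time
from typing import Callable, Iterable, List, Optional, TypeVar

T = TypeVar("T")
R = TypeVar("R")


class MinIntervalThrottle:
    """Hypothetical helper: enforces a minimum gap between consecutive calls."""

    def __init__(
        self,
        max_per_minute: int = 100,  # Reddit free-tier limit cited above
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        # 100 calls per minute works out to one call every 0.6 seconds.
        self.min_interval = 60.0 / max_per_minute
        self._clock = clock
        self._sleep = sleep
        self._next_allowed: Optional[float] = None

    def wait(self) -> float:
        """Block until the next call is allowed; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._next_allowed is not None and now < self._next_allowed:
            delay = self._next_allowed - now
            self._sleep(delay)
            now = self._clock()  # re-read in case the sleep overshot
        self._next_allowed = now + self.min_interval
        return delay


def throttled_map(
    fetch: Callable[[T], R], items: Iterable[T], throttle: MinIntervalThrottle
) -> List[R]:
    """Call `fetch` (an assumed API-request function) once per item, throttled."""
    results = []
    for item in items:
        throttle.wait()
        results.append(fetch(item))
    return results
```

In practice it may be safer to configure the throttle slightly below the documented limit (for example `MinIntervalThrottle(max_per_minute=90)`), since evenly spaced calls at exactly 100 QPM leave no margin for retries or clock jitter.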
- Additional Data Sources: Integrate source data from additional social media networks (e.g. Twitter, Bluesky) for a more comprehensive and less biased sentiment analysis.
- Additional Reddit Data: Add post data from additional stock/investing-related subreddits. Currently, data is scraped from four of the top stock/investing subreddits, but this could be expanded to the top ten.
- Alternative Data Visualization Application: To improve the data visualization component of the solution, a more fully featured application such as Apache Superset or Tableau could be considered. This would offer more compelling visuals to choose from, extracting more value from the data and improving the user experience.
market-pulse
│
├── .github
│ └── workflows -> GitHub Actions
├── astro -> Astronomer project
│ ├── Dockerfile -> The Astronomer Dockerfile for development and deployment
│ └── dags -> Directed Acyclic Graphs (DAGs) for Extract-Transform-Load (ETL) scripts
├── dashboard -> Databricks Dashboard
├── docs
│ └── images -> Images for README
├── etl
│ ├── ddls -> Table definitions for all data lakehouse tables
│ ├── extract -> Scripts to extract data from source systems
│ ├── load -> Scripts to load transformed data into the data lakehouse
│ ├── tests -> Scripts for testing data quality
│ ├── transform -> Scripts to transform data within the data lakehouse
│ └── utils -> Utility scripts used across the entire ETL process
├── market_pulse_dbt -> dbt project
│ ├── macros -> dbt macros
│ ├── models
│ │ └── gold -> dbt models for gold layer
│ └── tests
│     └── generic -> Custom generic tests
├── LICENSE.md
├── README.md
└── requirements.txt