This project is part of the IBM Data Science Professional Certificate on Coursera. It focuses on analyzing data from the Greater Taipei area to identify the best locations for opening a new venue, such as a restaurant, based on factors like population density, economic diversity, and existing venues.
- Project Overview
- Files and Resources
- Data Collection
- Methodology
- Results
- Technologies Used
- Installation
- Conclusion
Taipei is one of the most densely populated cities in the world, and it presents significant business opportunities for new venues. The goal of this project is to explore the best districts in Greater Taipei for opening a venue by using data analysis and machine learning techniques. The project involves determining which districts have the highest population density, economic diversity, and business potential.
Key questions explored:
- Which districts have the highest population density?
- Which districts have the greatest economic diversity?
- Which districts have the most potential for new businesses?
The repository includes the following files:
- `README.md`: This file contains the project description and setup instructions.
- `The Battle of the Neighborhoods I.ipynb`: The Jupyter notebook for Week 1, where data collection and initial exploration were done.
- `The Battle of the Neighborhoods II.ipynb`: The Jupyter notebook for Week 2, focusing on data processing, visualization, and clustering.
- `The Battle of the neighborhoods.pdf`: The final report summarizing the project findings and analysis.
- `The Battle of the neighborhoods_PPT.pdf`: A presentation summarizing the results of the analysis.
The data used in this project includes:
- Geographic Data: Collected from Taiwan’s government data platform and Wikipedia for municipal information.
- Venue Data: Retrieved via the Foursquare API to analyze existing venues in the districts.
- Demographic Data: Scraped and processed from Wikipedia and government platforms.
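As a rough illustration of the venue retrieval step, the sketch below builds a request URL for a single district centre. It assumes the legacy Foursquare v2 `venues/explore` endpoint that was common in the Coursera capstone; Foursquare has since moved to a newer Places API, so the endpoint may no longer be available, and the notebooks may call it differently. The function name `build_explore_url`, the coordinates, and the placeholder credentials are all hypothetical.

```python
from urllib.parse import urlencode

# Assumption: legacy Foursquare v2 endpoint; check the notebooks for the actual call.
EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(lat, lng, client_id, client_secret,
                      version="20180605", radius=500, limit=100):
    """Return a venue-search URL for the area around (lat, lng)."""
    params = {
        "client_id": client_id,          # placeholder credential
        "client_secret": client_secret,  # placeholder credential
        "v": version,                    # API version date (assumed value)
        "ll": f"{lat},{lng}",            # latitude,longitude of the district centre
        "radius": radius,                # search radius in metres
        "limit": limit,                  # maximum venues to return
    }
    return f"{EXPLORE_URL}?{urlencode(params)}"


# Illustrative coordinates only; no request is sent here.
url = build_explore_url(25.0330, 121.5654, "YOUR_CLIENT_ID", "YOUR_CLIENT_SECRET")
print(url)
```

The resulting URL could then be passed to `requests.get(url).json()` once real credentials are supplied.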
- Data Cleaning: Python's `pandas` library was used to clean and organize the demographic data.
- Population Density Analysis: Population density was calculated for each district, and geographic data was visualized using the `folium` library.
- Venue Data Analysis: Venue data was collected from the Foursquare API, categorized, and analyzed using K-means clustering to identify areas with diverse economic activity.
- Clustering: K-means clustering was applied to group the districts into 8 clusters based on their venue types. The elbow method helped determine the optimal number of clusters.
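The clustering step can be sketched roughly as follows. This is a minimal example under stated assumptions, not the notebooks' actual code: the `venues` table, its district labels, and its categories are invented for illustration, and the feature construction (one-hot encoding averaged per district) is one common approach that the project may or may not follow exactly. The elbow method is represented by collecting the inertia for each candidate `k`, which would normally be plotted and inspected by eye.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical venue records: one row per venue returned for a district.
venues = pd.DataFrame({
    "district": ["A", "A", "A", "B", "B", "C", "C", "C", "D", "D"],
    "category": ["Cafe", "Cafe", "Noodle House", "Night Market", "Cafe",
                 "Park", "Park", "Temple", "Noodle House", "Night Market"],
})

# One-hot encode the categories, then average per district so that each
# district is described by the relative frequency of each venue type.
onehot = pd.get_dummies(venues["category"], dtype=float)
onehot["district"] = venues["district"]
profiles = onehot.groupby("district").mean()

# Elbow method: record the within-cluster sum of squares (inertia) for each k.
inertias = {}
for k in range(1, len(profiles) + 1):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(profiles)
    inertias[k] = model.inertia_

# Fit the final model with the chosen k (2 here only because the toy data
# has four districts; the project settled on 8 clusters).
chosen_k = 2
labels = KMeans(n_clusters=chosen_k, n_init=10, random_state=0).fit_predict(profiles)
clustered = profiles.assign(cluster=labels)
print(clustered["cluster"])
```

Averaging the one-hot columns keeps districts with many venues from dominating the distance calculation purely because of their size.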
The project identified several key findings:
- Yonghe and Daan were found to have the highest population density.
- Banqiao and Daan exhibited a diverse mix of economic activities, making them strong candidates for new businesses.
- The clustering analysis highlighted 8 distinct groups of districts based on the types of venues.
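A density ranking like the one behind the first finding could be produced along these lines. The figures below are placeholders chosen for easy arithmetic, not the project's data, and the column names are assumptions.

```python
import pandas as pd

# Placeholder figures for illustration only; not the project's data.
districts = pd.DataFrame({
    "district": ["District A", "District B", "District C"],
    "population": [200_000, 300_000, 120_000],
    "area_km2": [5.0, 12.0, 30.0],
})

# Population density = residents per square kilometre.
districts["density_per_km2"] = districts["population"] / districts["area_km2"]

# Rank districts from most to least dense.
ranked = districts.sort_values("density_per_km2", ascending=False).reset_index(drop=True)
print(ranked[["district", "density_per_km2"]])
```

Note that the district with the largest population is not necessarily the densest, which is why the ranking uses the ratio rather than the raw count.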
- Python
- Jupyter Notebooks
- Pandas
- Folium
- GeoJSON
- Foursquare API
- K-means Clustering (Scikit-learn)
To run this project locally, follow these steps:
- Clone the repository:
git clone https://github.com/boba-milktea/Coursera_Capstone.git
- Install the required Python libraries:
pip install pandas folium geopy requests scikit-learn
- Open the Jupyter notebooks to explore the project:
jupyter notebook
- Run the notebooks to see the analysis and results.