NYC Business Longevity Analysis

Dataset

Legally_Operating_Businesses.csv
- Source: NYC Open Data: Legally Operating Businesses
- URL: https://data.cityofnewyork.us/Business/Legally-Operating-Businesses/w7w3-xahh (accessed: 11.20.2019)
- Description: This dataset contains 200375 instances of businesses that have existed or are still operating in New York City. Each instance contains location information: City, Street, Zip Code; license information: License Type (Business or Personal), License Creation/Expiration Date; and business information: names, phone numbers, and Industry Type.

new_york.csv
- Source: Zip Code Demographics By State And County Batch Report
- URL: https://www.cdxtech.com/tools/bulk/demographics/state-and-county/?from=singlemessage&isappinstalled=0 (accessed: 11.22.2019)
- Description: The demographic data contains all 2153 zip codes of New York State. Demographic information includes Population, Ethnicity Distribution, Households, Sex Percentage, Income, etc.
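
Both files carry a zip code, which makes it the natural join key. The following is a minimal sketch of loading and joining them with pandas, not the notebook's actual code. The column names `Address ZIP` and `ZipCode` are assumptions for illustration and may not match the real CSV headers, and the toy rows are made up.

```python
import io
import pandas as pd

def load_and_merge(business_src, demo_src):
    """Read both CSVs and left-join demographics onto businesses by ZIP."""
    # Read zip codes as strings so leading zeros are not lost.
    # "Address ZIP" and "ZipCode" are assumed (hypothetical) column names.
    biz = pd.read_csv(business_src, dtype={"Address ZIP": str})
    demo = pd.read_csv(demo_src, dtype={"ZipCode": str})

    # Normalize both keys to a 5-character ZIP (drops any ZIP+4 suffix).
    biz["ZIP"] = biz["Address ZIP"].str.strip().str[:5]
    demo["ZIP"] = demo["ZipCode"].str.strip().str.zfill(5)

    # A left join keeps every business, even without a demographic match.
    return biz.merge(demo.drop(columns="ZipCode"), on="ZIP", how="left")

# Toy stand-ins for Legally_Operating_Businesses.csv and new_york.csv.
biz_csv = (
    "Business Name,Address ZIP,Industry\n"
    "A Deli,10001,Retail\n"
    "B Laundry,11201-1234,Laundry\n"
    "C Garage,99999,Garage\n"
)
demo_csv = "ZipCode,Population\n10001,100\n11201,200\n"

merged = load_and_merge(io.StringIO(biz_csv), io.StringIO(demo_csv))
print(merged[["Business Name", "ZIP", "Population"]])
```

With the real files, the same function could be called with the two file paths in place of the `StringIO` objects. Rows whose zip code has no demographic match end up with missing values, which is one place where a later NaN-dropping step would matter.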

Project Goal

Using the two datasets above, our project aims to:

- predict the lifespan of a business;
- understand how different factors impact that lifespan.
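
As an illustration of what "lifespan" could mean here, the sketch below derives a duration and an event flag from the license dates. It is written under stated assumptions rather than taken from the notebooks: the column names `License Creation Date` and `License Expiration Date` are assumed spellings of the fields described above, the snapshot date reuses the dataset's access date, and licenses that expire after the snapshot are treated as right-censored (still operating).

```python
import pandas as pd

def add_lifespan(df, snapshot="2019-11-20"):
    """Add a duration in years and an event flag for survival analysis."""
    out = df.copy()
    # Assumed column names; unparseable dates become NaT.
    start = pd.to_datetime(out["License Creation Date"], errors="coerce")
    end = pd.to_datetime(out["License Expiration Date"], errors="coerce")
    cutoff = pd.Timestamp(snapshot)

    # event_observed = 1 if the license expired on or before the snapshot.
    # Later (or missing) expirations count as censored in this sketch.
    expired = end <= cutoff
    out["event_observed"] = expired.astype(int)

    # For censored rows, the observed duration stops at the snapshot.
    observed_end = end.where(expired, cutoff)
    out["lifespan_years"] = (observed_end - start).dt.days / 365.25
    return out

# Made-up rows for demonstration only.
toy = pd.DataFrame({
    "License Creation Date": ["2015-01-01", "2018-11-20"],
    "License Expiration Date": ["2017-01-01", "2021-01-01"],
})
with_target = add_lifespan(toy)
print(with_target[["event_observed", "lifespan_years"]])
```

Keeping the event flag alongside the duration is what lets survival models distinguish a business that closed from one that was simply still open when the data was downloaded.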

Jupyter Notebooks

- Datacleaning_Attempt.ipynb: First attempt at data cleaning; includes the initial brainstormed ideas.
- Regression_FeatureSelection.ipynb: Linear regression in depth (not included in the final report): backward selection, VIF (to remove collinearity), and assumption checks.
- Cox's_Proportional_Hazard_Analysis.ipynb: Walks through the entire project.
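
For readers unfamiliar with VIF, the sketch below computes the variance inflation factor of each feature from scratch with NumPy: each column is regressed on the others, and VIF is 1 / (1 - R²). This is an illustrative implementation, not the notebook's code, and the cutoff of 10 mentioned in the comments is a common rule of thumb rather than a value taken from the project.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of a 2-D feature array."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on all other columns plus an intercept.
        A = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
        out[j] = np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return out

# Synthetic demo: column c is almost a copy of column a.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.05 * rng.normal(size=200)
scores = vif(np.column_stack([a, b, c]))

# Features with VIF above roughly 10 are often candidates for removal.
print(scores)
```

In the demo, the two nearly identical columns receive large scores while the independent column stays close to 1, which is the pattern a collinearity check looks for.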

Table of Contents

- Load Raw Data
- Clean Raw Data
  - Clean raw business data
    - Remove non-NYC businesses
    - Keep columns that are relevant to the problem of interest
    - Check selection bias
    - Drop rows with NaN values
  - Clean raw NYC data
    - Keep only columns that are relevant to the problem of interest
    - Create more useful features from current columns
  - Merge the two dataframes on the ZIP column
- Feature Processing
  - One-hot encode industry type
  - Create target variables
  - Generate clean data for modeling
- Baseline: Simple Regression
- Model 1: Kaplan-Meier Estimate
- Model 2: Cox Proportional Hazards regression model
- Model 3: Multiclass Regression
  - Algorithms: Random Forest, Decision Tree, K-nearest Neighbors, Logistic Regression
  - Metrics: Confusion Matrix, Precision, Recall, F1-score
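
To make the Kaplan-Meier step in the outline above concrete, here is a from-scratch sketch of the estimator in NumPy. The notebook may well rely on a survival-analysis library instead; this version only illustrates the product-limit formula, where the survival probability is multiplied by (1 - deaths / at_risk) at each observed event time. The durations and event flags in the demo are made up.

```python
import numpy as np

def kaplan_meier(durations, events):
    """Return (event_times, survival_probabilities) for right-censored data."""
    durations = np.asarray(durations, dtype=float)
    events = np.asarray(events, dtype=bool)

    # Survival only drops at times where an event was actually observed.
    times = np.unique(durations[events])
    surv = np.empty(len(times))
    s = 1.0
    for i, t in enumerate(times):
        at_risk = np.sum(durations >= t)            # still under observation
        deaths = np.sum((durations == t) & events)  # events exactly at t
        s *= 1.0 - deaths / at_risk
        surv[i] = s
    return times, surv

# Toy data: lifespans in years, with 0 marking a censored observation.
durations = [1, 2, 2, 3, 4, 5]
events = [1, 1, 0, 1, 0, 1]
km_times, km_surv = kaplan_meier(durations, events)
for t, p in zip(km_times, km_surv):
    print(f"t={t:.0f}  S(t)={p:.4f}")
```

Censored businesses contribute to the at-risk count until their last observed time but never trigger a drop in the curve, which is how the estimator uses still-operating businesses without pretending to know when they close.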

