Among the most common reasons businesses fail are a lack of research, not being in the right market, and not reaching the right people. Research shows that 21.5% of small businesses and startups fail within the first year, 30% within the second year, 50% within the fifth year, and 70% within their 10th year (source). 42% of business failures result from not understanding what people actually need or want (source). The numbers are even worse for e-commerce and online businesses: 90% of online businesses fail within four months (source). To succeed, businesses must be able to identify which metrics will improve profits for their online stores. For this project, we will explore and analyze a customer's propensity to purchase and provide recommendations on how to increase and improve conversion rates and mitigate attribution issues.
We will be using Google Analytics 360 obfuscated sample data from the Google Merchandise Store, accessed through the BigQuery REST API. The sample dataset covers 12 months (August 2016 to August 2017); however, we will focus on only one of those months (July 1, 2017 to July 31, 2017) and analyze behavior up to 2 weeks before a transaction is made.
The overview describes the data as typical of what an ecommerce website would see and includes the following information:
- Traffic source data
- Content data
- Transactional data
- All users have view access to the dataset and can only query the dataset and generate reports
- Data for some fields is obfuscated, such as fullVisitorId
- Data for some fields is removed, such as clientId, adWordsClickInfo, and geoNetwork
- “Not available in demo dataset” will be returned for STRING values
- “null” will be returned for INTEGER values when querying the fields containing no data
To use BigQuery, you will need to sign up for an account, get an API key, and install the google.cloud package to query the API from a Jupyter notebook. Follow the instructions here for quick start and authentication. You may also query from the Google Cloud console if that's easier.
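A minimal sketch of that setup, assuming you have created a Google Cloud project and downloaded a service-account key (the key path and project ID below are placeholders):

```python
# pip install google-cloud-bigquery
import os

from google.cloud import bigquery

# Path to your service-account key (placeholder; use your own).
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/service-account-key.json"

# The client reads the credentials from the environment variable above.
client = bigquery.Client(project="your-project-id")

# Sanity check: a trivial query should return one row.
print(list(client.query("SELECT 1 AS ok").result()))
```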
- Python
- Standard SQL
- sklearn for preprocessing, modeling the data, and calculating metrics
- pandas and numpy for data wrangling and reformatting
- matplotlib and seaborn for data and metric visualization
- google.cloud to query the API from jupyter notebook
Before wrangling the data, we need to visualize how the data is formatted.
For a full breakdown and justification of features chosen, please see 01_eda_data_overview.ipynb.
Our goal is to find features that give us a high-level overview of each unique visitor's behavior over multiple sessions. We are interested in a customer's propensity to buy based on transactions and behaviors across multiple sessions, broken down by days and weeks. Some features of this data are nested and need to be accessed within a column via a query (a schema sketch follows the list). The high-level features are as follows:
- visitorId
- visitNumber
- visitId
- visitStartTime
- date
- totals
- trafficSource
- device
- geoNetwork
- customDimensions
- hits
- fullVisitorId
- userId
- clientId
- channelGrouping
- socialEngagementType
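To see how these columns, including the nested RECORD fields, are laid out, we can inspect one day's table schema. A minimal sketch, assuming the public GA360 sample table and an authenticated client as set up earlier:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Print each top-level column and its type; RECORD columns (totals,
# trafficSource, device, geoNetwork, hits) contain the nested fields.
table = client.get_table(
    "bigquery-public-data.google_analytics_sample.ga_sessions_20170701"
)
for field in table.schema:
    print(field.name, field.field_type)
```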
Based on our thorough examination of each feature, the fields that will be most useful in creating our features and labels are the following (all but fullVisitorId and date are nested within the totals column; a query sketch follows the list):
- fullVisitorId - to identify each unique visitor regardless of what day they visit on
- date - to view total transactions by day
- visits - this represents the number of sessions
- hits - total number of interactions within a session
- pageviews - total number of page views in a session
- bounces - total bounces. A bounce is recorded by the Analytics server when only a single interaction is made across the whole website
- sessionQuality - an estimate of how close a particular session was to transacting, on a scale from 1 to 100, with 1 being the lowest session quality and 100 being closest to transacting
- timeOnSite - total time of each session in seconds
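Because these fields live inside the totals RECORD, they are referenced with dot notation in Standard SQL. A hedged sketch of pulling them for our one-month window follows; note that in the sample schema the session quality column is named sessionQualityDim, aliased here to match the list above:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Dot notation reaches into the nested totals RECORD; the wildcard table
# plus _TABLE_SUFFIX restricts the scan to July 2017.
query = """
SELECT
  fullVisitorId,
  date,
  totals.visits,
  totals.hits,
  totals.pageviews,
  totals.bounces,
  totals.sessionQualityDim AS sessionQuality,
  totals.timeOnSite
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE _TABLE_SUFFIX BETWEEN '20170701' AND '20170731'
LIMIT 10
"""
print(client.query(query).to_dataframe())
```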
Based on the question we're trying to answer, we will use the transactions field to create our labels, indicating whether a visitor makes a purchase.
The other features, as noted above, have either been deprecated, are not available in the demo dataset, are duplicates, or are not aligned with our goal of providing a high-level overview of each visitor's behavior.
For this next step, we will wrap each query in a function. We need to set up the start and end dates as well as indicate whether the dataset will be a train or test set.
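A sketch of that pattern, applied to step 1 of the list below (the function name, project, and dataset names are illustrative):

```python
from google.cloud import bigquery

client = bigquery.Client()

def create_first_transaction_table(start_date: str, end_date: str, dataset_type: str) -> None:
    """Create a temporary table of each purchaser's earliest transaction date.

    start_date / end_date: 'YYYYMMDD' table suffixes bounding the window.
    dataset_type: 'train' or 'test'; used to name the temporary table.
    """
    query = f"""
    CREATE OR REPLACE TABLE `your-project.propensity.{dataset_type}_first_transactions` AS
    SELECT fullVisitorId, MIN(date) AS first_transaction_date
    FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`
    WHERE _TABLE_SUFFIX BETWEEN '{start_date}' AND '{end_date}'
      AND totals.transactions > 0
    GROUP BY fullVisitorId
    """
    client.query(query).result()  # Blocks until the table exists.

create_first_transaction_table("20170701", "20170731", "train")
```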
We will be performing the following steps (note that each step except the final one creates a temporary table in your BigQuery project, where you can view the results):
- Getting the date of each user's earliest known transaction
- Getting and creating the features by visitor ID and by each user's visit date relative to the transaction date. Because each feature is recorded per session, we need to create a training dataset that aggregates these features. We will sum all features except for sessionQuality, which we will average by day and round to two decimal places. We need to query visitors who made purchases separately from visitors who didn't
- Breaking out each feature by its difference in days and weeks prior to the transaction date. Again, we need to query visitors who made purchases separately from visitors who didn't. For users that made no transactions, we will pick 5 random days as our false 'transaction' dates and remove any visitors who also appear in the purchaser group (i.e., who made purchases on other days). We will then break out the features in the following format:
- day0 - behavior on site on the day of the first transaction
- day1 - behavior on site 1 day before the first transaction
- day2 - behavior on site 2 days before the first transaction
- day3 - behavior on site 3 days before the first transaction
- day4_6 - behavior on site 4-6 days before the first transaction
- w2 - behavior on site 7-14 days before the first transaction (or behavior within two weeks before the first transaction)
- Combining the temporary true-feature and false-feature tables from step 3 and labeling them (1 for purchase, 0 for no purchase) to create our training dataset. We will also limit the number of false transactions to 19,000 and downsample our data to provide a reasonable ratio of true and false purchases. For further documentation, go to the python folder of this project and view all queries. For brevity, we will combine each query and simultaneously create our temporary tables for the training dataset.
The final query will not create a temporary table. Instead, we will transform the results into a dataframe that we can analyze, preprocess, and model in the next notebooks. A hedged sketch of the bucketing and dataframe conversion follows.
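To make the day buckets concrete, here is a minimal sketch for a single feature (hits). The real queries in the python folder handle every feature, the false-transaction sampling, and the day2/day3 buckets, which follow the same pattern:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Bucket each session by its distance in days from the visitor's first
# transaction date, then aggregate one feature (hits) per bucket.
query = """
WITH sessions AS (
  SELECT fullVisitorId,
         PARSE_DATE('%Y%m%d', date) AS session_date,
         totals.hits AS hits
  FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`
  WHERE _TABLE_SUFFIX BETWEEN '20170701' AND '20170731'
),
first_tx AS (
  SELECT fullVisitorId,
         PARSE_DATE('%Y%m%d', MIN(date)) AS first_transaction_date
  FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`
  WHERE _TABLE_SUFFIX BETWEEN '20170701' AND '20170731'
    AND totals.transactions > 0
  GROUP BY fullVisitorId
)
SELECT
  s.fullVisitorId AS visitor_id,
  SUM(IF(DATE_DIFF(t.first_transaction_date, s.session_date, DAY) = 0, s.hits, 0)) AS day0_hits,
  SUM(IF(DATE_DIFF(t.first_transaction_date, s.session_date, DAY) = 1, s.hits, 0)) AS day1_hits,
  SUM(IF(DATE_DIFF(t.first_transaction_date, s.session_date, DAY) BETWEEN 4 AND 6, s.hits, 0)) AS day4_6_hits,
  SUM(IF(DATE_DIFF(t.first_transaction_date, s.session_date, DAY) BETWEEN 7 AND 14, s.hits, 0)) AS w2_hits
FROM sessions s
JOIN first_tx t USING (fullVisitorId)
GROUP BY visitor_id
"""
df = client.query(query).to_dataframe()
```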
In order to accurately assess whether our model works, we will pick a date interval completely independent of the train and test sets and apply the same pipeline to create our validation set.
COLUMN | % MISSING |
---|---|
visitor_id | 0.00 |
day0_time_on_site_seconds | 80.69 |
day1_time_on_site_seconds | 85.97 |
day2_time_on_site_seconds | 87.45 |
day3_time_on_site_seconds | 87.32 |
day4_6_time_on_site_seconds | 72.10 |
w2_time_on_site_seconds | 33.09 |
day0_hits | 80.69 |
day1_hits | 85.97 |
day2_hits | 87.45 |
day3_hits | 87.32 |
day4_6_hits | 72.10 |
w2_hits | 33.09 |
day0_pageViews | 80.69 |
day1_pageViews | 85.97 |
day2_pageViews | 87.45 |
day3_pageViews | 87.32 |
day4_6_pageViews | 72.10 |
w2_pageViews | 33.09 |
day0_bounces | 80.69 |
day1_bounces | 85.97 |
day2_bounces | 87.45 |
day3_bounces | 87.32 |
day4_6_bounces | 72.10 |
w2_bounces | 33.09 |
day0_sessions | 80.69 |
day1_sessions | 85.97 |
day2_sessions | 87.45 |
day3_sessions | 87.32 |
day4_6_sessions | 72.10 |
w2_sessions | 33.09 |
day0_session_quality | 80.69 |
day1_session_quality | 85.97 |
day2_session_quality | 87.45 |
day3_session_quality | 87.32 |
day4_6_session_quality | 72.10 |
w2_session_quality | 33.09 |
label | 0.00 |
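The percentages above can be reproduced directly from the training dataframe (here assumed to be the df returned by the final query):

```python
# Percentage of missing values per column, rounded to two decimal places.
print(df.isnull().mean().mul(100).round(2))
```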
There is a lot of missing data, but this is normal because users do not interact with a website every day. We don't want to drop these rows or impute them with arbitrary values, as the missing data is itself important to understanding the behavior of each user.
Based on the ratio between no transactions and confirmed transactions, we can see that user transactions make up only about 4.8% of the data. This means our labels are highly skewed. We will need to take this into account when building our model, when choosing appropriate metrics, and possibly by normalizing or standardizing the data during preprocessing. Let's look at the behavior and distribution of each feature by label to get an overall idea of how each visitor behaves in the days and weeks prior to a transaction (or the absence of one).
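The class ratio itself is a one-liner (again assuming the df from above):

```python
# Share of each label; purchases (label 1) are ~4.8% of rows.
print(df["label"].value_counts(normalize=True).mul(100).round(2))
```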
Based on the general overview and distribution of features for users who made purchases, we can see that there are more interactions in the 6 days leading up to the transaction than in the 7-14 days prior.
Time on site, hits, page views, bounces, sessions, and session quality all increase, particularly on day0, day1, and day2 prior to purchase. Intuitively, this makes sense: users spend the most time on the site on the day of the transaction, and increasingly more time leading up to the purchase as they research, compare prices, and finalize their decisions.
For users who made no purchases, the general overview and distribution of features show that fewer interactions are made on site as the false "transaction" date gets closer. Time on site, hits, page views, bounces, sessions, and session quality experience only a slight increase in user interactions, particularly between day0 and day1.
However, in comparison to users who made purchases, they spend drastically less time on the website. Interestingly, the total number of bounces per user per day was on average higher for users who made no purchases than for users who did. This suggests bounces might be a significant indicator in predicting a customer's propensity to purchase.
Based on the heatmap above, hits and page views are strongly positively correlated, so we may need to watch for multicollinearity between those features. We can also see a negative correlation between bounces and time on site, hits, page views, and session quality. There are also strong positive and negative correlations among all features 1 day before a transaction, except sessions.
Looking at a preview of the data once again, and at the percentage of customers who don't purchase (label 0) vs. customers who do (label 1), we can see that purchasers make up less than 5% of the dataset. In other words, the data is highly imbalanced, and a customer making a purchase is a rare event. The violin plot also shows significant outliers in the data.
We will perform each of the following preprocessing techniques to mitigate biases and inaccuracies in predicting customer propensity:
- Scaling (Standardization)
- Imputing missing data
The null values in the data are valuable in and of themselves: they represent the days each visitor was not present on the site. We can therefore impute the data by filling each null value with zeroes.
Scaling the data brought the standard deviation close to 1 and the mean close to 0. However, it looks like we are still dealing with very large outliers and a skewed distribution.
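A minimal sketch of both steps with sklearn, assuming df holds a visitor_id column, the feature columns, and the label:

```python
from sklearn.preprocessing import StandardScaler

# Nulls mean "no visit that day", so zero is the semantically correct fill.
X = df.drop(columns=["visitor_id", "label"]).fillna(0)
y = df["label"]

# Standardization rescales each feature to mean ~0 and standard deviation ~1,
# but it does not remove outliers or fix skew.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
```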
We will be using the following models (a fitting sketch follows the list):
- CatBoost
- Random Forest
- Logistic Regression
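A hedged sketch of fitting all three, reusing X_scaled and y from the preprocessing sketch above (hyperparameters in the actual notebooks may differ):

```python
from catboost import CatBoostClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stratify so the ~5% positive class is represented in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.25, stratify=y, random_state=42
)

models = {
    "catboost": CatBoostClassifier(verbose=0, random_state=42),
    "random_forest": RandomForestClassifier(random_state=42),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, "accuracy:", model.score(X_test, y_test))
```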
Please note that depending on the goals of the company, certain features may be dropped or added as needed. The model in its current form is simplified; it does not consider platform constraints, which would need to be defined by the company.
We will look at overall accuracy on the first pass to determine whether the model is sampling from the data properly. We will then look at the following metrics (a sketch of the evaluation follows the list):
- ROC curve to measure the True Positive Rate
- LIFT score for customer segmentation information; this is our most important metric for identifying our target users
- SHAP value (impact on model output) for CatBoost or feature importance on Random Forest and Logistic Regression
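A sketch of the evaluation for the CatBoost model, reusing the models dictionary above. The lift calculation here is a simple top-decile version and may differ from the notebook's exact definition:

```python
import numpy as np
import shap
from sklearn.metrics import roc_auc_score

# ROC AUC summarizes the true/false positive trade-off across thresholds.
probs = models["catboost"].predict_proba(X_test)[:, 1]
print("ROC AUC:", roc_auc_score(y_test, probs))

# Simple lift at the top decile: concentration of actual purchasers among
# the 10% of visitors with the highest predicted scores vs. the base rate.
top_decile = np.argsort(probs)[::-1][: len(probs) // 10]
lift = y_test.iloc[top_decile].mean() / y_test.mean()
print("Lift @ top decile:", round(lift, 2))

# SHAP values attribute each prediction to individual features.
explainer = shap.TreeExplainer(models["catboost"])
shap.summary_plot(explainer.shap_values(X_test), X_test)
```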
CatBoost had the highest LIFT score with 99.91% model accuracy, successfully identified most users who purchased, and had a True Positive Rate similar to the Random Forest model without overfitting.
The Random Forest model, in comparison, had a lower LIFT score. It achieved 100% accuracy but had the lowest lift score and lowest true positive rate, which suggests severe overfitting to the data.
Based on analysis of each feature, user engagement increases as they approach the purchase date (day0). Transactions and activities are highly correlated to the day of purchase. Page views, session quality, and time on site had the highest positive correlations. Bounces have a negative correlation. This suggests that the company should continue minimizing bounces or single touch points.
Some applications for this model include:
- Identifying the segments of users likely to organically convert, and allocating a marketing budget to this segment while mitigating attribution issues
- Optimizing email and marketing campaigns to specific users
- Providing discounts for other segments less likely to convert
- Encouraging users to make purchases via paid media marketing campaigns (on Amazon, Facebook, Instagram, etc.)
- Improving customer service by rerouting and prioritizing calls based on propensity to purchase
In conclusion, there are numerous other models that could have been explored using Google's BigQuery data, including:
- AdaBoost
- XGBoost
We could also have corrected for overfitting in the models using the techniques below (a sketch follows the list):
- Cross-validation
- Bootstrapping
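For example, a minimal stratified cross-validation sketch, reusing the objects from the modeling sketches above:

```python
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Stratified folds keep the ~5% positive rate consistent across splits.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(models["random_forest"], X_scaled, y, cv=cv, scoring="roc_auc")
print(f"ROC AUC: {scores.mean():.3f} +/- {scores.std():.3f}")
```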
Other factors to improve the model include:
- Refining the preprocessing pipeline by normalizing the distribution of the data via:
- upsampling
- downsampling
- log transformation
- filling null values with:
- mean of each feature
- median of each feature
Thank you Nik for being an amazing mentor. It was wonderful working with you.
- https://www.investopedia.com/articles/personal-finance/040915/how-many-startups-fail-and-why.asp
- https://www.lendingtree.com/business/small/failure-rate/
- https://www.huffpost.com/entry/10-reasons-why-your-new-online-business-will-fail_b_7053610
- https://console.cloud.google.com/marketplace/product/obfuscated-ga360-data/obfuscated-ga360-data?project=lexical-script-761
- https://support.google.com/analytics/answer/3437719?hl=en
- https://www.analyticsvidhya.com/blog/2017/08/catboost-automated-categorical-data/
- Stack Overflow