Cognihack Data Science Track |
---|
Predicting repeat shopping likelihood with Python, Spark and Watson Machine Learning |
Our exercise will take you through the process of creating a predictive model in Python using the data manipulation and machine learning libraries distributed with Spark.
Once we've worked through the process of reading, understanding and preparing our data and have built a simple model together, we'll deploy it to the Watson Machine Learning service and make it available as a real-time scoring service.
You should spend the remaining time working as a group to speculate on how you might improve these predictions. The cognihack tutors will endeavour to assist with any experimentation to help you create and evaluate refinements to the baseline model.
The learning goals of this exercise are:
- Loading CSV files into an Apache® Spark DataFrame.
- Exploring the data using the features within:
a) Spark's data wrangling Python API: pyspark.sql;
b) the pandas data wrangling library; and
c) matplotlib for exploratory plots. - Engineering some basic predictive features, again using pyspark.sql and Spark user defined functions (UDFs).
- Preparing the data for training and evaluation.
- Creating an Apache® Spark machine learning pipeline.
- Training and evaluating a model.
- Persisting a pipeline and model in Watson Machine Learning repository.
- Deploying the model for online scoring using Wastson Machine Learning API.
- Scoring sample scoring data using the Watson Machine Learning API.
Before we begin working through this notebook, you must perform the following setup tasks:
- Sign up for the IBM Data Science Experience (using w3 credentials) and create a new project;
- Retrieve the Acquire Valued Shoppers Challenge data from the Cognihack Box folder;
- Import the data (in csv format) into the Data Science Experience as a 'data asset' (the transactions_subset.csv file will require decompression prior to upload);
- Create a new notebook, using the 'import from URL' option. Use the following URL: https://github.ibm.com/Stuart-Lynn/cognihack_dsx/edit/master/cognihack_datascience_participant.ipynb
- Make sure the notebook is configured to use a Spark 2.0 kernel and Python 2.x; and
- Create a Watson Machine Learning Service instance (a free plan is offered).