This project uses Hadoop MapReduce to perform customer segmentation on the Online Retail dataset. The objective is to calculate three key metrics for each customer:
- Total Spend
- Frequency (number of purchases)
- Recency (time in days since the last purchase)
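Concretely, for a customer c with transaction set T_c, a natural reading of these metrics is the following (whether Frequency counts transaction rows or distinct invoices, and which reference date anchors Recency, depends on the implementation in src/):

$$
\mathrm{TotalSpend}_c = \sum_{i \in T_c} \mathrm{Quantity}_i \times \mathrm{UnitPrice}_i, \qquad
\mathrm{Frequency}_c = |T_c|, \qquad
\mathrm{Recency}_c = t_{\mathrm{ref}} - \max_{i \in T_c} \mathrm{InvoiceDate}_i
$$

where $t_{\mathrm{ref}}$ is a fixed reference date, for example the day after the last transaction in the dataset.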
Key features:
- Hadoop-based scalable processing for large datasets.
- Key customer segmentation metrics: TotalSpend, Frequency, and Recency.
- Preprocessing script for dataset preparation.
- MapReduce implementation for distributed data processing.
Technologies used:
- Hadoop: A framework for distributed storage and processing of big data.
- Java: The primary programming language used for writing the MapReduce logic.
- HDFS: The Hadoop Distributed File System for storing large datasets.
- MapReduce: The programming model used for distributed data processing.
- Python: For dataset preprocessing and analysis.
- Libraries: pandas, matplotlib, seaborn, scipy.stats
- Linux (Ubuntu): Operating System.
The dataset used in this project is the Online Retail Dataset. It contains all transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retailer. The company mainly sells unique all-occasion gifts, with many of its customers being wholesalers.
Sample of the dataset:
| InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice (Sterling) | CustomerID | Country |
|---|---|---|---|---|---|---|---|
| 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 01/12/2010 8:26 | 2.55 | 17850.0 | United Kingdom |
| 536365 | 71053 | WHITE METAL LANTERN | 6 | 01/12/2010 8:26 | 3.39 | 17850.0 | United Kingdom |
| 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 01/12/2010 8:26 | 2.75 | 17850.0 | United Kingdom |
| 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 01/12/2010 8:26 | 3.39 | 17850.0 | United Kingdom |
| 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 01/12/2010 8:26 | 3.39 | 17850.0 | United Kingdom |
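For orientation, the MapReduce logic in src/ follows the standard pattern: the mapper keys each transaction row by CustomerID, and the reducer folds one customer's rows into TotalSpend, Frequency, and Recency. The sketch below is illustrative only; the class names, the column order of the preprocessed CSV, the date pattern, and the reference date are assumptions rather than the exact code in src/.

```java
import java.io.IOException;
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Illustrative mapper: one input line = one transaction of the preprocessed CSV.
// Assumed (hypothetical) column order: CustomerID,Quantity,UnitPrice,InvoiceDate.
class SegmentationMapper extends Mapper<LongWritable, Text, Text, Text> {
    private static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("d/M/yyyy H:mm"); // assumed date pattern

    @Override
    protected void map(LongWritable key, Text value, Context ctx)
            throws IOException, InterruptedException {
        String[] f = value.toString().split(",");
        if (f.length < 4 || f[0].isEmpty() || f[0].equals("CustomerID")) {
            return; // skip the header and rows without a CustomerID
        }
        try {
            double lineTotal = Double.parseDouble(f[1]) * Double.parseDouble(f[2]);
            long purchaseDay = LocalDateTime.parse(f[3], FMT).toLocalDate().toEpochDay();
            // Emit (customer, "spend,purchaseDay"); the reducer aggregates per customer.
            ctx.write(new Text(f[0]), new Text(lineTotal + "," + purchaseDay));
        } catch (RuntimeException e) {
            // Skip malformed rows (bad numbers or dates) instead of failing the job.
        }
    }
}

// Illustrative reducer: folds one customer's transactions into the three metrics.
class SegmentationReducer extends Reducer<Text, Text, Text, Text> {
    // Assumed Recency anchor: the day after the last invoice in the dataset (10/12/2011).
    private static final long REFERENCE_DAY = LocalDate.of(2011, 12, 10).toEpochDay();

    @Override
    protected void reduce(Text customer, Iterable<Text> values, Context ctx)
            throws IOException, InterruptedException {
        double totalSpend = 0.0;
        long frequency = 0;
        long lastPurchaseDay = Long.MIN_VALUE;
        for (Text v : values) {
            String[] parts = v.toString().split(",");
            totalSpend += Double.parseDouble(parts[0]);
            frequency++; // counts transaction rows; distinct invoices would need the InvoiceNo
            lastPurchaseDay = Math.max(lastPurchaseDay, Long.parseLong(parts[1]));
        }
        long recencyDays = REFERENCE_DAY - lastPurchaseDay;
        ctx.write(customer, new Text(totalSpend + "," + frequency + "," + recencyDays));
    }
}
```

In the repository these classes presumably live in separate files under src/, together with the Driver used to launch the job.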
To get started with this project, follow these steps:
- Install Hadoop on your local machine or use a cloud-based Hadoop cluster.
- Ensure that Java (JDK 8 or later) is installed on your system.
- Install Python and the required libraries:
  pip install pandas matplotlib seaborn scipy
- Download the Online Retail Dataset.
- Clone this repository:
  git clone https://github.com/chouaib-629/CustomerSegmentation.git
- Navigate to the project directory:
  cd CustomerSegmentation
- The downloaded dataset is in Online Retail.xlsx format. Save it as online_retail.csv using any spreadsheet tool.
- Preprocess the dataset using the provided Python script:
  - .py format: python preprocessing/main.py
  - .ipynb format: open and run preprocessing/main.ipynb in Jupyter Notebook.
- Compile the Java classes (create the compiled_classes output directory first if it does not already exist):
  javac -classpath `hadoop classpath` -d compiled_classes src/*.java
- Package the classes into a JAR file:
  jar cf CustomerSegmentation.jar -C compiled_classes/ .
- Create a directory in HDFS to store the dataset:
  hdfs dfs -mkdir /CustomerSegmentation
- Upload the preprocessed dataset to HDFS:
  hdfs dfs -put processed_online_retail.csv /CustomerSegmentation/
Run the Hadoop job using the following command:
hadoop jar CustomerSegmentation.jar Driver /CustomerSegmentation/processed_online_retail.csv /CustomerSegmentation/output/
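Here Driver is the job's entry point. A minimal sketch of how such a driver is typically wired, reusing the hypothetical SegmentationMapper and SegmentationReducer class names from the earlier sketch (the actual classes and key/value types in src/ may differ):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Illustrative driver: args[0] = input CSV in HDFS, args[1] = output directory.
public class Driver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "customer segmentation");
        job.setJarByClass(Driver.class);
        job.setMapperClass(SegmentationMapper.class);   // assumed class name (see sketch above)
        job.setReducerClass(SegmentationReducer.class); // assumed class name
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Note that Hadoop refuses to start a job whose output directory already exists, so remove /CustomerSegmentation/output/ (hdfs dfs -rm -r /CustomerSegmentation/output) before re-running.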
To view the results of the MapReduce job, use the following command:
hdfs dfs -cat /CustomerSegmentation/output/part-r-00000
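Each line of part-r-00000 holds one customer's aggregated metrics. With a reducer like the sketch above, a line would contain the CustomerID, a tab, and then TotalSpend,Frequency,Recency; the exact field order and separators depend on the reducer actually shipped in src/.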
Copy the output file from HDFS to your local storage for further analysis:
hdfs dfs -get /CustomerSegmentation/output/part-r-00000 output/result.csv
Contributions are welcome! To contribute:
- Fork the repository.
- Create a new branch:
  git checkout -b feature/feature-name
- Commit your changes:
  git commit -m "Add feature description"
- Push to the branch:
  git push origin feature/feature-name
- Open a pull request.
For questions or support, please contact me.