This project uses Hadoop MapReduce to perform customer segmentation on the Online Retail dataset. The objective is to calculate three key metrics for each customer:
- Total Spend
- Frequency (number of purchases)
- Recency (time in days since the last purchase)
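Concretely, for a customer c with transaction set T_c, a natural reading of these metrics is the following (whether Frequency counts transaction rows or distinct invoices, and which reference date anchors Recency, depends on the implementation in src/):

$$
\mathrm{TotalSpend}_c = \sum_{i \in T_c} \mathrm{Quantity}_i \times \mathrm{UnitPrice}_i, \qquad
\mathrm{Frequency}_c = |T_c|, \qquad
\mathrm{Recency}_c = t_{\mathrm{ref}} - \max_{i \in T_c} \mathrm{InvoiceDate}_i
$$

where $t_{\mathrm{ref}}$ is a fixed reference date, for example the day after the last transaction in the dataset.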
Key features:
- Hadoop-based scalable processing for large datasets.
- Key customer segmentation metrics: TotalSpend, Frequency, and Recency.
- Preprocessing script for dataset preparation.
- MapReduce implementation for distributed data processing.
Technologies used:
- Hadoop: A framework for distributed storage and processing of big data.
- Java: The primary programming language used for writing the MapReduce logic.
- HDFS: The Hadoop Distributed File System for storing large datasets.
- MapReduce: The programming model used for distributed data processing.
- Python: For dataset preprocessing and analysis.
- Libraries: pandas, matplotlib, seaborn, scipy.stats
- Linux (Ubuntu): Operating System.
The dataset used in this project is the Online Retail Dataset. It contains all transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retailer. The company mainly sells unique all-occasion gifts, with many of its customers being wholesalers.
Sample of the dataset:
| InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice (Sterling) | CustomerID | Country |
|---|---|---|---|---|---|---|---|
| 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 01/12/2010 8:26 | 2.55 | 17850.0 | United Kingdom |
| 536365 | 71053 | WHITE METAL LANTERN | 6 | 01/12/2010 8:26 | 3.39 | 17850.0 | United Kingdom |
| 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 01/12/2010 8:26 | 2.75 | 17850.0 | United Kingdom |
| 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 01/12/2010 8:26 | 3.39 | 17850.0 | United Kingdom |
| 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 01/12/2010 8:26 | 3.39 | 17850.0 | United Kingdom |
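For orientation, the MapReduce logic in src/ follows the standard pattern: the mapper keys each transaction row by CustomerID, and the reducer folds one customer's rows into TotalSpend, Frequency, and Recency. The sketch below is illustrative only; the class names, the column order of the preprocessed CSV, the date pattern, and the reference date are assumptions rather than the exact code in src/.

```java
import java.io.IOException;
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Illustrative mapper: one input line = one transaction of the preprocessed CSV.
// Assumed (hypothetical) column order: CustomerID,Quantity,UnitPrice,InvoiceDate.
class SegmentationMapper extends Mapper<LongWritable, Text, Text, Text> {
    private static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("d/M/yyyy H:mm"); // assumed date pattern

    @Override
    protected void map(LongWritable key, Text value, Context ctx)
            throws IOException, InterruptedException {
        String[] f = value.toString().split(",");
        if (f.length < 4 || f[0].isEmpty() || f[0].equals("CustomerID")) {
            return; // skip the header and rows without a CustomerID
        }
        try {
            double lineTotal = Double.parseDouble(f[1]) * Double.parseDouble(f[2]);
            long purchaseDay = LocalDateTime.parse(f[3], FMT).toLocalDate().toEpochDay();
            // Emit (customer, "spend,purchaseDay"); the reducer aggregates per customer.
            ctx.write(new Text(f[0]), new Text(lineTotal + "," + purchaseDay));
        } catch (RuntimeException e) {
            // Skip malformed rows (bad numbers or dates) instead of failing the job.
        }
    }
}

// Illustrative reducer: folds one customer's transactions into the three metrics.
class SegmentationReducer extends Reducer<Text, Text, Text, Text> {
    // Assumed Recency anchor: the day after the last invoice in the dataset (10/12/2011).
    private static final long REFERENCE_DAY = LocalDate.of(2011, 12, 10).toEpochDay();

    @Override
    protected void reduce(Text customer, Iterable<Text> values, Context ctx)
            throws IOException, InterruptedException {
        double totalSpend = 0.0;
        long frequency = 0;
        long lastPurchaseDay = Long.MIN_VALUE;
        for (Text v : values) {
            String[] parts = v.toString().split(",");
            totalSpend += Double.parseDouble(parts[0]);
            frequency++; // counts transaction rows; distinct invoices would need the InvoiceNo
            lastPurchaseDay = Math.max(lastPurchaseDay, Long.parseLong(parts[1]));
        }
        long recencyDays = REFERENCE_DAY - lastPurchaseDay;
        ctx.write(customer, new Text(totalSpend + "," + frequency + "," + recencyDays));
    }
}
```

In the repository these classes presumably live in separate files under src/, together with the Driver used to launch the job.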
To get started with this project, follow these steps:
- Install Hadoop on your local machine or use a cloud-based Hadoop cluster.
- Ensure that Java (JDK 8 or later) is installed on your system.
- Install Python and the required libraries:
  pip install pandas matplotlib seaborn scipy
- Download the Online Retail Dataset.
- Clone this repository:
  git clone https://github.com/chouaib-629/CustomerSegmentation.git
- Navigate to the project directory:
  cd CustomerSegmentation
- The downloaded dataset is in Online Retail.xlsx format. Save it as online_retail.csv using any spreadsheet tool.
- Preprocess the dataset using the provided Python script:
  - .py format: python preprocessing/main.py
  - .ipynb format: open and run preprocessing/main.ipynb in Jupyter Notebook.
- Compile the Java classes (create the compiled_classes output directory first if it does not already exist):
  javac -classpath `hadoop classpath` -d compiled_classes src/*.java
- Package the classes into a JAR file:
  jar cf CustomerSegmentation.jar -C compiled_classes/ .
- Create a directory in HDFS to store the dataset:
  hdfs dfs -mkdir /CustomerSegmentation
- Upload the preprocessed dataset to HDFS:
  hdfs dfs -put processed_online_retail.csv /CustomerSegmentation/
Run the Hadoop job using the following command:
hadoop jar CustomerSegmentation.jar Driver /CustomerSegmentation/processed_online_retail.csv /CustomerSegmentation/output/
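Here Driver is the job's entry point. A minimal sketch of how such a driver is typically wired, reusing the hypothetical SegmentationMapper and SegmentationReducer class names from the earlier sketch (the actual classes and key/value types in src/ may differ):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Illustrative driver: args[0] = input CSV in HDFS, args[1] = output directory.
public class Driver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "customer segmentation");
        job.setJarByClass(Driver.class);
        job.setMapperClass(SegmentationMapper.class);   // assumed class name (see sketch above)
        job.setReducerClass(SegmentationReducer.class); // assumed class name
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Note that Hadoop refuses to start a job whose output directory already exists, so remove /CustomerSegmentation/output/ (hdfs dfs -rm -r /CustomerSegmentation/output) before re-running.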
To view the results of the MapReduce job, use the following command:
hdfs dfs -cat /CustomerSegmentation/output/part-r-00000
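Each line of part-r-00000 holds one customer's aggregated metrics. With a reducer like the sketch above, a line would contain the CustomerID, a tab, and then TotalSpend,Frequency,Recency; the exact field order and separators depend on the reducer actually shipped in src/.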
Copy the output file from HDFS to your local storage for further analysis:
hdfs dfs -get /CustomerSegmentation/output/part-r-00000 output/result.csv
Contributions are welcome! To contribute:
- Fork the repository.
- Create a new branch:
  git checkout -b feature/feature-name
- Commit your changes:
  git commit -m "Add feature description"
- Push to the branch:
  git push origin feature/feature-name
- Open a pull request.
For questions or support, please contact me.