OjeWilliams/Customer-Segmentation


Customer-Segmentation

An analysis of customer data using unsupervised learning methods such as K-Means, Hierarchical Clustering and Principal Component Analysis

Introduction

Customer segmentation is the practice of dividing a company's customers into groups whose members share similar characteristics. The objective of segmenting the customers is to determine how to relate to the customers in each segment so that we can maximize the value of each customer to the business. Ideally this allows us to tailor our approach to each customer group based on their interests, demographic profile, or even preferred method of communication and interaction.

In this notebook we seek to identify customer segments so that we can deliver insights about our data through clustering methods. These insights can then be used to drive business decisions in tailoring approaches to the different customer bases/types. The data was obtained from Kaggle and includes features such as Customer ID, Age, Gender, Annual Income and Spending Score (which was assigned based on the internal metrics of the company).

The full breakdown of everything that was explored can be found here in the Jupyter notebook.

Setting up the Project

Importing Data and Libraries

  1. Load libraries and obtain data from Kaggle.
  2. Store data in an appropriate dataframe.
  3. Set a random seed for reproducibility.
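The setup steps above might look like the following sketch. The column names and the seed value are assumptions based on the features listed in the introduction, and a small stand-in DataFrame is built inline (the actual notebook reads the Kaggle CSV):

```python
# Illustrative setup; the column names and seed are assumptions,
# and the inline DataFrame stands in for pd.read_csv("...") from Kaggle.
import numpy as np
import pandas as pd

RANDOM_SEED = 42  # hypothetical seed, chosen for reproducibility
np.random.seed(RANDOM_SEED)

# Stand-in for the Kaggle customer data so this sketch is self-contained.
df = pd.DataFrame({
    "CustomerID": range(1, 9),
    "Gender": ["Male", "Female", "Female", "Male",
               "Female", "Male", "Female", "Male"],
    "Age": [19, 21, 20, 23, 31, 22, 35, 30],
    "Annual Income": [15, 15, 16, 16, 17, 17, 18, 18],
    "Spending Score": [39, 81, 6, 77, 40, 76, 6, 94],
})
print(df.head())
```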

Part 1 - Exploratory Data Analysis

In this section we will investigate the data, cleaning and modifying it where necessary to make it easier to manipulate, and see what insights we can find.
You can also look at the output file of pandas-profiling here, and you can download the file here (it is interactive once downloaded).

Below we can see the first 10 entries of the dataset and confirm that there are no missing values.

Some of the interesting things found:

  • Gender percentages



  • Distribution Plots



  • Mean Spending Score



  • Median Annual Income



Part 2 - Data Pre-processing

In this section we prepare the data to be used in our models.

Dealing with categorical data columns

Here we have two issues to address: we have to transform our categorical data values into numerical values, and then we have to scale our data. Categorical variables are those that are labels rather than numeric values, such as 'color' or 'place/location', or in our case gender, i.e., male or female. We need to address these categorical values because some of the methods below do not work well with them, so they must be converted to numerical values. Scaling the data means normalizing the range of its features, as a means to combat the large range of values in the raw data.

We addressed both tasks as seen below:
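A minimal sketch of these two pre-processing steps, assuming the column names from the introduction (the notebook's exact code may differ, e.g. it may use `LabelEncoder` instead of `map`):

```python
# Sketch of the two pre-processing steps: encode Gender, then scale.
# Column names and the 0/1 encoding are assumptions for illustration.
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Age": [19, 21, 20, 23],
    "Annual Income": [15, 15, 16, 16],
    "Spending Score": [39, 81, 6, 77],
})

# 1. Encode the categorical Gender column as numeric 0/1 values.
df["Gender"] = df["Gender"].map({"Male": 0, "Female": 1})

# 2. Scale every feature to zero mean and unit variance.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df)
```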

Part 3 - Build and Evaluate Models

K-Means Clustering

K-Means clustering is a method of vector quantization, originally from signal processing, that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean (cluster centers or cluster centroid), serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells. K-means clustering minimizes within-cluster variances (squared Euclidean distances), but not regular Euclidean distances, which would be the more difficult Weber problem: the mean optimizes squared errors, whereas only the geometric median minimizes Euclidean distances. For instance, better Euclidean solutions can be found using k-medians and k-medoids.
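The idea above can be shown with a few lines of scikit-learn on toy two-dimensional data (not the actual customer features):

```python
# Minimal K-Means sketch: two well-separated blobs, k=2.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)),   # blob near (0, 0)
               rng.normal(5, 0.5, (20, 2))])  # blob near (5, 5)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_            # cluster assignment for each observation
centers = km.cluster_centers_  # the cluster means (centroids)
inertia = km.inertia_          # within-cluster sum of squared distances
```

Each observation is assigned to the centroid nearest to it, and `inertia_` is exactly the within-cluster squared-distance objective the passage describes.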

Determining the Ideal number of Clusters in the dataset

When determining the number of clusters to use, one important method is the Elbow Curve Method. This method involves plotting the explained variance (sum of squared distances) against the number of clusters (k) and choosing the number of clusters at which the 'elbow' occurs. The elbow is the cutoff point at which increasing the number of clusters is no longer worth it. We also utilized the Silhouette Score to help with selection.
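The elbow and silhouette comparison described above can be sketched as a loop over candidate values of k, here on toy data with three obvious blobs rather than the real customer features:

```python
# Elbow curve + silhouette score sketch over k = 2..6 on toy data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.4, (15, 2)) for c in (0, 4, 8)])  # 3 blobs

inertias, silhouettes = {}, {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_                        # elbow-curve y-axis
    silhouettes[k] = silhouette_score(X, km.labels_)

# Plotting inertias vs. k would show the elbow; the silhouette score
# offers a complementary numeric criterion.
best_k = max(silhouettes, key=silhouettes.get)
```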

I had a choice between two possible cluster numbers, and I chose to build models based on 6 rather than 3 clusters.

  • 6 Cluster Comparison Plots


  • 3D plot of 6 Cluster Model


  • Cluster Summary


Hierarchical Clustering

In data mining and statistics, hierarchical clustering (also called hierarchical cluster analysis or HCA) is a method of cluster analysis which seeks to build a hierarchy of clusters. In other words, it is an algorithm that groups similar objects into groups called clusters. There are two main types:

Agglomerative: This is a "bottom-up" approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.

Divisive: This is a "top-down" approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.

In this notebook we used the agglomerative approach and a dendrogram to visualize the results.
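The agglomerative approach can be sketched with SciPy's hierarchical-clustering utilities, which is also what produces the dendrograms below (toy data; Ward linkage is an assumption, as the README does not state the linkage method):

```python
# Agglomerative clustering sketch: build a linkage matrix, then cut the tree.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.3, (10, 2)),
               rng.normal(3, 0.3, (10, 2))])

# Bottom-up merges: each row of Z records one pairwise cluster merge.
Z = linkage(X, method="ward")

# scipy.cluster.hierarchy.dendrogram(Z) would draw the tree visualized below.
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
```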

  • Total dendrogram

  • Partial Dendrogram (suggesting 7 clusters)

Principal Component Analysis

Principal component analysis (PCA) is one of the most popular unsupervised machine learning techniques. It is primarily used as a dimensionality reduction approach, with applications in data visualization and feature extraction, allowing for increased interpretability while minimizing information loss. In dimensionality reduction, the goal of the inferred model is to represent the input data with the fewest number of dimensions while still retaining the information, such as the variability in the data, that is relevant to the investigation. The new components generated are linear combinations of the existing features, and these components explain the maximum variance in the model.
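A minimal PCA sketch on toy data (not the customer dataset) shows the two things the passage describes: projecting onto fewer dimensions and checking how much variance each component explains:

```python
# PCA sketch: 4 correlated features reduced to 2 components.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
base = rng.normal(size=(100, 2))  # two latent factors
# Four observed features built from the two factors, plus a little noise.
X = np.hstack([base, base @ rng.normal(size=(2, 2))])
X = X + rng.normal(scale=0.05, size=(100, 4))

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)                # data in component space
explained = pca.explained_variance_ratio_  # variance fraction per component
```

Because the toy features are built from two latent factors, the first two components recover almost all of the variance; on the real customer data the README reports about 67% for two components.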

  • PCA Summary

Conclusion

From our investigation into this data set using unsupervised machine learning methods we found that:

  • K-Means Clustering initially suggested that we can segment our customers into 3 or 6 groups.

  • Hierarchical Clustering suggested that we can segment our customers into 7 groups.

  • Principal Component Analysis returned 2 components that explained 67% of the variance in our data. Using these two components we were able to build a better K-Means model that suggested our customers could be put into 4 groups.
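The PCA-then-K-Means combination from the last bullet can be sketched as a two-step pipeline on toy data (k=4 as in the conclusion; the data and parameters here are illustrative, not the notebook's):

```python
# Sketch of the PCA -> K-Means idea: project to 2 components, cluster there.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(4)
centers = np.array([[0, 0, 0], [5, 0, 0], [0, 5, 0], [5, 5, 0]])
X = np.vstack([rng.normal(c, 0.4, (15, 3)) for c in centers])  # 4 blobs

model = make_pipeline(PCA(n_components=2),
                      KMeans(n_clusters=4, n_init=10, random_state=0))
labels = model.fit_predict(X)  # cluster assignments in PCA space
```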

From the 6-cluster model I was able to create profiles describing each group in detail, while also providing insights into what might appeal to each group. This can be found in the notebook.
