Segment customers step by step

In a previous post I wrote about neural networks. Neural networks, like all other "supervised" methods, are used when you have a sample that already contains the value you want to predict. When you know what you want to achieve but have no such sample, "unsupervised" methods are used instead.

A classic problem where this kind of method is applied is customer segmentation, where the segments/groups are not known in advance. Among these methods, one of the best known is K‑Means.

K‑Means is an algorithm used to find groups of individuals with similar characteristics. Similarity is measured by the Euclidean distance between their numerical attributes: the smaller the distance, the more alike two individuals are.
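To make the idea of Euclidean distance concrete, here is a minimal sketch in R. The two customers and their values are made up for illustration; they are not part of the sample used in this article.

```r
# Two hypothetical customers described by (minutes, MB); the values are made up
a <- c(min = 10, mb = 100)
b <- c(min = 13, mb = 104)

# Euclidean distance: square root of the sum of squared differences
d_manual <- sqrt(sum((a - b)^2))

# Base R's dist() uses the Euclidean distance by default
d_dist <- as.numeric(dist(rbind(a, b)))

print(d_manual)  # 5
print(d_dist)    # 5
```

The differences are 3 minutes and 4 MB, so the distance is sqrt(9 + 16) = 5.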

Below we will follow a step‑by‑step example of how to segment customers in a simplified case.

For this example, we will generate a random sample of 120 customers, each described by two dimensions: data traffic (megabytes) and voice traffic (minutes). Each customer has been simulated so that it follows the behavior pattern of one of four groups.

With K‑Means we will try to find these four groups, which do exist in the data but which we treat as unknown.

Below is a plot of the four groups:

Real Problem:

We are the data scientist at a prestigious mobile phone company (the writer’s personal experience) and we have been tasked with segmenting/classifying customers (again, nothing far from the writer’s reality). Reviewing the database, we realize that we only have the monthly data and voice consumption of each customer (there could be many more variables: top‑ups, text messages, calls to customer service, etc.).

Since the scales of data consumption (MB) and voice consumption (minutes) are very different, and K‑Means measures similarity by Euclidean distance between cases, we will scale the values between 0 and 1, obtaining the following data:
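A minimal sketch of this scaling step, assuming (as the code in the appendix does) that each column is simply divided by its maximum. The three customers below are made up for illustration.

```r
# Hypothetical customers: minutes and MB per month (made-up values)
clientes <- data.frame(min = c(0, 150, 75), mb = c(1000, 0, 500))

# Divide each column by its maximum; with non-negative values
# this maps every column into the range [0, 1]
escalado <- as.data.frame(lapply(clientes, function(x) x / max(x)))
print(escalado)

# Distances between customers before and after scaling:
# before scaling the MB column dominates, afterwards both columns weigh the same
print(dist(clientes))
print(dist(escalado))
```

Note that dividing by the maximum only lands exactly on the 0 to 1 range when the smallest value is 0; otherwise the result lies between min/max and 1.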

Now we are ready to begin. K‑Means does not determine the number of groups by itself; the number has to be given as an input. So the algorithm is run for several different numbers of groups and the results of each case are compared. The choice of the number of groups lies somewhere between mathematics and judgment.

We will run K‑Means from 1 to 6 groups and use both the mathematical method and the "judgment" method to determine the number of groups.

The first thing to look at is the level of fit for each case: the percentage of variance explained by the segmentation, calculated as between_SS / total_SS (I won’t go into detail). When adding a group no longer improves the fit significantly, the new group is very similar to an existing one, so the two should be treated as a single group.
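For the curious, the ratio can be checked by hand. The sketch below uses a small made-up data set (not the article's sample) and compares a manual calculation with the betweenss and totss values returned by kmeans(); the seed and the nstart value are arbitrary choices for this illustration.

```r
set.seed(1)  # arbitrary seed, only so that this sketch is reproducible

# Hypothetical two-dimensional data: two well-separated clouds of 20 points
x <- rbind(matrix(rnorm(40, mean = 0, sd = 0.1), ncol = 2),
           matrix(rnorm(40, mean = 1, sd = 0.1), ncol = 2))

modelo <- kmeans(x, centers = 2, nstart = 10)

# Total sum of squares: squared distances of every point to the overall mean
total_ss <- sum(scale(x, center = TRUE, scale = FALSE)^2)

# Within-group sum of squares: squared distances to each point's own center
within_ss <- sum((x - modelo$centers[modelo$cluster, ])^2)

# What remains is the between-group part; its share is the level of fit
ajuste <- (total_ss - within_ss) / total_ss

print(ajuste)
print(modelo$betweenss / modelo$totss)  # same value, as reported by kmeans()
```

A value close to 1 means the groups account for almost all of the variation in the data; with a single group the ratio is 0.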

Below is the fit plot for each case:

It seems that using more than 4 groups does not make sense. Now let’s compare the aggregated values of each group: we will look at their average values and the number of elements in each group.
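As an illustration of this aggregation, here is a small sketch in base R. The appendix does the same job with ddply from the plyr package; the six customers and their cluster labels below are made up.

```r
# Hypothetical customers already labeled with a cluster (made-up values)
datos <- data.frame(
  min     = c(5, 7, 150, 146, 12, 14),
  mb      = c(990, 998, 20, 26, 100, 122),
  cluster = c("1", "1", "2", "2", "3", "3")
)

# Average minutes and MB per cluster
medias <- aggregate(cbind(min, mb) ~ cluster, data = datos, FUN = mean)

# Number of customers per cluster, matched by cluster label
medias$number <- as.vector(table(datos$cluster)[medias$cluster])

print(medias)
```

Each row of the result describes one group by its average minutes, average MB, and size, which is the same layout as the tables that follow.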

Now we will proceed with the non‑numerical analysis, which consists of naming the groups. When two groups cannot be differentiated, it means there is no point in separating them (just out of curiosity, we will compare them with the "real" groups).

1 Group

Group                Avg. min   Avg. MB   Customers
1) Average of all          71       447         120

2 Groups

Group              Avg. min   Avg. MB   Customers
1) Few Minutes            9       538          60
2) Many Minutes         134       356          60

3 Groups

Group                          Avg. min   Avg. MB   Customers
1) Low Traffic                       13       111          31
2) Many Minutes                     134       356          60
3) Many MBs and few Minutes           5       994          29

4 Groups

Group              Avg. min   Avg. MB   Customers
1) Many MBs               5       994          29
2) Many Minutes         148        23          30
3) Low Traffic           13       111          31
4) High Traffic         120       690          30

5 Groups

Group                                    Avg. min   Avg. MB   Customers
1) High Traffic                               111       699          22
2) High Traffic, especially minutes*          143       666           8
3) Many MBs                                     5       994          29
4) Many Minutes                               148        23          30
5) Low Traffic                                 13       111          31

* Groups 1 and 2 are very similar, and group 2 has very few customers; it makes no sense to separate them.

6 Groups

Group                                 Avg. min   Avg. MB   Customers
1) High Minutes Consumption                148        23          30
2) Very High MB Consumption                  3      1172          10
3) Low Consumption, a bit more MBs           9       288          10
4) High MB Consumption                       6       917          18
5) Low Consumption                          13        51          22
6) High Consumption                        120       690          30

Groups 2, 3 and 4 are very similar, basically characterized by customers who use MBs but very few minutes.

Therefore, there are 4 groups, which we will name:

Group               Avg. min   Avg. MB   Customers
MB User                    5       994          29
Minutes User             148        23          30
Low Consumption           13       111          31
High Consumption         120       690          30

Now that we have the segmentation, let’s compare the original groups with the ones we obtained. In the following plot, the color represents the group found by K‑Means and the shape of the point represents the original group to which the customer belonged (according to the generated data):

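Indeed, K‑Means recovered the initially generated groups almost exactly: the groups it found have 29, 30, 31 and 30 customers, against the 30 customers generated for each original group.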
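Besides the plot, a cross-tabulation is a quick way to make this comparison. The sketch below is self-contained and uses a simplified, hypothetical stand-in for the simulated sample: the group centers, the spread and the nstart value are assumptions, and the groups are cleaner than the ones in this article.

```r
set.seed(1984)  # same seed value as the appendix, but the data are a stand-in

# Hypothetical centers of four groups, already on a 0 to 1 scale (minutes, MB)
centros <- rbind(bajo  = c(0.10, 0.10),
                 alto  = c(0.80, 0.70),
                 datos = c(0.05, 0.95),
                 voz   = c(0.95, 0.05))

# 30 customers per group, scattered around the center of their group
grupo <- rep(rownames(centros), each = 30)
x <- centros[grupo, ] + matrix(rnorm(240, sd = 0.03), ncol = 2)

# Several random starts make a poor local solution less likely
modelo <- kmeans(x, centers = 4, nstart = 25)

# Rows: original group; columns: cluster number assigned by K-Means
tabla <- table(original = grupo, kmeans = modelo$cluster)
print(tabla)
```

The cluster numbers are arbitrary labels, so what matters is whether each row concentrates in a single column. Any customer that appears outside the main column of its row was assigned to a different group than the one it was generated from.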

A corollary: this is how a mobile operator redefined its products. We realized that the competition had products aimed at the high‑consumption and low‑consumption segments, while we had customers who mainly used either data or voice. So we decided to focus on these segments, which had no competition.

Greetings!

Appendix: The R code used for the analysis:

set.seed(1984)
library(ggplot2)
library(plyr)
library(xtable)

# Generate sample
muestra = rbind(
  data.frame(min = rnorm(30,10,10), mb = rnorm(30,100,100), grupo = "bajo"), # Customers with low consumption
  data.frame(min = rnorm(30,120,20), mb = rnorm(30,700,100), grupo = "alto"), # Customers with high consumption
  data.frame(min = rnorm(30,3,10), mb = rnorm(30,1000,200), grupo = "datos"), # Customers with high data consumption
  data.frame(min = rnorm(30,150,20), mb = rnorm(30,15,60), grupo = "voz") # Customers with high voice consumption
)
# Replace negative simulated values with 0 (consumption cannot be negative)
muestra[,1:2] = apply(muestra[,1:2],2,function(x) ifelse(x<0,0,x))
qplot(min,mb,data=muestra,xlab = "Monthly Minutes", ylab = "Monthly MB", color = grupo)

# Normalization of data
muestra = transform(muestra, n_min = min/max(min), n_mb = mb/max(mb))
qplot(n_min,n_mb,data=muestra,xlab = "Normalized Monthly Minutes", ylab = "Normalized Monthly MB")

# Run the model for 6 cases
codo = data.frame()
set.seed(2016)
for(grupos in 1:6){
  modelo = kmeans(muestra[,4:5],grupos, iter.max = 100)
  codo = rbind(codo,
               data.frame(grupos = grupos,
                          between_SS = modelo$betweenss,
                          total_ss = modelo$totss,
                          tot.withinss = modelo$tot.withinss,
                          value =  modelo$betweenss/modelo$totss)
               )
  muestra[,paste0("kmeans_",grupos)] = as.character(modelo$cluster)
}

# Plot of the fit level
qplot(x = grupos, y = value, data = codo, geom="line", ylab = "Percentage of Fit", xlab = "Number of groups")

# Summary for each model
resumen = data.frame()
for(n in 1:6){
  tabla = ddply(muestra, paste0("kmeans_",n), function(x) data.frame(min = mean(x$min), mb = mean(x$mb), numero = nrow(x)) )
  colnames(tabla)[1] = "grupo_numero"
  resumen = rbind(resumen,
                  data.frame(numero_de_grupos = n, tabla)
                  )
  print(xtable(tabla,digits = 0), type="HTML", include.rownames=FALSE)
}
print(xtable(resumen,digits = 0), type="HTML", include.rownames=FALSE)

# Plot of the 4 groups found by K-Means (color) against the original groups (shape)
qplot(min,mb,data=muestra,xlab = "Monthly Minutes", ylab = "Monthly MB", color = kmeans_4, shape = grupo)
