Segment customers step by step

Previously I wrote about neural networks (click here to see it). Neural networks, like all other "supervised methods", are used when you have a sample with known values of the variable you want to predict. When you know what you want to achieve but have no such sample, the so‑called "unsupervised methods" are used instead.

A classic problem where this kind of method is applied is customer segmentation, where the segments/groups are not known in advance. Among the methods, one of the most famous is K‑Means.

K‑Means is an algorithm that finds groups of individuals with similar characteristics. Similarity is measured by the Euclidean distance between their numerical attributes: the smaller the distance, the more alike two individuals are.
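As a minimal sketch of what "Euclidean distance" means here, the snippet below measures the distance between two customers and then assigns a third customer to the nearest of two group centers, which is the kind of assignment K‑Means repeats internally. The customers, the centers and the helper `euclid` are all made up for illustration.

```r
# Hypothetical customers described by (minutes, megabytes).
a <- c(min = 10, mb = 100)
b <- c(min = 13, mb = 104)

# Euclidean distance: square root of the sum of squared differences.
euclid <- function(x, y) sqrt(sum((x - y)^2))
d <- euclid(a, b)   # differences are 3 and 4, so the distance is 5

# Hypothetical group centers (one per row) and a new customer.
centros <- rbind(bajo = c(min = 10, mb = 100),
                 voz  = c(min = 150, mb = 15))
cliente <- c(min = 20, mb = 90)

# Distance from the customer to every center; the nearest center wins.
distancias <- apply(centros, 1, euclid, y = cliente)
grupo_asignado <- names(which.min(distancias))
```

Here the customer lands in the "bajo" group because that center is far closer than the "voz" one.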

Below we will follow a step‑by‑step example of how to segment customers in a simplified case.

For this example we will generate a random sample in which each case (customer) has two dimensions: data traffic (megabytes) and voice traffic (minutes). Each case has been simulated so that it follows the behavior pattern of one of four groups.

With K‑Means we will try to find these four groups, which do exist but are unknown to us.

Below is a plot of the four groups:

Real Problem:

We are the Data Scientist of a prestigious mobile phone company (personal experience of the writer) and we are tasked with segmenting/classifying customers (again, nothing far from the writer’s reality). When reviewing the database, we realize that we only have the monthly data and voice consumption of each customer (there could be many more: top‑ups, text messages, calls to customer service, etc.).

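Since the scales of data consumption (MB) and voice consumption (minutes) are very different, and K‑Means measures similarity by Euclidean distance between cases, the variable with the larger numbers would dominate the distance. To avoid this, we will scale the values between 0 and 1 by dividing each variable by its maximum, obtaining the following data: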
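To see why the scaling matters, here is a small sketch with three invented customers. It uses the same divide-by-the-maximum scaling as the appendix; the customers and the variable names `clientes`, `escalado`, `d_raw` and `d_scaled` are assumptions made for this illustration only.

```r
# Three hypothetical customers: A and B differ only in minutes,
# A and C differ only in megabytes.
clientes <- data.frame(min = c(10, 150, 10),
                       mb  = c(500, 500, 700),
                       row.names = c("A", "B", "C"))

# Distances on the raw scales: megabytes dominate because the numbers are larger.
d_raw <- as.matrix(dist(clientes))

# Scale each variable to 0-1 by dividing by its maximum.
escalado <- data.frame(n_min = clientes$min / max(clientes$min),
                       n_mb  = clientes$mb  / max(clientes$mb),
                       row.names = rownames(clientes))
d_scaled <- as.matrix(dist(escalado))

# Raw:    A is closer to B (140) than to C (200).
# Scaled: A is closer to C, since going from 10 to 150 minutes
#         covers almost the whole range of the minutes variable.
d_raw["A", c("B", "C")]
d_scaled["A", c("B", "C")]
```

On the raw scale, a customer who talks fifteen times more looks "closer" than one who uses a bit more data. After scaling, the ordering flips.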

Now we are ready to begin. Since K‑Means does not determine the number of groups, it is run for different numbers and the results of each case are compared. The choice of the number of groups lies somewhere between mathematics and judgment.

We will run K‑Means from 1 to 6 groups and use both the mathematical method and the "judgment" method to determine the number of groups.

The first thing to observe is the level of fit for each case. This is the proportion of the total variance explained by each segmentation, calculated as between_SS / total_SS (I won’t go into detail). When adding a group no longer improves the fit significantly, it means the new group is similar to an existing one, so the two should be a single group.
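For readers who do want a little detail, the sketch below recomputes the fit by hand on a toy sample and compares it with the values that `kmeans()` returns. The toy data and the names `total_ss`, `within_ss`, `between_ss` and `ajuste` are assumptions for the illustration, not part of the original analysis.

```r
set.seed(1)
# Toy sample: two hypothetical groups of 20 points each, in two dimensions.
x <- rbind(
  matrix(rnorm(40, mean = 0.2, sd = 0.05), ncol = 2),
  matrix(rnorm(40, mean = 0.8, sd = 0.05), ncol = 2)
)
modelo <- kmeans(x, centers = 2, nstart = 10)

# Total sum of squares: squared distances to the overall mean.
total_ss <- sum(scale(x, center = TRUE, scale = FALSE)^2)

# Within sum of squares: squared distances to each point's own group center.
within_ss <- sum((x - modelo$centers[modelo$cluster, ])^2)

# What the grouping explains is whatever is left over.
between_ss <- total_ss - within_ss
ajuste <- between_ss / total_ss

# Should match the ratio reported by kmeans itself.
all.equal(ajuste, modelo$betweenss / modelo$totss)
```

A fit close to 1 means the points sit very near their group centers. A fit of 0 is what a single group gives.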

Below is the fit plot for each case:

It seems that using more than 4 groups does not make sense. Now let’s compare the aggregated values of each group: the tables below show the average monthly minutes (min), the average monthly megabytes (mb) and the number of customers (number) in each group.
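One way to make the "no longer improves significantly" reading less subjective is to tabulate how much fit each extra group adds. The following is only a sketch on a toy sample with four well-separated groups. The 0.05 threshold (`umbral`) is an arbitrary assumption, not a rule from the analysis above, and judgment is still needed.

```r
set.seed(42)
# Toy sample: four hypothetical groups of 30 points around these centers.
centros <- rbind(c(0.2, 0.2), c(0.2, 0.8), c(0.8, 0.2), c(0.8, 0.8))
x <- do.call(rbind, lapply(1:4, function(i) {
  cbind(rnorm(30, centros[i, 1], 0.05), rnorm(30, centros[i, 2], 0.05))
}))

# Fit (between_SS / total_SS) for 1 to 6 groups.
ajuste <- sapply(1:6, function(k) {
  m <- kmeans(x, centers = k, nstart = 20, iter.max = 100)
  m$betweenss / m$totss
})

# ganancia[k] is the fit gained by going from k to k + 1 groups.
ganancia <- diff(ajuste)

# First k for which one more group adds less than the (arbitrary) threshold.
umbral <- 0.05
k_elegido <- which(ganancia < umbral)[1]

data.frame(grupos = 1:6, ajuste = round(ajuste, 3),
           ganancia = round(c(NA, ganancia), 3))
```

With groups this well separated, the gains should flatten after four groups. On real data the curve is rarely that clean, which is why the naming exercise that follows is still worth doing.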

Now we will proceed with the non‑numerical analysis, which consists of naming the groups. When two groups cannot be differentiated, it means there is no point in separating them (just out of curiosity, we will compare them with the "real" groups).

1 Group

group_number	min	mb	number
1) Average of all	71	447	120

2 Groups

group_number	min	mb	number
1) Few Minutes	9	538	60
2) Many Minutes	134	356	60

3 Groups

group_number	min	mb	number
1) Low Traffic	13	111	31
2) Many Minutes	134	356	60
3) Many MBs and few Minutes	5	994	29

4 Groups

group_number	min	mb	number
1) Many MBs	5	994	29
2) Many Minutes	148	23	30
3) Low Traffic	13	111	31
4) High Traffic	120	690	30

5 Groups

group_number	min	mb	number
1) High Traffic	111	699	22
2) High Traffic, especially minutes*	143	666	8
3) Many MBs	5	994	29
4) Many Minutes	148	23	30
5) Low Traffic	13	111	31

Groups 1 and 2 are very similar and group 2 has very few customers; it makes no sense to separate them.

6 Groups

group_number	min	mb	number
1) High Minutes Consumption	148	23	30
2) Very High MB Consumption	3	1172	10
3) Low Consumption, a bit more MBs	9	288	10
4) High MB Consumption	6	917	18
5) Low Consumption	13	51	22
6) High Consumption	120	690	30

Groups 2 and 4 are very similar: both are customers who use many MBs but very few minutes. Likewise, groups 3 and 5 are both low‑consumption customers, so these extra splits add nothing.

Therefore, there are 4 groups, which we will name:

group_number	min	mb	number
MB User	5	994	29
Minutes User	148	23	30
Low Consumption	13	111	31
High Consumption	120	690	30

Now that we have the segmentation, let’s compare the groups found with the original ones. In the following plot, the color represents the group found by K‑Means and the shape of the point represents the original group to which the customer belonged (according to the generated data):

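Indeed, K‑Means recovered the four generated groups almost exactly: the segments found contain 29, 30, 31 and 30 customers, against 30 customers per group in the generated sample.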
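The same comparison can be done numerically with a cross table of original group against K‑Means group. This is a sketch on a fresh toy sample, not the data from this article. The centers, the 0.05 spread and the "purity" summary (`pureza`) are assumptions added for illustration.

```r
set.seed(1984)
# Toy sample: four hypothetical groups of 30 customers, already on a 0-1 scale.
centros <- rbind(bajo  = c(0.10, 0.10), alto = c(0.80, 0.70),
                 datos = c(0.05, 0.95), voz  = c(0.95, 0.05))
muestra <- do.call(rbind, lapply(rownames(centros), function(g) {
  data.frame(n_min = rnorm(30, centros[g, 1], 0.05),
             n_mb  = rnorm(30, centros[g, 2], 0.05),
             grupo = g)
}))
modelo <- kmeans(muestra[, c("n_min", "n_mb")], centers = 4, nstart = 20)

# Rows: original group; columns: group found by K-Means.
tabla <- table(original = muestra$grupo, kmeans = modelo$cluster)
tabla

# Share of customers that belong to the majority original group of their cluster.
pureza <- sum(apply(tabla, 2, max)) / sum(tabla)
pureza
```

The cluster numbers K‑Means assigns are arbitrary, so what matters is that each column is concentrated in a single row, not which number goes with which group.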

A corollary: this is how a mobile operator redefined its products. We realized that the competition had products aimed at the high‑consumption and low‑consumption segments, while we had customers who mainly used either data or voice. So we decided to focus on these two segments, which had no competition.

Greetings!

Appendix: The R code used for the analysis:

set.seed(1984)
library(ggplot2)
library(plyr)
library(xtable)

# Generate sample
muestra = rbind(
  data.frame(min = rnorm(30,10,10), mb = rnorm(30,100,100), grupo = "bajo"), # Customers with low consumption
  data.frame(min = rnorm(30,120,20), mb = rnorm(30,700,100), grupo = "alto"), # Customers with high consumption
  data.frame(min = rnorm(30,3,10), mb = rnorm(30,1000,200), grupo = "datos"), # Customers with high data consumption
  data.frame(min = rnorm(30,150,20), mb = rnorm(30,15,60), grupo = "voz") # Customers with high voice consumption
)
# Negative simulated consumption makes no sense: clip it to zero
muestra[,1:2] = apply(muestra[,1:2],2,function(x) ifelse(x<0,0,x))
qplot(min,mb,data=muestra,xlab = "Monthly Minutes", ylab = "Monthly MB", color = grupo)

# Normalization of data
muestra = transform(muestra, n_min = min/max(min), n_mb = mb/max(mb))
qplot(n_min,n_mb,data=muestra,xlab = "Normalized Monthly Minutes", ylab = "Normalized Monthly MB")

# Run K-Means for 1 to 6 groups on the normalized columns
codo = data.frame()
set.seed(2016)
for(grupos in 1:6){
  modelo = kmeans(muestra[,c("n_min","n_mb")],grupos, iter.max = 100)
  codo = rbind(codo,
               data.frame(grupos = grupos,
                          between_SS = modelo$betweenss,
                          total_ss = modelo$totss,
                          tot.withinss = modelo$tot.withinss,
                          value =  modelo$betweenss/modelo$totss)
               )
  muestra[,paste0("kmeans_",grupos)] = as.character(modelo$cluster)
}

# Plot of the fit level
qplot(x = grupos, y = value, data = codo, geom="line", ylab = "Percentage of Fit", xlab = "Number of groups")

# Summary for each model
resumen = data.frame()
for(n in 1:6){
  tabla = ddply(muestra, paste0("kmeans_",n), function(x) data.frame(min = mean(x$min), mb = mean(x$mb), numero = nrow(x)) )
  colnames(tabla)[1] = "grupo_numero"
  resumen = rbind(resumen,
                  data.frame(numero_de_grupos = n, tabla)
                  )
  print(xtable(tabla,digits = 0), type="HTML", include.rownames=FALSE)
}
print(xtable(resumen,digits = 0), type="HTML", include.rownames=FALSE)

# Plot of the 4 groups found by K-Means (color) against the original groups (shape)
qplot(min,mb,data=muestra,xlab = "Monthly Minutes", ylab = "Monthly MB", color = kmeans_4, shape = grupo)
