Previously I wrote about neural networks. Neural networks, like all other "supervised methods", are used when you have a sample that includes the value you want to predict. When you know what you want to achieve but have no sample of that value, the so‑called "unsupervised methods" are used instead.
A classic problem where this kind of method is applied is customer segmentation, where the segments/groups are not known in advance. Among the methods, one of the most famous is K‑Means.
K‑Means is an algorithm used to find groups of individuals with similar characteristics. Similarity or difference is calculated based on the Euclidean distance of their numerical attributes.
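To make the distance concrete, here is a minimal R sketch. The two customers and their values are made up for illustration and are not part of the article's data; `dist()` is base R and uses the Euclidean method by default.

```r
# Two hypothetical customers described by (minutes, MB); values are made up.
clientes <- rbind(
  a = c(min = 10, mb = 100),
  b = c(min = 13, mb = 104)
)

# Euclidean distance by hand: square root of the sum of squared differences.
d_manual <- sqrt(sum((clientes["a", ] - clientes["b", ])^2))

# The same value from base R's dist(), whose default method is "euclidean".
d_dist <- as.numeric(dist(clientes))

print(d_manual)  # 5, since sqrt(3^2 + 4^2) = 5
print(d_dist)    # 5
```

The smaller this distance, the more similar K‑Means considers two customers.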
Below we will follow a step‑by‑step example of how to segment customers in a simplified case.
For this example, we will generate a random sample of 120 customers with two dimensions each: data traffic (megabytes) and voice traffic (minutes). Each customer is simulated so that it follows the behavior pattern of one of four groups, with 30 customers per group.
With K‑Means we will try to find these four groups, which do exist but are unknown to us.
Below is a plot of the four groups:

Real Problem:
We are the Data Scientist of a prestigious mobile phone company (personal experience of the writer) and we are tasked with segmenting/classifying customers (again, nothing far from the writer’s reality). When reviewing the database, we realize that we only have the monthly data and voice consumption of each customer (there could be many more: top‑ups, text messages, calls to customer service, etc.).
Since data consumption (MB) and voice consumption (minutes) are on very different scales, and K‑Means measures similarity by the Euclidean distance between cases, we will scale both variables to the range 0 to 1 by dividing each by its maximum, obtaining the following data:
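The effect of scaling on the distance can be seen in a small sketch. The three customers below are hypothetical (made-up values), and dividing by the column maximum mirrors what the appendix code does; other scalings would also be possible.

```r
# Three hypothetical customers (made-up values): a and b differ only in
# minutes, a and c differ only in MB.
x <- data.frame(
  min = c(10, 150, 10),
  mb  = c(100, 100, 400),
  row.names = c("a", "b", "c")
)

# Raw distances: the 300 MB gap outweighs the 140-minute gap.
d_raw <- as.matrix(dist(x))

# Scale each column by its maximum (values are >= 0, so results lie in 0..1).
x_scaled <- as.data.frame(lapply(x, function(v) v / max(v)))
rownames(x_scaled) <- rownames(x)
d_scaled <- as.matrix(dist(x_scaled))

print(d_raw["a", c("b", "c")])     # b = 140, c = 300
print(d_scaled["a", c("b", "c")])  # b is about 0.93, c = 0.75
```

Before scaling, customer c looks farther from a than b does, only because MB values are numerically larger. After scaling, the order flips.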

Now we are ready to begin. Since K‑Means does not determine the number of groups, it is run for different numbers and the results of each case are compared. The choice of the number of groups lies somewhere between mathematics and judgment.
We will run K‑Means from 1 to 6 groups and use both the mathematical method and the "judgment" method to determine the number of groups.
The first thing to observe is the level of fit for each case. This is the percentage of variance explained by each segmentation, calculated as between_SS / total_SS, that is, the between‑group sum of squares over the total sum of squares (I won’t go into detail). When the fit no longer improves significantly as the number of groups increases, the new group is similar to an existing one, so the two should be treated as a single group.
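For readers who do want the detail, the ratio can be reproduced by hand. This is a sketch on toy data that I generate here (two hypothetical groups, an arbitrary seed, and `nstart = 10` chosen only for stability); it is not the article's sample.

```r
set.seed(1)  # arbitrary seed, only to make this sketch reproducible

# Toy data: two well-separated hypothetical groups in two dimensions.
x <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.1), ncol = 2),
  matrix(rnorm(40, mean = 1, sd = 0.1), ncol = 2)
)

modelo <- kmeans(x, centers = 2, nstart = 10)

# total_SS: squared distances from every point to the overall mean.
centro_global <- colMeans(x)
total_ss <- sum(sweep(x, 2, centro_global)^2)

# within_SS: squared distances from every point to its own cluster center.
within_ss <- sum((x - modelo$centers[modelo$cluster, ])^2)

# between_SS is what remains, so the fit is the share of the total
# variation that the grouping explains.
between_ss <- total_ss - within_ss
ajuste <- between_ss / total_ss

# Check that the manual value agrees with the components kmeans() reports.
print(all.equal(ajuste, modelo$betweenss / modelo$totss))
```

With a single group, between_SS is zero and the fit is 0; as groups are added, the fit can only approach 1, which is why the useful signal is where the improvement flattens out.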
Below is the fit plot for each case:

Having more than 4 groups does not seem to make sense. To confirm it, let’s compare the aggregated values of each group: their average values and the number of customers in each one.
Next comes the non‑numerical analysis, which consists of naming the groups. When two groups cannot be told apart, there is no point in separating them (just out of curiosity, at the end we will compare them with the "real" groups).
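The per-group averages and counts shown in the tables that follow can be computed in several ways. The appendix uses `ddply`; the sketch below uses only base R (`aggregate` and `table`) on a hypothetical six-customer data set with made-up values and cluster labels.

```r
# Hypothetical mini data set (made-up values) with a cluster label per customer.
clientes <- data.frame(
  min     = c(5, 7, 150, 146, 12, 14),
  mb      = c(990, 1010, 20, 30, 100, 120),
  cluster = c("1", "1", "2", "2", "3", "3")
)

# Average minutes and MB per cluster.
medias <- aggregate(cbind(min, mb) ~ cluster, data = clientes, FUN = mean)

# Number of customers per cluster, matched to the rows of `medias` by name.
conteo <- table(clientes$cluster)
medias$numero <- as.vector(conteo[as.character(medias$cluster)])

print(medias)
```

Each row then describes one group by its typical customer and its size, which is what the naming step relies on.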

1 Group

| Group | Avg. minutes | Avg. MB | Customers |
|---|---|---|---|
| 1) Average of all | 71 | 447 | 120 |

2 Groups

| Group | Avg. minutes | Avg. MB | Customers |
|---|---|---|---|
| 1) Few Minutes | 9 | 538 | 60 |
| 2) Many Minutes | 134 | 356 | 60 |

3 Groups

| Group | Avg. minutes | Avg. MB | Customers |
|---|---|---|---|
| 1) Low Traffic | 13 | 111 | 31 |
| 2) Many Minutes | 134 | 356 | 60 |
| 3) Many MBs and few Minutes | 5 | 994 | 29 |

4 Groups

| Group | Avg. minutes | Avg. MB | Customers |
|---|---|---|---|
| 1) Many MBs | 5 | 994 | 29 |
| 2) Many Minutes | 148 | 23 | 30 |
| 3) Low Traffic | 13 | 111 | 31 |
| 4) High Traffic | 120 | 690 | 30 |

5 Groups

| Group | Avg. minutes | Avg. MB | Customers |
|---|---|---|---|
| 1) High Traffic | 111 | 699 | 22 |
| 2) High Traffic, especially minutes* | 143 | 666 | 8 |
| 3) Many MBs | 5 | 994 | 29 |
| 4) Many Minutes | 148 | 23 | 30 |
| 5) Low Traffic | 13 | 111 | 31 |

*Groups 1 and 2 are very similar and group 2 has very few customers, so it makes no sense to separate them.

6 Groups

| Group | Avg. minutes | Avg. MB | Customers |
|---|---|---|---|
| 1) High Minutes Consumption | 148 | 23 | 30 |
| 2) Very High MB Consumption | 3 | 1172 | 10 |
| 3) Low Consumption, a bit more MBs | 9 | 288 | 10 |
| 4) High MB Consumption | 6 | 917 | 18 |
| 5) Low Consumption | 13 | 51 | 22 |
| 6) High Consumption | 120 | 690 | 30 |

Groups 2 and 4 are very similar; both are basically customers who use many MBs but very few minutes.

Therefore, there are 4 groups, which we will name:

| Group | Avg. minutes | Avg. MB | Customers |
|---|---|---|---|
| MB User | 5 | 994 | 29 |
| Minutes User | 148 | 23 | 30 |
| Low Consumption | 13 | 111 | 31 |
| High Consumption | 120 | 690 | 30 |

Now that we have the segmentation, let’s compare the groups in the generated data with the ones K‑Means found. In the following plot, the color represents the group found by K‑Means and the shape of the point represents the original group to which the customer belonged (according to the generated data):

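Indeed, K‑Means recovered the four generated groups almost perfectly: the sizes it found (29, 30, 31 and 30 customers) are very close to the 30 customers generated per group.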
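Besides the plot, the match can be checked numerically by cross-tabulating the original group against the cluster found. The sketch below uses hypothetical labels for eight customers (made-up values); on the article's data the equivalent call would presumably be `table(muestra$grupo, muestra$kmeans_4)`.

```r
# Hypothetical labels for eight customers (made-up values): the group each
# one was generated from and the cluster K-Means assigned to it.
original <- c("bajo", "bajo", "alto", "alto", "datos", "datos", "voz", "voz")
cluster  <- c("3",    "3",    "4",    "4",    "1",     "3",     "2",   "2")

# Rows are original groups, columns are clusters; a good segmentation puts
# almost all of each row in a single column.
tabla <- table(original, cluster)
print(tabla)

# Share of customers that fall in the dominant cluster of their original group.
pureza <- sum(apply(tabla, 1, max)) / length(original)
print(pureza)  # 0.875: seven of the eight customers
```

Cluster numbers are arbitrary, so what matters is that each row concentrates in one column, not that the labels coincide.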
A corollary: this is how a mobile operator redefined its products. We realized that the competition had products aimed at the high‑consumption and low‑consumption segments, while we had customers who mainly used either data or voice. So we decided to focus on those two segments, where there was no competition.
Appendix: The R code used for the analysis:
set.seed(1984)
library(ggplot2)
library(plyr)
library(xtable)

# Generate the sample: 30 customers per group, drawn from normal distributions
muestra <- rbind(
  data.frame(min = rnorm(30, 10, 10),  mb = rnorm(30, 100, 100),  grupo = "bajo"),  # Customers with low consumption
  data.frame(min = rnorm(30, 120, 20), mb = rnorm(30, 700, 100),  grupo = "alto"),  # Customers with high consumption
  data.frame(min = rnorm(30, 3, 10),   mb = rnorm(30, 1000, 200), grupo = "datos"), # Customers with high data consumption
  data.frame(min = rnorm(30, 150, 20), mb = rnorm(30, 15, 60),    grupo = "voz")    # Customers with high voice consumption
)

# Consumption cannot be negative: replace negative draws with 0
muestra$min <- pmax(muestra$min, 0)
muestra$mb <- pmax(muestra$mb, 0)
qplot(min, mb, data = muestra, xlab = "Monthly Minutes", ylab = "Monthly MB", color = grupo)

# Normalization of data: divide each variable by its maximum (range 0 to 1)
muestra <- transform(muestra, n_min = min / max(min), n_mb = mb / max(mb))
qplot(n_min, n_mb, data = muestra, xlab = "Normalized Monthly Minutes", ylab = "Normalized Monthly MB")

# Run the model for 1 to 6 groups and store the fit of each run
codo <- data.frame()
set.seed(2016)
for (grupos in 1:6) {
  modelo <- kmeans(muestra[, c("n_min", "n_mb")], grupos, iter.max = 100)
  codo <- rbind(codo,
    data.frame(grupos = grupos,
      between_SS = modelo$betweenss,
      total_ss = modelo$totss,
      tot.withinss = modelo$tot.withinss,
      value = modelo$betweenss / modelo$totss)
  )
  # Keep the cluster assigned to each customer in a column kmeans_<k>
  muestra[, paste0("kmeans_", grupos)] <- as.character(modelo$cluster)
}

# Plot of the fit level
qplot(x = grupos, y = value, data = codo, geom = "line", ylab = "Percentage of Fit", xlab = "Number of groups")

# Summary for each model: average minutes, average MB and number of customers per group
resumen <- data.frame()
for (n in 1:6) {
  tabla <- ddply(muestra, paste0("kmeans_", n), function(x) data.frame(min = mean(x$min), mb = mean(x$mb), numero = nrow(x)))
  colnames(tabla)[1] <- "grupo_numero"
  resumen <- rbind(resumen,
    data.frame(numero_de_grupos = n, tabla)
  )
  print(xtable(tabla, digits = 0), type = "html", include.rownames = FALSE)
}
print(xtable(resumen, digits = 0), type = "html", include.rownames = FALSE)

# Plot of each group: color is the cluster from the chosen 4-group model, shape is the original group
qplot(min, mb, data = muestra, xlab = "Monthly Minutes", ylab = "Monthly MB", color = kmeans_4, shape = grupo)
