As they say, averages hide many things. In the article gender-pay-gap-en-tecnologia we saw an analysis that showed how, for that data, the salary difference between men and women can be explained by factors other than gender.
Now we are going to look at a machine-learning-based technique that is simple to explain and communicate, and that helps "uncover" what lies beneath the averages.
Approach
Imagine you are the data scientist in the satisfaction area of a company and you are in charge of maintaining the rating your customers give to the company’s service (or some other KPI). This rating is obtained through a monthly sampling of customers who contacted the call center.
Today is the day: the new satisfaction survey has arrived, and your boss is eager to know how last month went. Eureka! The average rating rose from 5.577 to 5.723, so everyone gets the bonus and goes out to lunch.
But what does that average hide? Did the rating really increase? Let’s see how to quickly perform this analysis.
Data
For each month (previous and current) we have a table with 2000 observations that looks like this (simulated data):
  id        causa genero region nota
1  1       equipo hombre  norte    6
2  2        saldo  mujer  norte    8
3  3  facturacion hombre  norte    2
4  4        saldo  mujer centro    6
5  5 conectividad  mujer centro    9
6  6 conectividad hombre centro    4
Model
To understand which variables explain the rating, we fit a regression tree with rpart on the previous month's data (data1). It looks like this:
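library(rpart)   # recursive partitioning trees
library(rattle)  # provides fancyRpartPlot()
# cp is the complexity parameter: splits that improve the fit by less than cp are not kept
fit = rpart(nota ~ causa + genero + region, data = data1, cp = 0.015)
fancyRpartPlot(fit)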

The tree reads as follows. At the root (top), the average rating is 5.6. Splitting by the reason for the call, the rating drops to 4.4 when the reason is connectivity or billing and rises to 6.3 otherwise.
Each of these branches splits again. The left branch splits by the customer's gender, where men give a rating of 4.1 and women 5.2, while the right branch splits by the geographic area where the customer lives.
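The leaf averages shown in the plot can also be read from the fitted object itself. The following is a minimal sketch on simulated data: the data frame sim, its effect sizes, and the name fit_sim are hypothetical stand-ins, not the data used in this article.

```r
library(rpart)

# Hypothetical data standing in for data1; the effect sizes are made up.
set.seed(1)
n <- 2000
sim <- data.frame(
  causa  = factor(sample(c("equipo", "saldo", "facturacion", "conectividad"),
                         n, replace = TRUE)),
  genero = factor(sample(c("hombre", "mujer"), n, replace = TRUE)),
  region = factor(sample(c("norte", "centro"), n, replace = TRUE))
)
# Assumed effect: connectivity and billing calls get lower ratings.
sim$nota <- ifelse(sim$causa %in% c("conectividad", "facturacion"), 4.4, 6.3) +
  rnorm(n, sd = 1.6)

fit_sim <- rpart(nota ~ causa + genero + region, data = sim, cp = 0.015)

# Each row of fit_sim$frame is a node; leaves are marked with var == "<leaf>".
# n is the number of observations in the node and yval its mean rating.
leaves <- fit_sim$frame[fit_sim$frame$var == "<leaf>", c("n", "yval")]
print(leaves)

# fit_sim$where gives, for each training row, the frame row of its leaf,
# so averaging nota by leaf reproduces the yval column.
leaf_means <- tapply(sim$nota, fit_sim$where, mean)
print(leaf_means)
```

The row names of leaves are rpart node numbers, while the names of leaf_means are row positions in the frame, so the two labelings can differ even though the values agree.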
The intuition is the following: my total rating can change for 2 reasons:
- Because a leaf changed its rating.
- Because a leaf became more important (for example, if more women are surveyed, my rating should rise).
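These two reasons can be written as an exact identity. The notation here is introduced only for illustration: $w_{h,t}$ is the share of the sample that falls in leaf $h$ in month $t$, and $r_{h,t}$ is the mean rating of that leaf in that month.

```latex
\bar{r}_2 - \bar{r}_1
  = \sum_h w_{h,2}\, r_{h,2} - \sum_h w_{h,1}\, r_{h,1}
  = \underbrace{\sum_h w_{h,2}\,\bigl(r_{h,2} - r_{h,1}\bigr)}_{\text{leaves changed their rating}}
  + \underbrace{\sum_h \bigl(w_{h,2} - w_{h,1}\bigr)\, r_{h,1}}_{\text{leaves changed their weight}}
```

The first sum is zero if no leaf changes its rating, and the second is zero if the mix of customers across leaves stays the same.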
We will try to decompose the contributions into these 2 factors:
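library(dplyr)      # %>%, group_by, summarise, mutate, left_join
library(treeClust)  # rpart.predict.leaves

# Previous month: mean rating, standard deviation and count per leaf
dataset1 = data.frame(data1, hoja = rpart.predict.leaves(fit, data1)) %>%
  group_by(hoja) %>%
  summarise(nota1 = mean(nota), desvest1 = sd(nota), freq1 = n())

# Current month, assigned to the leaves of the same tree
dataset2 = data.frame(data2, hoja = rpart.predict.leaves(fit, data2)) %>%
  group_by(hoja) %>%
  summarise(nota2 = mean(nota), freq2 = n())

# Share of the sample that falls in each leaf, per month
dataset = dataset1 %>%
  left_join(dataset2, by = "hoja") %>%
  ungroup() %>%
  mutate(peso1 = freq1/sum(freq1),
         peso2 = freq2/sum(freq2))

dataset = dataset %>%
  mutate(
    delta_freq = (freq2 - freq1)/freq1,  # relative change in leaf size
    delta_nota = nota2 - nota1,          # change in the leaf's mean rating
    # one-sided normal p-value based on the month-1 standard error
    pval = pnorm(-abs(delta_nota), 0, desvest1/sqrt(freq1))
  )
print(dataset)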
# A tibble: 4 x 11
   hoja nota1 desvest1 freq1 nota2 freq2 peso1 peso2 delta_freq delta_nota     pval
  <int> <dbl>    <dbl> <int> <dbl> <int> <dbl> <dbl>      <dbl>      <dbl>    <dbl>
1     3  4.09     1.63   560  3.47   221  0.28 0.110     -0.605     -0.620 1.40e-19
2     4  5.22     1.64   228  4.35   567 0.114 0.284       1.49     -0.865 8.67e-16
3     6  5.93     1.57   697  6.42   697 0.348 0.348          0      0.494 4.55e-17
4     7  6.88     1.65   515  7.25   515 0.258 0.258          0      0.373 1.36e- 7
In the table above, the first row corresponds to the leftmost leaf, and each following row moves one leaf to the right in the tree. Leaves 3 and 4 (rows 1 and 2), which correspond to billing and connectivity calls, show a considerable drop in rating (column delta_nota = nota2 - nota1). A quick test (column pval) indicates that this difference is statistically significant.
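The pval column relies only on the month-1 standard deviation. As an alternative sketch, and not the method used in this article, a Welch two-sample t-test takes into account the variability and sample size of both months. The ratings below are simulated: the sizes and means loosely follow leaf 3, and the month-2 standard deviation is an assumption, since the table does not report it.

```r
# Hypothetical ratings for a single leaf in each month.
set.seed(42)
leaf_month1 <- rnorm(560, mean = 4.09, sd = 1.63)
leaf_month2 <- rnorm(221, mean = 3.47, sd = 1.63)  # month-2 sd is assumed

# t.test() defaults to var.equal = FALSE, i.e. the Welch test, which does not
# require equal variances or equal sample sizes.
res <- t.test(leaf_month2, leaf_month1)

res$p.value   # two-sided p-value
res$conf.int  # confidence interval for mean(month 2) - mean(month 1)
```

With the real data, leaf_month1 and leaf_month2 would be the nota values of the rows assigned to the same leaf in data1 and data2.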
If we try to decompose the overall change in rating into the two factors: frequency and rating, we get the following result:
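dataset = dataset %>%
  mutate(
    # frequency effect: change in leaf size valued at the month-1 rating
    aporte_dfreq = peso1 * nota1 * delta_freq,
    # rating effect: change in leaf rating weighted by the month-2 share
    aporte_dnota = peso2 * delta_nota
  )
dataset %>% select(-pval)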
# A tibble: 4 x 12
   hoja nota1 desvest1 freq1 nota2 freq2 peso1 peso2 delta_freq delta_nota aporte_dfreq aporte_dnota
  <int> <dbl>    <dbl> <int> <dbl> <int> <dbl> <dbl>      <dbl>      <dbl>        <dbl>        <dbl>
1     3  4.09     1.63   560  3.47   221  0.28 0.110     -0.605     -0.620       -0.693      -0.0685
2     4  5.22     1.64   228  4.35   567 0.114 0.284       1.49     -0.865        0.885       -0.245
3     6  5.93     1.57   697  6.42   697 0.348 0.348          0      0.494            0        0.172
4     7  6.88     1.65   515  7.25   515 0.258 0.258          0      0.373            0        0.096
> #validation
> sum(dataset$aporte_dnota) + sum(dataset$aporte_dfreq)
[1] 0.1465
> mean(data2$nota) - mean(data1$nota)
[1] 0.1465
# Factor contributions
> sum(dataset$aporte_dfreq)
[1] 0.1921425
> sum(dataset$aporte_dnota)
[1] -0.04564248
In short, the change in ratings within the leaves contributed -0.046 (column aporte_dnota), which is a loss. The gain in the overall rating comes from the change in frequencies, which contributed 0.192 (column aporte_dfreq), mainly because there were more women in the sample.
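The validation above matches exactly because both months have the same sample size (2000), so that peso1 * delta_freq equals peso2 - peso1. A minimal base-R sketch of the same decomposition that also holds for unequal sample sizes is shown here. The function name decompose_change and the toy vectors are hypothetical, and the function assumes every segment appears in both months.

```r
# Split the change in an overall mean into a rating effect and a
# frequency (mix) effect per segment, using weight differences directly.
decompose_change <- function(seg1, y1, seg2, y2) {
  stopifnot(setequal(seg1, seg2))  # every segment must appear in both months
  segs <- sort(unique(seg1))
  f1 <- factor(seg1, levels = segs)
  f2 <- factor(seg2, levels = segs)
  w1 <- as.numeric(table(f1)) / length(seg1)  # month-1 share per segment
  w2 <- as.numeric(table(f2)) / length(seg2)  # month-2 share per segment
  m1 <- as.numeric(tapply(y1, f1, mean))      # month-1 mean per segment
  m2 <- as.numeric(tapply(y2, f2, mean))      # month-2 mean per segment
  data.frame(
    segment      = segs,
    aporte_dnota = w2 * (m2 - m1),  # segment changed its rating
    aporte_dfreq = (w2 - w1) * m1   # segment changed its weight
  )
}

# Toy example with made-up values and different sample sizes (4 vs 5).
seg1 <- c("a", "a", "a", "b")
y1   <- c(4, 5, 3, 8)
seg2 <- c("a", "b", "b", "b", "b")
y2   <- c(3, 8, 7, 9, 8)

out   <- decompose_change(seg1, y1, seg2, y2)
total <- sum(out$aporte_dnota) + sum(out$aporte_dfreq)
print(out)
print(all.equal(total, mean(y2) - mean(y1)))
```

In the toy example the rating effect is negative while the frequency effect is positive and larger, the same pattern seen in the survey. With the tree from this article, seg1 and seg2 would be the leaf assignments (hoja) of each month.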
Conclusion
We can go celebrate, because the bonus was indeed earned, but we need to see what happened with the connectivity and billing causes, since next month we might not benefit from an increase in women in the survey.
What we need to do is start by checking whether there has been a change in the normal service protocols for connectivity and/or billing, or even listen to some of the conversations to detect what is happening. The important thing is to correct the situation soon.
Cheers!
