Uncovering averages Part 2

A few days ago I wrote the article Uncovering Averages which basically did was to open an average value into factors using trees. Please read the article before continuing.

In that example analysis I created the dataset and therefore knew exactly where the change was, which was in the cause column, only for the connectivity case, but since the tree did not split on that dimension, it only left a "clue about where to look".

Since this Data Science world is a mix of obsession and creativity, I kept thinking and thinking about how to make all the factors appear until I hit the nail using linear regression.

The methodology is the same; what changes is that the beta corresponds to the average rating and the average of each predictor variable corresponds to the rate.

First we load the datasets, which look like this:

library(tidyverse)
library(broom)
library(caret)
data1 = readRDS('dataset1.rds')
data2 = readRDS('dataset2.rds')
> head(data1)
  id        causa genero region nota
1  1       equipo hombre  norte    6
2  2        saldo  mujer  norte    8
3  3  facturacion hombre  norte    2
4  4        saldo  mujer centro    6
5  5 conectividad  mujer centro    9
6  6 conectividad hombre centro    4

Then we dummy‑code the variables so that the categories are redundant:

encoder <- dummyVars( ~ nota + causa + genero + region, data1)
dataset1 <- predict(encoder, data1)
dataset2 <- predict(encoder, data2)

And we run a model for each period and use tidy from broom to format the regression parameters as a data frame (note that tidy removes the parameters that could not be estimated due to linear dependence; essentially they are contrasts and the estimate is 0 – we will need that later).

fit1 = lm(nota ~ .,data.frame(dataset1))
fit2 = lm(nota ~ .,data.frame(dataset2))
params1 = tidy(fit1) %>% select(term,estimate)
params2 = tidy(fit2) %>% select(term,estimate)
params1
> params1
  term              estimate
1 (Intercept)         7.87  
2 causaconectividad  -1.98  
3 causaequipo        -0.844
4 causafacturacion   -2.88  
5 generohombre       -1.02  
6 regioncentro       -0.925
7 regionnorte         0.0486

Now we calculate the proportion (average value) of each column in the dummy‑coded dataset:

porciones1 = dataset1 %>% data.frame() %>% summarise_all(mean) %>% gather(term,porcion)
porciones2 = dataset2 %>% data.frame() %>% summarise_all(mean) %>% gather(term,porcion)
porciones1
                term porcion
1               nota   5.577
2  causaconectividad   0.293
3        causaequipo   0.216
4   causafacturacion   0.101
5         causasaldo   0.390
6       generohombre   0.704
7        generomujer   0.296
8       regioncentro   0.576
9        regionnorte   0.213
10         regionsur   0.211

And we cross the above datasets, filling the contrast betas with 0 and for the intercept the proportion will be 1 because it affects all elements:

dataset = params1 %>%
  left_join(params2, by = "term",suffix = c("_1","_2")) %>%
  full_join(porciones1,by = "term") %>%
  left_join(porciones2,by = "term",suffix = c("_1","_2")) %>%
  filter(term!="nota") %>%
  mutate_at(vars(starts_with("e")),~replace_na(.,0)) %>%
  mutate_at(vars(starts_with("p")),~replace_na(.,1))
> dataset
   term              estimate_1 estimate_2 porcion_1 porcion_2
1 (Intercept)           7.87        7.80      1        1     
2 causaconectividad    -1.98       -2.87      0.293    0.304
3 causaequipo          -0.844      -0.966     0.216    0.212
4 causafacturacion     -2.88       -2.98      0.101    0.0985
5 generohombre         -1.02       -1.00      0.704    0.290
6 regioncentro         -0.925      -0.864     0.576    0.586
7 regionnorte           0.0486      0.174     0.213    0.201
8 causasaldo            0           0         0.39     0.386
9 generomujer           0           0         0.296    0.71  
10 regionsur             0           0         0.211    0.213

Note that the product of the estimate column and the proportion column gives the average, so we can say we have decomposed the average into factors:

sum(dataset$estimate_1 * dataset$porcion_1)
[1] 5.577
> mean(data1$nota)
[1] 5.577
> sum(dataset$estimate_2 * dataset$porcion_2)
[1] 5.6705
> mean(data2$nota)
[1] 5.6705

The above holds because the sum of squared residuals in linear regression is zero.

Now we use the same trick as in the previous article, calculating the change in beta and the change in the sample, and with that we determine the contributions:

dataset = dataset %>%
  mutate(
    delta_freq = porcion_2 - porcion_1,
    delta_nota = estimate_2 - estimate_1
  )
dataset = dataset %>%
  mutate(aporte_dfreq = estimate_1 * delta_freq,
         aporte_dnota =  porcion_2 * delta_nota
  )

What is interesting is that the sum of the contributions equals the change in rating:

sum(dataset$aporte_dnota) + sum(dataset$aporte_dfreq)
[1] 0.0935
> mean(data2$nota) - mean(data1$nota)
[1] 0.0935

Therefore we can say that we have decomposed the average, and just like in the previous result, if we look at how much the average changed due to rating and due to sample, we get the following:

sum(dataset$aporte_dfreq)
[1] 0.4012351
> sum(dataset$aporte_dnota)
[1] -0.3077351

Result similar to the previous article.

On the other hand, going back to the contribution table, we can see that a 41% decrease in the number of men surveyed caused an increase of 0.42 in the rating, which cancels out the 0.27‑point decrease caused by the worse evaluation of the connectivity cases.

Finally, this model seems superior to the tree one, but to perform linear regressions, certain assumptions are needed that are not required for trees; therefore, we must be a little more careful.

I hope this is useful, and I’ll be checking the comments to see if there’s anything you want me to address.

Cheers!