Separate Effects and Cohort Analysis

In subscription businesses (Newspapers, Cell Phones, Insurance, etc…), the business is always the same: acquire a customer and then receive cash flows associated with a service provided by the company. The day the customer cancels the service is called CHURN, and the customer becomes inactive, suspending both revenue and service.

In general, there is a cost to acquire a new customer, so the present value of the acquisition will depend on how long the customer takes to CHURN. In the following image, we show in red the present value of a customer for each number of survival months, while in blue the cash flows, where the first flow is negative, representing for example: sales commissions, service installation, equipment subsidies, etc. Clearly, if the customer does not last at least 15 months, the business was bad and represented a loss for the company.

library(ggplot2)
periods = 36
discount_rate = 1.06
cashflows = data.frame(period = 0:periods, cashflow = c(-10,rep(1,periods) )) %>%
  mutate(cashflow_present_value = cashflow/(discount_rate^period),
         present_value = cumsum(cashflow_present_value))
ggplot(cashflows) +
  geom_bar(aes(period,cashflow),stat = "identity",fill = "blue") +
  geom_line(aes(period,present_value),color = "red", size = 1.5)

To monitor and project customer behavior, what is called a cohort analysis is performed. Broadly speaking, it consists of grouping all customers acquired in a time window (usually monthly) and using it as a statistically significant representative of the group’s behavior. In this way, you can compare the average behavior of acquisitions from different months and make some projection models or simply compare the performance of one month versus other months.

The KPIs monitored month by month throughout the life of a cohort are the following:

Activations: Number of new customers acquired in the month
Average Revenue: Average revenue generated by a customer who is still subscribed to the service
Survival Rate: Percentage of customers who have not yet CHURNed

The interesting thing is that by multiplying the 3 above indicators, you obtain the total revenue of the cohort for a month. Therefore, if a good model is achieved to estimate them, future cash flows of a company or the present value of a new customer can be accurately estimated.

In this case, we will see a cohort analysis carried out for the donors of Techo para Chile in which we analyzed these 3 KPIs and decomposed them into factors. In this case, because I must maintain the anonymity of the donors and company data, I will share only the codes and the graphs without scale, but not the dataset.

First, let’s initialize the working environment by setting some variables, loading libraries, and creating functions.

options(stringsAsFactors = FALSE)
options(dplyr.width = Inf)
options(dplyr.print_max = 100)
Sys.setenv(TZ='GMT')
library(compiler)
library(tidyverse)
library(lubridate)
library(broom)
library(plotly)
moda = function(x){
  names(which.max(table(x)))
}
isnull = function(x,reemplazo,Nulos = c(NA,Inf,-Inf,NULL,NaN)){
  ifelse(x %in% Nulos,reemplazo,x)
}
datasetRaw = readRDS("../3. dataset/dataset_cohort.rds")

The received dataset is a table with the cash flows from 2016 and 2017. The trick to make this analysis easier is to make the months in which donors do not make their contribution appear and fill them with a 0. On the other hand, we will keep only the contributions from the first 36 months of the donors’ life.

grilla = group_by(datasetRaw,nro_contrato,mes_activacion,monto_inicial,banco_t0,region,banco_t0, origen_pago_t0, hombre) %>%
  summarise() %>%
  crossing(meses = 1:36) %>%
  mutate(periodo = months(meses-1) + mes_activacion) %>%
  filter(periodo >= as.Date("2016-01-01") &  periodo <= as.Date("2017-12-01"))

dataset = grilla %>%  
  left_join(datasetRaw) %>%
  mutate(hombre = as.numeric(hombre),
         monto = replace_na(monto,0)) %>%
  group_by(nro_contrato) %>%
  arrange(meses) %>%
  mutate(meses_activo = meses - min(c(meses[monto>0],9999))[1],
         meses_activo = replace(meses_activo,meses_activo <= 1,NA))

KPI Activations
The activations or donor acquisitions are not very constant; they can be seen by region and bank respectively:

tmp = filter(dataset, meses == 1)
gr = ggplot(tmp,aes(mes_activacion,fill = region)) + geom_bar()
ggplotly(gr,width = 800)

tmp = filter(dataset, meses == 1)
gr = ggplot(tmp,aes(mes_activacion,fill = banco_t0)) + geom_bar()
ggplotly(gr,width = 800)

KPI Average Revenue
The natural thing is to see a histogram of the initial donation for the time window and extract some statistics such as mean ($4,569) and median ($5,000).

dataset_t0 = dataset  %>%
    group_by(nro_contrato,mes_activacion,region,hombre, banco_t0) %>%  
    summarise(monto_inicial = mean(monto_inicial)) %>%
    filter(dataset_t0, monto_inicial <= 20000) # Remove some outliers
gr = ggplot(dataset_t0,aes(monto_inicial)) + geom_histogram(bins = 30)
ggplotly(gr,width = 800)

What is really interesting is to understand why the donations are different and separate the factors, in order to target better donors in the future. In this case, we will analyze with the attributes we have: gender, region, and bank.
The decomposition will be done with a linear regression with dummy variables, where the betas will represent the contribution of each attribute to the average donation.
Due to correlation, for each of the dimensions or attributes, the first category will disappear, which is interpreted as the weights being relative to the value of the missing attribute. On the other hand, we will omit the significance analysis.

We can see that those who donate using the BICE bank have an average contribution $1,500 higher than the reference level, so it should be prioritized against Movistar and Presto which are $3,000 below the reference level. On the other hand, it is convenient to run fundraising campaigns in the 2nd and 11th regions where donors give more.

tmp = dataset %>% filter(monto>0 & monto <=20000)
fit_monto_t0 = lm(monto_inicial ~ region + hombre + origen_pago_t0, tmp)
gr_data = tidy(fit_monto_t0)[-1,]
gr = ggplot(gr_data,aes(term,estimate)) +
  geom_bar(stat="identity")+  
  theme(axis.text.x = element_text(angle = 45,hjust = 1))
ggplotly(gr,width = 800)

KPI Survival Rate
This is the most complex KPI because it evolves over time.
The equivalent of the average donation histogram is the average survival rate graph, which looks like this for the first 15 months:

tmp = dataset %>%
  filter(mes_activacion >= as.Date("2016-01-01") & meses < 15) %>%
  filter(!is.na(meses_activo)) %>%
  group_by(meses) %>%
  summarise(supervivencia = mean(monto >0))
gr = ggplot(tmp,aes(meses,supervivencia)) +geom_line()
ggplotly(gr,wifth=800)

The graph above speaks for itself. Up to month 15, customer loss is more or less linear, but we would like to separate that graph into factors: in this case, seasonality and number of months.
To achieve the decomposition, we will do a regression with dummy variables for each month and period, thus not assuming any shape for the curve. The betas associated with the period will represent the seasonality.

The graph by deseasonalized number of month is almost identical to the previous case, only the level changes, but I cannot show it:

tmp = dataset %>%
  filter(mes_activacion >= as.Date("2016-01-01") & meses < 15) %>%
  filter(!is.na(meses_activo)) %>%
  mutate(periodo = as.character(periodo),
         meses = as.character(meses),
         supervivencia = monto>0,
         meses_activo = as.character(meses_activo))
fit = lm(supervivencia ~ meses + periodo - 1, tmp)
fit_tidy = tidy(fit)
fit_meses = fit_tidy %>%
  filter(!grepl("periodo",term)) %>%
  # mutate(estimate = estimate+estimate[1]) %>%
  filter(grepl("meses",term)) %>%
  mutate(meses = as.numeric(str_remove(term,"meses")))
gr = ggplot(fit_meses,aes(meses, estimate)) + geom_line()
ggplotly(gr,wifth=800)

On the other hand, the seasonality graph shows that there are better months than others. In particular, the end of 2016 and beginning of 2017 were very bad, then there was a recovery that was lost in 2017/10.

fit_periodo = fit_tidy %>%
  filter(grepl("periodo",term)) %>%
  mutate(periodo = as.Date(str_remove(term,"periodo")))
gr = ggplot(fit_periodo,aes(periodo, estimate)) + geom_line()
ggplotly(gr,wifth=800)

Finally, like in the decomposition of average donation, we will perform a decomposition of survival. To do this, we must assume a shape for the survival curve, which in this case will be a straight line.
We are going to estimate the contribution to the slope of the survival curve of each of the factors we have to analyze. The trick is to reflect the shape of the curve in the dummy variables of the model; in this case, since it is linear, we will put the month number.

The banks Security, Itau, and BICE are those with the best survival rate, while Estado, Falabella, and Retail Cards are the customers who last the least. On the other hand, regions 11 and 15 are where donors last the longest.

tmp = dataset %>%
  filter(mes_activacion >= as.Date("2016-01-01") & meses < 15) %>%
  filter(!is.na(meses_activo)) %>%
  group_by(meses,periodo,region,hombre,banco_t0) %>%
  summarise(tasa_supervivencia = mean(monto>0), casos = n()) %>%
  filter(casos > 25) %>%
  ungroup() %>%
  mutate(periodo = as.character(periodo),
         supervivencia  = (monto > 0)*1)
mat = model.matrix(~supervivencia+  meses + periodo + region + hombre + banco_t0-1,tmp) %>% data.frame()
mat[,!grepl("periodo|supervivencia",colnames(mat))] = mat[,!grepl("periodo|supervivencia",colnames(mat))] *mat$meses
mat = select(mat,-meses)
fit = lm(supervivencia ~ . -1, as.data.frame(mat))
data_fit = tidy(fit) %>%
  filter(!str_detect(term,"periodo"))
gr = ggplot(data_fit,aes(term,estimate)) +
  geom_bar(stat="identity") +
  theme(axis.text.x = element_text(angle = 45,hjust = 1))+
  ylab("contribution to survival rate")+
  xlab("dimension")
ggplotly(gr,width = 800)

Conclusions
More than a conclusion, now what must be done is a model that uses the calculated parameters and values the average acquisition for the product cross of each of the modeled attributes. This will allow optimizing the location of people seeking donors. For example, clients of BICE bank appear to be good ones, so it may be a good idea to place a volunteer at the entrance of BICE bank.

Greetings!

Geekosas

ASDFantastinc!

Separate Effects and Cohort Analysis

Be the first to comment

Leave a Reply Cancel reply

Share with your friends!

Be the first to comment

Leave a Reply Cancel reply