The gender pay gap is the difference between the average salaries of men and women.
Some people attribute it to discrimination, while others say it stems from the different decisions that men and women make on average.
Since both opinions have merit, I decided to analyze a dataset that could shed light on the question. That is how I came across the data from the survey that StackOverflow conducts every year. The interesting thing about it is that, besides gender and salary, it collects many other attributes, such as favorite operating system, programming languages, and years of experience.
The survey is answered by people in the technology field from all over the world. We will focus on the 2018 data for the United States, restricted to employees (not independent workers) with annual salaries between USD 50,000 and USD 200,000, because that is where the most responses were collected and where the majority of salaries are concentrated.
In this analysis I will rely only on advanced analytics tools and interpret their results, but you, the readers, are free to debate the topic.
Evidence of the Salary Gap
To begin, I will show that there is indeed a difference in the average annual salaries of men and women:
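library(dplyr)     # group_by, summarise, filter, %>%
library(ggplot2)   # ggplot and geoms
library(ggthemes)  # scale_fill_tableau
# Average salary and number of respondents per gender (Male is TRUE/FALSE)
aux = dataset %>%
group_by(Male) %>%
summarise(salario = mean(ConvertedSalary,na.rm=T),
number = n()) %>%
filter(!is.na(Male))
ggplot(aux,aes(Male,salario)) + geom_bar(stat = 'identity', fill = 'dark orange') + ggtitle('Salario Promedio por Genero') + scale_fill_tableau()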

If we look at its distribution, we notice that women have a greater concentration in lower salaries:
ggplot(filter(dataset,!is.na(Male)),aes(ConvertedSalary,group = Male, fill = Male)) +
geom_density(alpha= 0.3,kernel = 'epanechnikov') +
ggtitle('Distribucion de Sueldo por Genero') +
scale_fill_tableau()

It is easier to see this effect if we look at the logarithm of salaries, where it is clear that women’s salaries are more to the left than men’s:
ggplot(filter(dataset,!is.na(Male)),aes(log10(ConvertedSalary),group = Male, fill = Male)) +
geom_density(alpha= 0.3,kernel='epanechnikov') +
ggtitle('Distribucion de Log Sueldo por Genero') +
scale_fill_tableau()

Finally, a one-tailed Welch t-test on the means yields a p-value below 2.2e-16, so we can state that the mean salary of women is significantly lower than that of men.
t.test(ConvertedSalary ~ Male, filter(dataset,!is.na(Male)),alternative = 'less')
##
## Welch Two Sample t-test
##
## data: ConvertedSalary by Male
## t = -8.452, df = 917.2, p-value < 2.2e-16
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
## -Inf -7793.837
## sample estimates:
## mean in group FALSE mean in group TRUE
## 95684.52 105363.99
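To make the output above easier to read, here is a minimal sketch of what the one-tailed Welch test computes. The salaries are made-up toy values, not the survey data, and names such as t_manual and df_manual are only illustrative. With a logical grouping variable, t.test takes FALSE as the first group, so alternative = 'less' tests whether mean(FALSE) minus mean(TRUE) is below zero.

```r
# Toy salaries (hypothetical values, NOT the survey data)
salary <- c(82000, 91000, 88000, 97000, 90000,             # group FALSE
            95000, 104000, 99000, 110000, 102000, 98000)   # group TRUE
male   <- c(rep(FALSE, 5), rep(TRUE, 6))

x <- salary[!male]  # first group (FALSE)
y <- salary[male]   # second group (TRUE)

# Welch statistic: difference in means over the unpooled standard error
vx <- var(x) / length(x)
vy <- var(y) / length(y)
t_manual <- (mean(x) - mean(y)) / sqrt(vx + vy)

# Welch-Satterthwaite degrees of freedom
df_manual <- (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))

# One-sided p-value for H1: mean(FALSE) - mean(TRUE) < 0
p_manual <- pt(t_manual, df_manual)

# Same test through the built-in function, for comparison
tt <- t.test(salary ~ male, alternative = "less")
print(c(t = t_manual, df = df_manual, p = p_manual))
print(tt)
```

The manual values should agree with the ones t.test reports, which is a quick way to confirm which direction the 'less' alternative refers to.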
Analysis with Trees
To start, we will use Machine Learning to find the variables that best explain a programmer’s salary, to see if gender appears.
We can see that the most important variables are:
- YearsCodingProf: years of professional coding experience. More than 5 years is associated with a higher average salary.
- OperatingSystem: the operating system the respondent works on. Using Windows is associated with a higher average salary.
- AssessBenefits2: how much the respondent values receiving company shares (1 is the most). People who value company shares more have a higher average salary.
library(rpart)   # rpart
library(rattle)  # fancyRpartPlot
fitTree = rpart(ConvertedSalary ~ .,dataset)
fancyRpartPlot(fitTree)
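Besides reading the plot, the variables a tree actually split on can be listed from the fitted object. The sketch below uses made-up toy data (the columns years, os and gender and their effects are assumptions for illustration only), where salary depends on experience alone by construction.

```r
library(rpart)

set.seed(1)
n <- 400
# Hypothetical toy data, NOT the survey
toy <- data.frame(
  years  = sample(0:20, n, replace = TRUE),
  os     = factor(sample(c("Windows", "MacOS", "Linux"), n, replace = TRUE)),
  gender = factor(sample(c("F", "M"), n, replace = TRUE))
)
# Salary depends on experience only; os and gender have no effect by construction
toy$salary <- 60000 + 3000 * toy$years + rnorm(n, sd = 5000)

fit <- rpart(salary ~ ., data = toy)

# Variables used in at least one split (leaves are labelled "<leaf>")
used_vars <- setdiff(unique(as.character(fit$frame$var)), "<leaf>")
print(used_vars)

# rpart also stores an importance score per variable
print(fit$variable.importance)
```

A variable that is missing from used_vars was never the best split at any node, which is the sense in which gender "does not appear in the tree".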

Since gender does not appear in the tree, let’s see what differences exist between men and women in these three variables.
The first thing we can see is that on average women have almost 3 years less programming experience than men.
aux = dataset %>%
group_by(Male) %>%
summarise(YearsCodingProf = mean(YearsCodingProf,na.rm=T),
AssessBenefits2 = mean(AssessBenefits2,na.rm=T),
number = n()) %>%
filter(!is.na(Male))
ggplot(aux,(aes(Male,YearsCodingProf))) + geom_bar(stat = 'identity', fill = 'dark orange') + ggtitle('Años Promedio Programando')

On the other hand, women have a lower preference for having company shares:
ggplot(aux,(aes(Male,AssessBenefits2))) + geom_bar(stat = 'identity', fill = 'dark orange') + ggtitle('Preferencia por Acciones de Empresa, Menor es Mayor Preferencia')

And finally, men on average prefer Windows while women prefer Apple:
aux = dataset %>%
filter(!is.na(OperatingSystem) & !is.na(Male) ) %>%
group_by(Male,OperatingSystem) %>%
summarise(number = n()) %>%
group_by(Male) %>%
mutate(os_portion = number/sum(number))
ggplot(aux,aes(Male,os_portion, fill = OperatingSystem)) + geom_bar(position = 'stack',stat = 'identity') +
ggtitle('Porcion de Sistema operativo por Genero') + scale_fill_tableau()

Now let's reverse the problem and try to predict gender from the other variables to see how men and women differ (mainly to see whether salary appears). We can see that:
- The variable that best discriminates is AssessJob9: women place more value on inclusive work environments.
- On average, women have fewer years programming (professionally or non-professionally) than men.
- Finally, Hobby indicates whether the respondent programs as a hobby; men on average do so more often.
fitTree = rpart(Male ~ .,dataset, cp = 0.008)
fancyRpartPlot(fitTree)

Using Random Forest
Random Forest lets us do two things: measure the importance of each variable in the model (H2O scores a variable by how much the squared error improves across all the splits that use it), and simulate how much a person’s predicted salary changes if their gender changes.
When performing the variable importance analysis, the first thing we detect is that when trying to predict salary, gender does not appear among the first 10 variables, remaining approximately in position 60. For analysis, we plot the 10 most important in magnitude.
Three of the four most important variables are related to years, whether of life or programming. We see AssessBenefits2 again and people with the position of Engineering Manager appear. After that, the variables are already of little relevance so we won’t look at them.
library(h2o)  # h2o.init, as.h2o, h2o.randomForest, h2o.varimp
srv = h2o.init(nthreads = -1)
h2o.aux = mutate_if(dataset,is.ordered,as.character)
dataset.h2o = as.h2o(h2o.aux,destination_frame = "dataset_train")
salary_fit = h2o.randomForest(y = "ConvertedSalary",
x = setdiff(colnames(dataset),"ConvertedSalary"),
training_frame = dataset.h2o,
model_id = "salary_balanced",
ntrees = 1000)
h2o.saveModel(salary_fit,'data/models/',force = T)
importance = data.frame(h2o.varimp(salary_fit))
importance = head(importance,10)
ggplot(importance,aes(variable,scaled_importance)) +
geom_bar(stat = 'identity',fill='dark orange') +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
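The idea of leaving a variable out and seeing how much the model worsens can also be tried directly. The sketch below is an assumption-laden illustration: it uses a plain linear model instead of a random forest, made-up toy data, and a hypothetical helper rmse. It is not necessarily how h2o.varimp computes its scores.

```r
set.seed(42)
n <- 500
# Hypothetical toy data, NOT the survey
toy <- data.frame(
  years  = runif(n, 0, 20),
  age    = runif(n, 20, 60),
  gender = sample(c(0, 1), n, replace = TRUE)
)
# By construction, experience matters most, age a little, gender not at all
toy$salary <- 60000 + 3000 * toy$years + 200 * toy$age + rnorm(n, sd = 5000)

train <- toy[1:350, ]
test  <- toy[351:500, ]

# Root mean squared error of a model on a data frame (illustrative helper)
rmse <- function(model, data) sqrt(mean((data$salary - predict(model, data))^2))

full_rmse  <- rmse(lm(salary ~ ., data = train), test)
predictors <- setdiff(names(toy), "salary")

# Refit without each predictor and record how much the held-out error grows
drop_importance <- sapply(predictors, function(v) {
  reduced <- lm(reformulate(setdiff(predictors, v), response = "salary"), data = train)
  rmse(reduced, test) - full_rmse
})
print(sort(drop_importance, decreasing = TRUE))
```

A variable whose removal barely changes the error carries little information that the remaining variables do not already provide.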

The first interesting thing is that men are older on average than women, a little more than 1.5 years older:
aux = dataset %>%
group_by(Male) %>%
summarise(YearsCoding = mean(YearsCoding,na.rm=T),
YearsCodingProf = mean(YearsCodingProf,na.rm=T),
Age = mean(Age,na.rm=T),
number = n()) %>%
filter(!is.na(Male))
ggplot(aux,aes(Male,Age)) + geom_bar(stat = 'identity',fill='dark orange') + ggtitle('Edad por Genero')

Just like in years programming professionally, men have more years programming (professionally or non-professionally).
ggplot(aux,aes(Male,YearsCoding)) + geom_bar(stat = 'identity',fill='dark orange') + ggtitle('Años Programando por Genero')

Seeing the above, I wondered whether the difference in years programming is due to the difference in age, so I plotted how many more years men have than women in the three time dimensions. Age explains part of the gap, but not all of it. In particular, after subtracting the age difference, men still have about 2 more years programming and 1.5 more years of professional experience.
library(tidyr)  # gather
aux = aux %>%
select(-number,-Male) %>%
summarize_all(diff) %>%
gather(Variable,Difference)
ggplot(aux,aes(Variable,Difference)) + geom_bar(stat='identity',fill='dark orange') +ggtitle('Diferencia entre años Hombre y Mujer')

If we look at years of experience vs. age, there is a gap that starts at 20, stabilizes at 30, and grows much more at 40.
aux = dataset %>%
filter(!is.na(Male) & !is.na(Age) & Age > 20 & Age < 60) %>%
group_by(Male,Age) %>%
summarize(YearsCoding = mean(YearsCoding,na.rm = T))
ggplot(aux,aes(Age,YearsCoding,group = Male, color = Male)) + geom_line(size=2) + ggtitle('Años de experiencia laboral vs Edad')

Finally, if we look at salary vs. years of experience by gender, we see that at the beginning women have marginally better salaries than men, but then their salaries fall behind. (I removed points with 15 or fewer samples because, with so few respondents, the women’s average at the high-experience end dropped sharply.)
aux = dataset %>%
filter(!is.na(Male) & !is.na(YearsCoding)) %>%
group_by(Male,YearsCoding) %>%
summarize(ConvertedSalary = mean(ConvertedSalary,na.rm = T),
q = n()) %>%
filter(q >15)
ggplot(aux,aes(YearsCoding,ConvertedSalary,group = Male, color = Male)) + geom_line(size=2) + ggtitle('Sueldo vs Años de experiencia Laboral')

The other variable that turned out to be relevant is holding the position of Engineering Manager. Engineering Managers earn more than the sample average:
aux = dataset %>%
mutate(DevType_Engineering.manager = DevType_Engineering.manager==1) %>%
group_by(DevType_Engineering.manager) %>%
summarise(ConvertedSalary = mean(ConvertedSalary,na.rm = T))
ggplot(aux,aes(DevType_Engineering.manager,ConvertedSalary)) + geom_bar(stat='identity',fill='dark orange') +ggtitle('Sueldo Promedio Engineering Manager vs Resto')

It is also a position that men hold about twice as often as women: the share of all men who are Engineering Managers is double the share of all women who are:
aux = dataset %>%
group_by(Male,DevType_Engineering.manager) %>%
summarise(number = n()) %>%
group_by(Male) %>%
mutate(portion = number/sum(number)) %>%
filter(!is.na(Male) & DevType_Engineering.manager)
ggplot(aux,aes(Male,portion)) + geom_bar(stat='identity',fill='dark orange') +ggtitle('Porcion de Engineering Manager por Genero')
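The "share within each gender" computed above can be checked on a small example. This is a minimal base R sketch with made-up counts (2 of 10 men and 1 of 10 women as managers are assumptions chosen only to make the arithmetic visible).

```r
# Hypothetical toy counts, NOT the survey data
male    <- c(rep(TRUE, 10), rep(FALSE, 10))
manager <- c(rep(TRUE, 2), rep(FALSE, 8),   # 2 of 10 men are managers
             rep(TRUE, 1), rep(FALSE, 9))   # 1 of 10 women are managers

# margin = 1 normalizes each row, i.e. within each gender
shares <- prop.table(table(male = male, manager = manager), margin = 1)
print(shares)

# Share of men who are managers divided by share of women who are managers
ratio <- shares["TRUE", "TRUE"] / shares["FALSE", "TRUE"]
print(ratio)
```

Normalizing within each gender matters because there are far more men than women in the sample, so raw counts would not be comparable.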

Let’s see what attributes are needed to achieve this position, mainly to see if gender is a predictor attribute.
For this analysis, the variables ConvertedSalary, DevType_Product.manager, DevType_Csuite.executive.CEO.CTO.etc, and HopeFiveYears were removed because they appeared in the tree but were not attributes related to experience or gender.
We see again that having more than 5 years of experience is the most important variable, followed by having worked with AWS.
fitTree = rpart(DevType_Engineering.manager ~ .,select(dataset,
-ConvertedSalary,
-DevType_Product.manager,
-DevType_Csuite.executive.CEO.CTO.etc,
-HopeFiveYears), cp = 0.008)
fancyRpartPlot(fitTree)

Looking at who has worked with AWS, we see that men have more experience than women on this platform:
aux = dataset %>%
group_by(Male) %>%
summarise(PlatformWorkedWith_AWS = mean(PlatformWorkedWith_AWS,na.rm=T),
number = n()) %>%
filter(!is.na(Male))
ggplot(aux,aes(Male,PlatformWorkedWith_AWS)) + geom_bar(stat = 'identity',fill='dark orange') + ggtitle('Trabajo con AWS por Genero')

Simulation
Finally, using Random Forest we predict the salary of the entire sample, then flip everyone’s gender and predict again. The predicted salary falls by roughly 125 dollars per year when a man is converted to a woman, and rises by roughly 50 dollars per year when a woman is converted to a man:
aux = dataset %>%
filter(!is.na(Male)) %>%
group_by(Male) %>%
summarize(estimado = mean(as.numeric(estimado)),
sexo_opuesto = mean(as.numeric(sexo_opuesto))
)
tidy_aux = aux %>% gather(escenario,sueldo_promedio,-Male)
ggplot(tidy_aux,aes(Male,sueldo_promedio,fill= escenario)) + geom_bar(stat='identity', position='dodge') +ggtitle('Sueldo estimado vs Sueldo Estimado con genero Opuesto')

The detail of the graph:
## # A tibble: 2 x 3
## Male estimado sexo_opuesto
## <lgl> <dbl> <dbl>
## 1 FALSE 96298. 96351.
## 2 TRUE 105451. 105325.
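The code that produces the estimado and sexo_opuesto columns is not shown above, so here is a hedged sketch of the general idea: predict, flip the gender column, predict again. It swaps in a linear model for the random forest and uses made-up toy data with a built-in gender effect, so the numbers it prints say nothing about the survey.

```r
set.seed(7)
n <- 300
# Hypothetical toy data, NOT the survey
toy <- data.frame(
  years = runif(n, 0, 20),
  Male  = sample(c(TRUE, FALSE), n, replace = TRUE)
)
# A 1000-dollar gender effect is built in on purpose so there is something to recover
toy$salary <- 70000 + 2500 * toy$years + 1000 * toy$Male + rnorm(n, sd = 3000)

# Stand-in model (the article uses a random forest instead)
fit <- lm(salary ~ years + Male, data = toy)

# Prediction with the real gender
toy$estimado <- predict(fit, toy)

# Prediction with the gender flipped and everything else unchanged
flipped <- toy
flipped$Male <- !flipped$Male
toy$sexo_opuesto <- predict(fit, flipped)

# Average of both scenarios per gender, as in the table above
result <- aggregate(cbind(estimado, sexo_opuesto) ~ Male, data = toy, FUN = mean)
print(result)
```

Because only the gender column changes between the two predictions, the gap between estimado and sexo_opuesto isolates what the model itself attributes to gender, holding the other attributes fixed.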
Conclusions
Across the variables that explain salary, men on average make more profitable choices than women: having used AWS, using Windows, valuing company shares, or starting to program earlier, possibly as a hobby. The most important factor, though, is years of work experience, and women have less of it than men of the same age. On the other hand, if we look at salary vs. years of experience, men and women start on an equal footing, but from 10 years of experience onward men’s salaries grow faster.
Using the data collected, I lean towards the position that the lower salary is due to the decisions women make from 10 years of experience onward, a point which corresponds to when on average they are 40 years old and surely already have a family and want to be with them. Although it’s an assumption since that family data is not in the dataset, something certainly happens around the age of 40.
I would like to see how you conclude in the comments. I will be reading them. I would like a healthy discussion to arise among the readers. If you want to see the data, it’s on my S3 https://s3.console.aws.amazon.com/s3/buckets/danielfm123-public/proyects/salary-gap/ or at the original source https://insights.stackoverflow.com/survey
As always, the code is on GitHub: https://github.com/danielfm123/salary-gap
If you liked the article, you can follow me on facebook.com/geekosas or sign up to receive emails when I publish new articles.
I also wanted to thank Christian Villarroel for participating in the writing of this sensitive article.
