{"id":3347,"date":"2019-05-23T19:22:44","date_gmt":"2019-05-23T17:22:44","guid":{"rendered":"https:\/\/geekosas.com\/?p=3347"},"modified":"2026-05-23T19:23:07","modified_gmt":"2026-05-23T17:23:07","slug":"uncovering-averages","status":"publish","type":"post","link":"https:\/\/geekosas.com\/index.php\/2019\/05\/23\/uncovering-averages\/","title":{"rendered":"Uncovering averages"},"content":{"rendered":"<p>As they say, averages hide many things. In the article <a href=\"https:\/\/www.geekosas.com\/index.php\/2019\/02\/20\/gender-pay-gap-en-tecnologia\/\">gender-pay-gap-en-tecnologia<\/a> we saw an analysis that showed how, for that data, the salary difference between men and women can be explained by factors other than gender.<\/p>\n<p>Now we are going to look at a machine learning technique, very simple to explain and communicate, for &quot;uncovering&quot; what lies beneath the averages.<\/p>\n<h4>Approach<\/h4>\n<p>Imagine you are the data scientist in the customer satisfaction area of a company, responsible for the rating your customers give to the company&#8217;s service (or some other KPI). This rating comes from a monthly sample of customers who contacted the call center.<\/p>\n<p>Today is the day: the new satisfaction survey has arrived, and your boss is eager to know how last month went. Eureka! The average rating increased from 5.577 to 5.723, so everyone gets the bonus and goes out to lunch.<\/p>\n<p>But what does that average hide? Did the rating really increase? 
Let&#8217;s see how to quickly perform this analysis.<\/p>\n<h4>Data<\/h4>\n<p>For each month we have a table with 2000 observations (data1 for the previous month and data2 for the current one) that looks like this (simulated data):<\/p>\n<pre><code class=\"language-r\">  id        causa genero region nota\n1  1       equipo hombre  norte    6\n2  2        saldo  mujer  norte    8\n3  3  facturacion hombre  norte    2\n4  4        saldo  mujer centro    6\n5  5 conectividad  mujer centro    9\n6  6 conectividad hombre centro    4<\/code><\/pre>\n<h4>Model<\/h4>\n<p>To understand which variables explain the rating, we fit a regression tree on the previous month&#8217;s data with rpart:<\/p>\n<pre><code class=\"language-r\">library(rpart)\nlibrary(rattle)  # provides fancyRpartPlot()\n# Regression tree for the rating, fitted on the previous month (data1)\nfit = rpart(nota ~ causa + genero + region, data = data1, cp = 0.015)\nfancyRpartPlot(fit)<\/code><\/pre>\n\n<figure class=\"wp-block-image\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" width=\"501\" height=\"373\" data-attachment-id=\"2747\" data-permalink=\"https:\/\/geekosas.com\/index.php\/es\/2019\/08\/04\/destapando-los-promedios\/rpart\/\" data-orig-file=\"https:\/\/i0.wp.com\/geekosas.com\/wp-content\/uploads\/2019\/08\/rpart.png?fit=501%2C373&amp;ssl=1\" data-orig-size=\"501,373\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"rpart\" data-image-description=\"\" data-image-caption=\"\" data-large-file=\"https:\/\/i0.wp.com\/geekosas.com\/wp-content\/uploads\/2019\/08\/rpart.png?fit=501%2C373&amp;ssl=1\" src=\"https:\/\/i0.wp.com\/www.geekosas.com\/wp-content\/uploads\/2019\/08\/rpart.png?resize=501%2C373&#038;ssl=1\" alt=\"\" 
class=\"wp-image-2747\" srcset=\"https:\/\/i0.wp.com\/geekosas.com\/wp-content\/uploads\/2019\/08\/rpart.png?w=501&amp;ssl=1 501w, https:\/\/i0.wp.com\/geekosas.com\/wp-content\/uploads\/2019\/08\/rpart.png?resize=300%2C223&amp;ssl=1 300w\" sizes=\"auto, (max-width: 501px) 100vw, 501px\" \/><\/figure>\n\n<p>The tree reads like this: at the root (top), the average rating is 5.6. The first split is on the reason for the call: when the reason is connectivity or billing, the average drops to 4.4; otherwise it rises to 6.3.<\/p>\n<p>Each of those branches splits again: the left branch by the customer&#8217;s gender, where men give an average rating of 4.1 and women 5.2, while the right branch splits by the geographic area where the customer lives.<\/p>\n<p>The intuition is the following: my total rating can change for two reasons:<\/p>\n<ul>\n<li>Because a leaf changed its rating.<\/li>\n<li>Because a leaf became more important, that is, its share of the sample changed (for example, if more women are surveyed, my rating should rise, since women rate higher than men in that branch).<\/li>\n<\/ul>\n<p>We will try to decompose the contributions into these two factors:<\/p>\n<pre><code class=\"language-r\">library(dplyr)      # %&gt;%, group_by(), summarise(), mutate(), left_join()\nlibrary(treeClust)  # rpart.predict.leaves() (assumed to come from treeClust)\n# Assign each observation to a leaf of the fitted tree and summarise by leaf\ndataset1 = data.frame(data1,hoja = rpart.predict.leaves(fit,data1)) %&gt;%\n  group_by(hoja) %&gt;%\n  summarise(nota1 = mean(nota),desvest1 = sd(nota), freq1 = n())\ndataset2 = data.frame(data2,hoja = rpart.predict.leaves(fit,data2)) %&gt;%\n  group_by(hoja) %&gt;%\n  summarise(nota2 = mean(nota), freq2 = n())\n# Join both months by leaf and compute each leaf&#39;s share of the sample\ndataset = dataset1 %&gt;%\n  left_join(dataset2, by = &quot;hoja&quot;) %&gt;%\n  ungroup() %&gt;%\n  mutate(peso1 = freq1\/sum(freq1),\n         peso2 = freq2\/sum(freq2))\ndataset = dataset %&gt;%\n  mutate(\n    delta_freq = (freq2 - freq1)\/freq1,\n    delta_nota = nota2 - nota1,\n    # rough one-sided z-test using only the first month&#39;s standard error\n    pval = pnorm(-abs(delta_nota),0,desvest1\/sqrt(freq1))\n    )\nprint(dataset)\n# A tibble: 4 x 11\n   hoja nota1 desvest1 freq1 nota2 freq2 peso1 peso2 delta_freq delta_nota     pval\n  &lt;int&gt; &lt;dbl&gt;    &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt; 
&lt;dbl&gt;      &lt;dbl&gt;      &lt;dbl&gt;    &lt;dbl&gt;\n1     3  4.09     1.63   560  3.47   221 0.28  0.110     -0.605     -0.620 1.40e-19\n2     4  5.22     1.64   228  4.35   567 0.114 0.284      1.49      -0.865 8.67e-16\n3     6  5.93     1.57   697  6.42   697 0.348 0.348      0          0.494 4.55e-17\n4     7  6.88     1.65   515  7.25   515 0.258 0.258      0          0.373 1.36e- 7<\/code><\/pre>\n<p>In the table above, the first row corresponds to the leftmost leaf; as you go down the rows, you move to the right across the tree&#8217;s leaves. In leaves 3 and 4 (rows 1 and 2), which correspond to calls about billing and connectivity, the rating drops considerably (column delta_nota = nota2 - nota1). Moreover, a rough one-sided z-test based on the first month&#8217;s standard error suggests that this difference is statistically significant (column pval).<\/p>\n<p>If we decompose the overall change in rating into the two factors, frequency and rating, we get the result below. The frequency contribution is nota1 * (peso2 - peso1), written in the code as peso1 * nota1 * delta_freq (the two agree because both samples have 2000 observations), and the rating contribution is peso2 * delta_nota; together they add up exactly to the change in the overall average.<\/p>\n<pre><code class=\"language-r\">dataset = dataset %&gt;%\n  mutate(aporte_dfreq = peso1 * nota1 * (delta_freq),  # effect of the change in leaf frequency\n         aporte_dnota = peso2 * delta_nota             # effect of the change in leaf rating\n  )\ndataset %&gt;% select(-pval)\n# A tibble: 4 x 12\n   hoja nota1 desvest1 freq1 nota2 freq2 peso1 peso2 delta_freq delta_nota aporte_dfreq aporte_dnota\n  &lt;int&gt; &lt;dbl&gt;    &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt; &lt;dbl&gt;      &lt;dbl&gt;      &lt;dbl&gt;        &lt;dbl&gt;        &lt;dbl&gt;\n1     3  4.09     1.63   560  3.47   221 0.28  0.110     -0.605     -0.620       -0.693      -0.0685\n2     4  5.22     1.64   228  4.35   567 0.114 0.284      1.49      -0.865        0.885      -0.245\n3     6  5.93     1.57   697  6.42   697 0.348 0.348      0          0.494        0           0.172\n4     7  6.88     1.65   515  7.25   515 0.258 0.258      0          0.373        0           0.096\n> #validation\n> sum(dataset$aporte_dnota) + sum(dataset$aporte_dfreq)\n[1] 0.1465\n> mean(data2$nota) - 
mean(data1$nota)\n[1] 0.1465\n# Factor contributions\n> sum(dataset$aporte_dfreq)\n[1] 0.1921425\n> sum(dataset$aporte_dnota)\n[1] -0.04564248<\/code><\/pre>\n<p>In short, the change in ratings within the leaves cost me about 0.046 points (column aporte_dnota), and the gain in the overall rating comes from the change in frequencies, which contributed 0.192 (column aporte_dfreq), mainly because there were more women in the sample.<\/p>\n<h4>Conclusion<\/h4>\n<p>We can go celebrate, because the bonus was indeed earned, but we need to look into what happened with the connectivity and billing causes, since next month we might not benefit from an increase in women in the survey.<\/p>\n<p>A good starting point is to check whether the normal service protocols for connectivity and\/or billing have changed, or even to listen to some of the calls to detect what is happening. The important thing is to correct the situation soon.<\/p>\n<p>Cheers!<\/p>","protected":false},"excerpt":{"rendered":"<div class=\"mh-excerpt\"><p>As they say, averages hide many things. 
In the article gender-pay-gap-en-tecnologia we saw an analysis that showed how, for that data, the salary difference between <a class=\"mh-excerpt-more\" href=\"https:\/\/geekosas.com\/index.php\/2019\/05\/23\/uncovering-averages\/\" title=\"Uncovering averages\">[&#8230;]<\/a><\/p>\n<\/div>","protected":false},"author":1,"featured_media":2745,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2},"jetpack_post_was_ever_published":false},"categories":[1],"tags":[],"class_list":["post-3347","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-sin-categoria"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"https:\/\/i0.wp.com\/geekosas.com\/wp-content\/uploads\/2019\/08\/distribution.png?fit=1200%2C767&ssl=1","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/p8vjqF-RZ","jetpack-related-posts":[{"id":3349,"url":"https:\/\/geekosas.com\/index.php\/2019\/05\/23\/uncovering-averages-part-2\/","url_meta":{"origin":3347,"position":0},"title":"Uncovering averages Part 2","author":"Daniel Fischer","date":"2019-05-23","format":false,"excerpt":"A few days ago I wrote the article Uncovering Averages which basically did was to open an average value into factors using trees. Please read the article before continuing. 
In that example analysis I created the dataset and therefore knew exactly where the change was, which was in the cause\u2026","rel":"","context":"In &quot;Sin categor\u00eda&quot;","block_context":{"text":"Sin categor\u00eda","link":"https:\/\/geekosas.com\/index.php\/category\/sin-categoria\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/geekosas.com\/wp-content\/uploads\/2019\/08\/averages.jpeg?fit=500%2C345&ssl=1&resize=350%2C200","width":350,"height":200},"classes":[]},{"id":3339,"url":"https:\/\/geekosas.com\/index.php\/2019\/05\/23\/gender-pay-gap-in-technology\/","url_meta":{"origin":3347,"position":1},"title":"Gender Pay Gap in Technology","author":"Daniel Fischer","date":"2019-05-23","format":false,"excerpt":"The Gender Pay Gap is the difference that exists on average in the salaries of Men vs. Women. Today there are people who attribute this to discrimination, while others say it is due to the decisions that men on average make versus those of women. Since both opinions have merit,\u2026","rel":"","context":"In &quot;Sin categor\u00eda&quot;","block_context":{"text":"Sin categor\u00eda","link":"https:\/\/geekosas.com\/index.php\/category\/sin-categoria\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/geekosas.com\/wp-content\/uploads\/2019\/02\/GenderPayGap-201803070107196681-20180404082357920.jpg?fit=619%2C413&ssl=1&resize=350%2C200","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/geekosas.com\/wp-content\/uploads\/2019\/02\/GenderPayGap-201803070107196681-20180404082357920.jpg?fit=619%2C413&ssl=1&resize=350%2C200 1x, https:\/\/i0.wp.com\/geekosas.com\/wp-content\/uploads\/2019\/02\/GenderPayGap-201803070107196681-20180404082357920.jpg?fit=619%2C413&ssl=1&resize=525%2C300 1.5x"},"classes":[]},{"id":3321,"url":"https:\/\/geekosas.com\/index.php\/2018\/05\/23\/separate-effects-and-cohort-analysis\/","url_meta":{"origin":3347,"position":2},"title":"Separate Effects and Cohort Analysis","author":"Daniel 
Fischer","date":"2018-05-23","format":false,"excerpt":"In subscription businesses (Newspapers, Cell Phones, Insurance, etc...), the business is always the same: acquire a customer and then receive cash flows associated with a service provided by the company. The day the customer cancels the service is called CHURN, and the customer becomes inactive, suspending both revenue and service.\u2026","rel":"","context":"In &quot;Sin categor\u00eda&quot;","block_context":{"text":"Sin categor\u00eda","link":"https:\/\/geekosas.com\/index.php\/category\/sin-categoria\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/www.geekosas.com\/wp-content\/uploads\/2018\/06\/pressent-value.png?resize=350%2C200","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/www.geekosas.com\/wp-content\/uploads\/2018\/06\/pressent-value.png?resize=350%2C200 1x, https:\/\/i0.wp.com\/www.geekosas.com\/wp-content\/uploads\/2018\/06\/pressent-value.png?resize=525%2C300 1.5x, https:\/\/i0.wp.com\/www.geekosas.com\/wp-content\/uploads\/2018\/06\/pressent-value.png?resize=700%2C400 2x"},"classes":[]},{"id":3291,"url":"https:\/\/geekosas.com\/index.php\/2017\/05\/23\/movies-2016\/","url_meta":{"origin":3347,"position":3},"title":"Movies 2016","author":"Daniel Fischer","date":"2017-05-23","format":false,"excerpt":"Movies make us laugh, cry, and some... sleep, so I decided to do a small analysis on 2016 movies. 
As with Video Games and Data Science, we did web scraping from www.metacritic.com to generate a database, in which, for each movie we obtained the following information: Country of Origin Genres\u2026","rel":"","context":"In &quot;Sin categor\u00eda&quot;","block_context":{"text":"Sin categor\u00eda","link":"https:\/\/geekosas.com\/index.php\/category\/sin-categoria\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/www.geekosas.com\/wp-content\/uploads\/2017\/03\/histogramas-300x120.png?resize=350%2C200","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/www.geekosas.com\/wp-content\/uploads\/2017\/03\/histogramas-300x120.png?resize=350%2C200 1x, https:\/\/i0.wp.com\/www.geekosas.com\/wp-content\/uploads\/2017\/03\/histogramas-300x120.png?resize=525%2C300 1.5x, https:\/\/i0.wp.com\/www.geekosas.com\/wp-content\/uploads\/2017\/03\/histogramas-300x120.png?resize=700%2C400 2x"},"classes":[]},{"id":3319,"url":"https:\/\/geekosas.com\/index.php\/2018\/05\/23\/have-video-games-gotten-worse\/","url_meta":{"origin":3347,"position":4},"title":"Have video games gotten worse?","author":"Daniel Fischer","date":"2018-05-23","format":false,"excerpt":"Introduction \/ Abstract A data scientist is one who manages to make data speak to them; it is basically a conversation, where you ask questions and the data answers. In this notebook I want to share my latest conversation with this dataset of scores assigned to different video games. 
The\u2026","rel":"","context":"In &quot;Sin categor\u00eda&quot;","block_context":{"text":"Sin categor\u00eda","link":"https:\/\/geekosas.com\/index.php\/category\/sin-categoria\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/geekosas.com\/wp-content\/uploads\/2018\/04\/consoles-800x491.jpg?fit=800%2C491&ssl=1&resize=350%2C200","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/geekosas.com\/wp-content\/uploads\/2018\/04\/consoles-800x491.jpg?fit=800%2C491&ssl=1&resize=350%2C200 1x, https:\/\/i0.wp.com\/geekosas.com\/wp-content\/uploads\/2018\/04\/consoles-800x491.jpg?fit=800%2C491&ssl=1&resize=525%2C300 1.5x, https:\/\/i0.wp.com\/geekosas.com\/wp-content\/uploads\/2018\/04\/consoles-800x491.jpg?fit=800%2C491&ssl=1&resize=700%2C400 2x"},"classes":[]},{"id":3274,"url":"https:\/\/geekosas.com\/index.php\/2016\/05\/23\/segment-customers-step-by-step\/","url_meta":{"origin":3347,"position":5},"title":"Segment customers step by step","author":"Daniel Fischer","date":"2016-05-23","format":false,"excerpt":"Previously I wrote about neural networks (click here to see it). Neural networks and all other \"supervised methods\" are used when you have a sample of values to predict. 
But when you know what you want to achieve but do not have a sample of the value to predict, the\u2026","rel":"","context":"In &quot;Sin categor\u00eda&quot;","block_context":{"text":"Sin categor\u00eda","link":"https:\/\/geekosas.com\/index.php\/category\/sin-categoria\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/geekosas.com\/wp-content\/uploads\/2016\/05\/kmenas6.png?fit=620%2C539&ssl=1&resize=350%2C200","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/geekosas.com\/wp-content\/uploads\/2016\/05\/kmenas6.png?fit=620%2C539&ssl=1&resize=350%2C200 1x, https:\/\/i0.wp.com\/geekosas.com\/wp-content\/uploads\/2016\/05\/kmenas6.png?fit=620%2C539&ssl=1&resize=525%2C300 1.5x"},"classes":[]}],"_links":{"self":[{"href":"https:\/\/geekosas.com\/index.php\/wp-json\/wp\/v2\/posts\/3347","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/geekosas.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/geekosas.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/geekosas.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/geekosas.com\/index.php\/wp-json\/wp\/v2\/comments?post=3347"}],"version-history":[{"count":1,"href":"https:\/\/geekosas.com\/index.php\/wp-json\/wp\/v2\/posts\/3347\/revisions"}],"predecessor-version":[{"id":3348,"href":"https:\/\/geekosas.com\/index.php\/wp-json\/wp\/v2\/posts\/3347\/revisions\/3348"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/geekosas.com\/index.php\/wp-json\/wp\/v2\/media\/2745"}],"wp:attachment":[{"href":"https:\/\/geekosas.com\/index.php\/wp-json\/wp\/v2\/media?parent=3347"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/geekosas.com\/index.php\/wp-json\/wp\/v2\/categories?post=3347"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/geekosas.com\/index.php\/wp-json\/wp\/v2\/tags?post=3347"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":tr
ue}]}}