{"id":3326,"date":"2018-05-23T18:32:09","date_gmt":"2018-05-23T16:32:09","guid":{"rendered":"https:\/\/geekosas.com\/?p=3326"},"modified":"2026-05-23T18:34:10","modified_gmt":"2026-05-23T16:34:10","slug":"write-in-aws-redshift-at-fill-speed","status":"publish","type":"post","link":"https:\/\/geekosas.com\/index.php\/2018\/05\/23\/write-in-aws-redshift-at-fill-speed\/","title":{"rendered":"Write in AWS Redshift at Fill Speed!"},"content":{"rendered":"<p>As many know, <a href=\"https:\/\/aws.amazon.com\/redshift\/\">Redshift<\/a> is a fork of <a href=\"https:\/\/www.postgresql.org\/\">Postgres<\/a> made by <a href=\"https:\/\/aws.amazon.com\/\">Amazon<\/a> to provide a Data Warehouse service. The big difference between these two products is that the former is a columnar and compressed database, while the latter is not.<\/p>\n<p>Columnar databases are very fast for aggregations and joins, but not so much for inserting records (do not expect more than 4 records per second with a normal insert in Redshift), so they generally have some non\u2011standard method. In the case of Redshift, the method is as follows:<\/p>\n<ol>\n<li>If the table does not exist in the database: create it.<\/li>\n<li>Save the table you want to load in S3, ideally in several CSV files compressed with gzip.<\/li>\n<li>From Redshift call the COPY command to read the files from S3 and load them into the database.<\/li>\n<\/ol>\n<p>As you can see, it is not a standard process, so <a href=\"https:\/\/github.com\/sicarul\">sicarul<\/a> created the library <a href=\"https:\/\/github.com\/sicarul\/redshiftTools\">redshiftTools<\/a> that performs all this process sequentially, but the library was quite slow: loading a dataset of 1.5 million records and 800 columns takes 21 minutes. That is why I decided to review the code to see if I could make some optimizations, among which I found:<\/p>\n<ol>\n<li>The package used write.csv to generate the CSVs, so I migrated to <a href=\"https:\/\/cran.r-project.org\/web\/packages\/data.table\/index.html\">fwrite<\/a>.<\/li>\n<li>The compression of the CSVs was sequential, so I parallelized the compression of the CSVs using <a href=\"https:\/\/cran.r-project.org\/web\/packages\/future\/index.html\">future<\/a>.<\/li>\n<li>There was only one function to perform an update of a table, which was slow, so I created a function that only inserts records into a table.<\/li>\n<\/ol>\n<p>By making all these changes, the complete loading process was reduced to less than 6 minutes, being 350% faster than the existing one (so I already submitted a pull request to the official repository).<\/p>\n<p>While the pull request is accepted, you can enjoy the optimized library in my <a href=\"https:\/\/github.com\/danielfm123\/redshiftTools\">GitHub<\/a>. You can install it with the following commands from R:<\/p>\n<pre><code class=\"language-r\">install.packages(&quot;aws.ec2metadata&quot;, repos = c(cloudyr = &quot;http:\/\/cloudyr.github.io\/drat&quot;, getOption(&quot;repos&quot;)))\npackages = c(&quot;devtools&quot;, &quot;httr&quot;, &quot;package_n&quot;,&quot;aws.s3&quot;, &quot;Rcpp&quot;,&quot;DBI&quot;,&quot;data.table&quot;,&quot;future&quot;,&quot;future.apply&quot;,&quot;R.utils&quot;,&quot;dplyr&quot;)\nfor(pack in packages){\n  if(!pack %in% rownames(installed.packages())){\n    print(paste(&quot;installing&quot;,pack))\n    install.packages(pack)\n  }\n}\ndevtools::install_github(&quot;danielfm123\/redshiftTools&quot;)<\/code><\/pre>\n<p>A small example of how to use it in the following code:<\/p>\n<pre><code class=\"language-r\">library(data.table)\nlibrary(RPostgreSQL)\nlibrary(toolkitEntel)\nsetDTthreads(0)\ndataset = fread(&quot;Path to some file you want to read&quot;,data.table = F,sep = &quot;|&quot;,encoding = &quot;Latin-1&quot;,nrows = -1)\n#Connect to SQL\ncon = dbConnect(\n  dbDriver(&quot;PostgreSQL&quot;),\n  dbname = &quot;database name&quot;,\n  user = &quot;user&quot;,\n  password = &quot;password&quot;,\n  host = &quot;server ip&quot;\n)\n# Create a table, detects column types\nrs_create_table(\n  dataset,\n  dbcon = con,\n  table_name = &#039;table_name&#039;,\n  bucket = &quot;bucket_name&quot;,\n  region = &quot;us-east-2&quot;,\n  keys = &quot;AWS AIM KEY of S3&quot;,\n  secret_key = &quot;AWS AIM SECRET of S3&quot;\n)\n# add data to an existing table\nrs_append_table(\n  dataset,\n  dbcon = con,\n  table_name = &#039;table_name&#039;,\n  bucket = &quot;bucket_name&quot;,\n  region = &quot;us-east-2&quot;,\n  keys = &quot;AWS AIM KEY of S3&quot;,\n  secret_key = &quot;AWS AIM SECRET of S3&quot;\n)\n# disconnect from redshift\ndbDisconnect(con)<\/code><\/pre>\n","protected":false},"excerpt":{"rendered":"<div class=\"mh-excerpt\"><p>As many know, Redshift is a fork of Postgres made by Amazon to provide a Data Warehouse service. The big difference between these two products <a class=\"mh-excerpt-more\" href=\"https:\/\/geekosas.com\/index.php\/2018\/05\/23\/write-in-aws-redshift-at-fill-speed\/\" title=\"Write in AWS Redshift at Fill Speed!\">[&#8230;]<\/a><\/p>\n<\/div>","protected":false},"author":1,"featured_media":2561,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2},"jetpack_post_was_ever_published":false},"categories":[1],"tags":[],"class_list":["post-3326","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-sin-categoria"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"https:\/\/i0.wp.com\/geekosas.com\/wp-content\/uploads\/2018\/08\/RtoRedshift.png?fit=600%2C300&ssl=1","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/p8vjqF-RE","jetpack-related-posts":[{"id":3223,"url":"https:\/\/geekosas.com\/index.php\/2020\/05\/14\/insert-records-into-the-database-at-full-speed\/","url_meta":{"origin":3326,"position":0},"title":"Insert Records into the Database at Full Speed","author":"Daniel Fischer","date":"2020-05-14","format":false,"excerpt":"ETL tools are very useful for performing automated and recurring data transformation processes; they are characterized by performing three tasks: (E) Extract: Connect to one or more sources and extract data. (T) Transform: Transform or manipulate the data. (L) Load: Load the transformed data into the final repository. That is\u2026","rel":"","context":"In &quot;Sin categor\u00eda&quot;","block_context":{"text":"Sin categor\u00eda","link":"https:\/\/geekosas.com\/index.php\/category\/sin-categoria\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/geekosas.com\/wp-content\/uploads\/2021\/09\/patch.png?fit=728%2C380&ssl=1&resize=350%2C200","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/geekosas.com\/wp-content\/uploads\/2021\/09\/patch.png?fit=728%2C380&ssl=1&resize=350%2C200 1x, https:\/\/i0.wp.com\/geekosas.com\/wp-content\/uploads\/2021\/09\/patch.png?fit=728%2C380&ssl=1&resize=525%2C300 1.5x, https:\/\/i0.wp.com\/geekosas.com\/wp-content\/uploads\/2021\/09\/patch.png?fit=728%2C380&ssl=1&resize=700%2C400 2x"},"classes":[]},{"id":3314,"url":"https:\/\/geekosas.com\/index.php\/2017\/05\/23\/estimation-of-vote-distribution-in-the-second-round-of-the-chilean-presidential-elections-2017\/","url_meta":{"origin":3326,"position":1},"title":"Estimation of vote distribution in the second round of the Chilean presidential elections 2017","author":"Daniel Fischer","date":"2017-05-23","format":false,"excerpt":"This article was made in a hurry and with little data; the definitive one is at Who voted for each candidate? After the first round of the Chilean presidential elections, it is traditional for each candidate to begin \"auctioning off\" the votes of their followers to the candidate they feel\u2026","rel":"","context":"In &quot;Sin categor\u00eda&quot;","block_context":{"text":"Sin categor\u00eda","link":"https:\/\/geekosas.com\/index.php\/category\/sin-categoria\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/geekosas.com\/wp-content\/uploads\/2017\/12\/segunda-vuelta.jpg?fit=710%2C399&ssl=1&resize=350%2C200","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/geekosas.com\/wp-content\/uploads\/2017\/12\/segunda-vuelta.jpg?fit=710%2C399&ssl=1&resize=350%2C200 1x, https:\/\/i0.wp.com\/geekosas.com\/wp-content\/uploads\/2017\/12\/segunda-vuelta.jpg?fit=710%2C399&ssl=1&resize=525%2C300 1.5x, https:\/\/i0.wp.com\/geekosas.com\/wp-content\/uploads\/2017\/12\/segunda-vuelta.jpg?fit=710%2C399&ssl=1&resize=700%2C400 2x"},"classes":[]},{"id":3296,"url":"https:\/\/geekosas.com\/index.php\/2017\/05\/23\/i-will-teach-an-r-course\/","url_meta":{"origin":3326,"position":2},"title":"I will teach an R course.","author":"Daniel Fischer","date":"2017-05-23","format":false,"excerpt":"The course will be at Microsoft Chile on September 22: The R Intensive is an event designed for those who have data analysis and modeling needs in their work and want to gain in 1 day the theoretical and practical knowledge to start solving their analytical challenges with this tool.\u2026","rel":"","context":"In &quot;Sin categor\u00eda&quot;","block_context":{"text":"Sin categor\u00eda","link":"https:\/\/geekosas.com\/index.php\/category\/sin-categoria\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/geekosas.com\/wp-content\/uploads\/2016\/11\/RStudio-Ball.png?fit=1000%2C1000&ssl=1&resize=350%2C200","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/geekosas.com\/wp-content\/uploads\/2016\/11\/RStudio-Ball.png?fit=1000%2C1000&ssl=1&resize=350%2C200 1x, https:\/\/i0.wp.com\/geekosas.com\/wp-content\/uploads\/2016\/11\/RStudio-Ball.png?fit=1000%2C1000&ssl=1&resize=525%2C300 1.5x, https:\/\/i0.wp.com\/geekosas.com\/wp-content\/uploads\/2016\/11\/RStudio-Ball.png?fit=1000%2C1000&ssl=1&resize=700%2C400 2x"},"classes":[]},{"id":3291,"url":"https:\/\/geekosas.com\/index.php\/2017\/05\/23\/movies-2016\/","url_meta":{"origin":3326,"position":3},"title":"Movies 2016","author":"Daniel Fischer","date":"2017-05-23","format":false,"excerpt":"Movies make us laugh, cry, and some... sleep, so I decided to do a small analysis on 2016 movies. As with Video Games and Data Science, we did web scraping from www.metacritic.com to generate a database, in which, for each movie we obtained the following information: Country of Origin Genres\u2026","rel":"","context":"In &quot;Sin categor\u00eda&quot;","block_context":{"text":"Sin categor\u00eda","link":"https:\/\/geekosas.com\/index.php\/category\/sin-categoria\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/www.geekosas.com\/wp-content\/uploads\/2017\/03\/histogramas-300x120.png?resize=350%2C200","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/www.geekosas.com\/wp-content\/uploads\/2017\/03\/histogramas-300x120.png?resize=350%2C200 1x, https:\/\/i0.wp.com\/www.geekosas.com\/wp-content\/uploads\/2017\/03\/histogramas-300x120.png?resize=525%2C300 1.5x, https:\/\/i0.wp.com\/www.geekosas.com\/wp-content\/uploads\/2017\/03\/histogramas-300x120.png?resize=700%2C400 2x"},"classes":[]},{"id":3343,"url":"https:\/\/geekosas.com\/index.php\/2019\/05\/23\/we-are-back\/","url_meta":{"origin":3326,"position":4},"title":"We are back","author":"Daniel Fischer","date":"2019-05-23","format":false,"excerpt":"Despite our EX-HOSTING DigitalOcean, which after being a customer for 5 years deleted all the content of my account, including the work of this website, due to a one-month late payment. It is reasonable that servers are taken down for non-payment, or even that they are deleted. I must say\u2026","rel":"","context":"In &quot;Sin categor\u00eda&quot;","block_context":{"text":"Sin categor\u00eda","link":"https:\/\/geekosas.com\/index.php\/category\/sin-categoria\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/geekosas.com\/wp-content\/uploads\/2019\/07\/digital-ocean.png?fit=815%2C495&ssl=1&resize=350%2C200","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/geekosas.com\/wp-content\/uploads\/2019\/07\/digital-ocean.png?fit=815%2C495&ssl=1&resize=350%2C200 1x, https:\/\/i0.wp.com\/geekosas.com\/wp-content\/uploads\/2019\/07\/digital-ocean.png?fit=815%2C495&ssl=1&resize=525%2C300 1.5x, https:\/\/i0.wp.com\/geekosas.com\/wp-content\/uploads\/2019\/07\/digital-ocean.png?fit=815%2C495&ssl=1&resize=700%2C400 2x"},"classes":[]},{"id":3330,"url":"https:\/\/geekosas.com\/index.php\/2018\/05\/23\/elastic-computing-for-begginers-spanish-video\/","url_meta":{"origin":3326,"position":5},"title":"Elastic Computing for Begginers (Spanish video)","author":"Daniel Fischer","date":"2018-05-23","format":false,"excerpt":"I made a video showing how to use a template of an EC2 Instance (Virtual PC) for Data Science\/Analytics in AWS using elastic computing. The AMI image was built by me and is based on Arch, so you can always have the latest version of everything. On the other hand,\u2026","rel":"","context":"In &quot;Sin categor\u00eda&quot;","block_context":{"text":"Sin categor\u00eda","link":"https:\/\/geekosas.com\/index.php\/category\/sin-categoria\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/geekosas.com\/wp-content\/uploads\/2018\/12\/onedrive-illo3.jpg?fit=1200%2C675&ssl=1&resize=350%2C200","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/geekosas.com\/wp-content\/uploads\/2018\/12\/onedrive-illo3.jpg?fit=1200%2C675&ssl=1&resize=350%2C200 1x, https:\/\/i0.wp.com\/geekosas.com\/wp-content\/uploads\/2018\/12\/onedrive-illo3.jpg?fit=1200%2C675&ssl=1&resize=525%2C300 1.5x, https:\/\/i0.wp.com\/geekosas.com\/wp-content\/uploads\/2018\/12\/onedrive-illo3.jpg?fit=1200%2C675&ssl=1&resize=700%2C400 2x, https:\/\/i0.wp.com\/geekosas.com\/wp-content\/uploads\/2018\/12\/onedrive-illo3.jpg?fit=1200%2C675&ssl=1&resize=1050%2C600 3x"},"classes":[]}],"_links":{"self":[{"href":"https:\/\/geekosas.com\/index.php\/wp-json\/wp\/v2\/posts\/3326","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/geekosas.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/geekosas.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/geekosas.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/geekosas.com\/index.php\/wp-json\/wp\/v2\/comments?post=3326"}],"version-history":[{"count":1,"href":"https:\/\/geekosas.com\/index.php\/wp-json\/wp\/v2\/posts\/3326\/revisions"}],"predecessor-version":[{"id":3327,"href":"https:\/\/geekosas.com\/index.php\/wp-json\/wp\/v2\/posts\/3326\/revisions\/3327"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/geekosas.com\/index.php\/wp-json\/wp\/v2\/media\/2561"}],"wp:attachment":[{"href":"https:\/\/geekosas.com\/index.php\/wp-json\/wp\/v2\/media?parent=3326"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/geekosas.com\/index.php\/wp-json\/wp\/v2\/categories?post=3326"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/geekosas.com\/index.php\/wp-json\/wp\/v2\/tags?post=3326"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}