{"id":3223,"date":"2020-05-14T16:26:23","date_gmt":"2020-05-14T14:26:23","guid":{"rendered":"https:\/\/geekosas.com\/?p=3223"},"modified":"2026-05-23T19:28:57","modified_gmt":"2026-05-23T17:28:57","slug":"insert-records-into-the-database-at-full-speed","status":"publish","type":"post","link":"https:\/\/geekosas.com\/index.php\/2020\/05\/14\/insert-records-into-the-database-at-full-speed\/","title":{"rendered":"Insert Records into the Database at Full Speed"},"content":{"rendered":"<p>ETL tools are very useful for performing automated and recurring data transformation processes; they are characterized by performing three tasks:<br \/>\n(E) Extract: Connect to one or more sources and extract data.<br \/>\n(T) Transform: Transform or manipulate the data.<br \/>\n(L) Load: Load the transformed data into the final repository.<\/p>\n<p>That is where the name <strong>ETL<\/strong> comes from.<\/p>\n<p>Generally, the final stage (Load) loads the data into a data lake or an SQL database, typically Microsoft <strong>SQL Server<\/strong>. This data-loading stage often takes up a significant portion of the ETL process because the <strong>insert<\/strong> operation\u2014which inserts rows into the database\u2014is relatively slow.<\/p>\n<p>That is why there is a way to insert data in bulk called <strong>Bulk Insert<\/strong>, which allows the SQL server to read a data file and insert it directly into the database.<\/p>\n<p>Today we are going to compare the performance of these two techniques (Insert and Bulk Insert) by writing from R to SQL; we will compare the time it takes to write databases or tables of different sizes.<\/p>\n<p>The experiment will consist of inserting 3 datasets with 100,000, 1,000,000, and 10,000,000 rows, each with 10 columns of 20 random characters.<\/p>\n<p>To perform these tests, we will use the official Microsoft ODBC driver on Linux with the odbc package for R. Additionally, we will generate the CSV file for bulk insert using the data.table library, which is the fastest one I know of (if you know of another one, let me know). In both cases, we\u2019ll measure the total load time\u2014that is, from when the data is ready until it\u2019s inserted\u2014including creating and uploading the data files to the server.<\/p>\n<p>The system configuration is quite simple: the SQL server and R are in the same data center, and file transfer is handled via SAMBA (a Windows shared folder).<\/p>\n<h3>Bulk Insert vs. Insert<\/h3>\n<p>On the Y-axis we have time and on the X-axis the number of rows. We can see that the normal insert is considerably slower and that we don\u2019t have data for 10 million rows\u2014this is because the load failed.<\/p>\n<p><img data-recalc-dims=\"1\" height=\"768\" width=\"1024\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/www.geekosas.com\/wp-content\/uploads\/2022\/01\/1-1024x768.png?resize=1024%2C768&#038;ssl=1\" alt=\"\" \/><\/p>\n<p>The difference is staggering, but according to SQL Server documentation, bulk insert can run in parallel, so we\u2019ll compare the speeds for 1, 4, and 8 threads<\/p>\n<h3>Parallel Insertion.<\/h3>\n<p>I want to point out that the programming required to achieve parallel insertion is somewhat complex, but here are the results:<\/p>\n<p><img data-recalc-dims=\"1\" height=\"768\" width=\"1024\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/www.geekosas.com\/wp-content\/uploads\/2022\/01\/2-1024x768.png?resize=1024%2C768&#038;ssl=1\" alt=\"\" \/><\/p>\n<p>We can see that for large datasets, using 8 threads is beneficial, while for small datasets it can be problematic. This is because splitting the dataset and creating the threads takes time (the bulk insert call was made in Python, because creating a thread in R consumes a lot of resources; the rericulate package was used to facilitate this interaction between R and Python).<\/p>\n<h3>Types of Storage<\/h3>\n<p>Finally, columnar databases are very popular these days; they are designed more for data processing than for transactional purposes.<\/p>\n<p><img data-recalc-dims=\"1\" height=\"768\" width=\"1024\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/www.geekosas.com\/wp-content\/uploads\/2022\/01\/3-1024x768.png?resize=1024%2C768&#038;ssl=1\" alt=\"\" \/><\/p>\n<p>We can see that the columnstore is slightly faster than the classic rowstore. This is relevant because the greatest benefit of the columnstore comes from data retrieval rather than insertion, so it is highly recommended.<\/p>\n<h3>Conclusions<\/h3>\n<p>Bulk Insert is approximately 10 times faster for inserting data into a database.<\/p>\n<p>Parallel insertion offers significant benefits, but the programming is complex; if anyone needs guidance, please don\u2019t hesitate to contact me.<\/p>\n<p>Finally, the columnstore has some positive impact, but it is recommended because it improves read speed and is easy to configure.<\/p>\n<p>Lastly: I\u2019m not sharing the code this time because I developed it for the Anasac analytics environment where I currently work, so it\u2019s not generic. However, if anyone needs help, please don\u2019t hesitate to contact me.<\/p>\n<p>Translated with DeepL.com (free version)<\/p>\n","protected":false},"excerpt":{"rendered":"<div class=\"mh-excerpt\"><p>ETL tools are very useful for performing automated and recurring data transformation processes; they are characterized by performing three tasks: (E) Extract: Connect to one <a class=\"mh-excerpt-more\" href=\"https:\/\/geekosas.com\/index.php\/2020\/05\/14\/insert-records-into-the-database-at-full-speed\/\" title=\"Insert Records into the Database at Full Speed\">[&#8230;]<\/a><\/p>\n<\/div>","protected":false},"author":1,"featured_media":3082,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_feature_clip_id":0,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2},"jetpack_post_was_ever_published":false},"categories":[1],"tags":[],"class_list":["post-3223","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-sin-categoria"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"https:\/\/i0.wp.com\/geekosas.com\/wp-content\/uploads\/2021\/09\/patch.png?fit=728%2C380&ssl=1","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/p8vjqF-PZ","jetpack-related-posts":[{"id":3326,"url":"https:\/\/geekosas.com\/index.php\/2018\/05\/23\/write-in-aws-redshift-at-fill-speed\/","url_meta":{"origin":3223,"position":0},"title":"Write in AWS Redshift at Fill Speed!","author":"Daniel Fischer","date":"2018-05-23","format":false,"excerpt":"As many know, Redshift is a fork of Postgres made by Amazon to provide a Data Warehouse service. The big difference between these two products is that the former is a columnar and compressed database, while the latter is not. Columnar databases are very fast for aggregations and joins, but\u2026","rel":"","context":"In &quot;Sin categor\u00eda&quot;","block_context":{"text":"Sin categor\u00eda","link":"https:\/\/geekosas.com\/index.php\/category\/sin-categoria\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/geekosas.com\/wp-content\/uploads\/2018\/08\/RtoRedshift.png?fit=600%2C300&ssl=1&resize=350%2C200","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/geekosas.com\/wp-content\/uploads\/2018\/08\/RtoRedshift.png?fit=600%2C300&ssl=1&resize=350%2C200 1x, https:\/\/i0.wp.com\/geekosas.com\/wp-content\/uploads\/2018\/08\/RtoRedshift.png?fit=600%2C300&ssl=1&resize=525%2C300 1.5x"},"classes":[]},{"id":3235,"url":"https:\/\/geekosas.com\/index.php\/2020\/05\/23\/best-practices-when-programming-with-code\/","url_meta":{"origin":3223,"position":1},"title":"Best Practices When Programming (with code)","author":"Daniel Fischer","date":"2020-05-23","format":false,"excerpt":"Best Practices When Programming (with code) Many enjoy programming and solving algorithmic mazes in their favorite language, but what nobody likes is diving into old code or even worse, code written by someone else. I've had to do it several times, sometimes it was bad, other times worse, many times\u2026","rel":"","context":"In &quot;Sin categor\u00eda&quot;","block_context":{"text":"Sin categor\u00eda","link":"https:\/\/geekosas.com\/index.php\/category\/sin-categoria\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/geekosas.com\/wp-content\/uploads\/2022\/05\/logo.png?fit=1200%2C600&ssl=1&resize=350%2C200","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/geekosas.com\/wp-content\/uploads\/2022\/05\/logo.png?fit=1200%2C600&ssl=1&resize=350%2C200 1x, https:\/\/i0.wp.com\/geekosas.com\/wp-content\/uploads\/2022\/05\/logo.png?fit=1200%2C600&ssl=1&resize=525%2C300 1.5x, https:\/\/i0.wp.com\/geekosas.com\/wp-content\/uploads\/2022\/05\/logo.png?fit=1200%2C600&ssl=1&resize=700%2C400 2x, https:\/\/i0.wp.com\/geekosas.com\/wp-content\/uploads\/2022\/05\/logo.png?fit=1200%2C600&ssl=1&resize=1050%2C600 3x"},"classes":[]},{"id":3281,"url":"https:\/\/geekosas.com\/index.php\/2016\/05\/23\/multi-core-performance-in-r\/","url_meta":{"origin":3223,"position":2},"title":"Multi-Core Performance in R","author":"Daniel Fischer","date":"2016-05-23","format":false,"excerpt":"Introduction A few days ago, while walking around, I saw that they were selling a used HP Proliant DL360 G6. For those who don't know, it's a high\u2011performance server from 2010. Due to my interest in Data Science and BIG DATA, this toy will be useful for a diploma course\u2026","rel":"","context":"In &quot;Sin categor\u00eda&quot;","block_context":{"text":"Sin categor\u00eda","link":"https:\/\/geekosas.com\/index.php\/category\/sin-categoria\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/geekosas.com\/wp-content\/uploads\/2016\/11\/RStudio-Ball.png?fit=1000%2C1000&ssl=1&resize=350%2C200","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/geekosas.com\/wp-content\/uploads\/2016\/11\/RStudio-Ball.png?fit=1000%2C1000&ssl=1&resize=350%2C200 1x, https:\/\/i0.wp.com\/geekosas.com\/wp-content\/uploads\/2016\/11\/RStudio-Ball.png?fit=1000%2C1000&ssl=1&resize=525%2C300 1.5x, https:\/\/i0.wp.com\/geekosas.com\/wp-content\/uploads\/2016\/11\/RStudio-Ball.png?fit=1000%2C1000&ssl=1&resize=700%2C400 2x"},"classes":[]},{"id":3296,"url":"https:\/\/geekosas.com\/index.php\/2017\/05\/23\/i-will-teach-an-r-course\/","url_meta":{"origin":3223,"position":3},"title":"I will teach an R course.","author":"Daniel Fischer","date":"2017-05-23","format":false,"excerpt":"The course will be at Microsoft Chile on September 22: The R Intensive is an event designed for those who have data analysis and modeling needs in their work and want to gain in 1 day the theoretical and practical knowledge to start solving their analytical challenges with this tool.\u2026","rel":"","context":"In &quot;Sin categor\u00eda&quot;","block_context":{"text":"Sin categor\u00eda","link":"https:\/\/geekosas.com\/index.php\/category\/sin-categoria\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/geekosas.com\/wp-content\/uploads\/2016\/11\/RStudio-Ball.png?fit=1000%2C1000&ssl=1&resize=350%2C200","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/geekosas.com\/wp-content\/uploads\/2016\/11\/RStudio-Ball.png?fit=1000%2C1000&ssl=1&resize=350%2C200 1x, https:\/\/i0.wp.com\/geekosas.com\/wp-content\/uploads\/2016\/11\/RStudio-Ball.png?fit=1000%2C1000&ssl=1&resize=525%2C300 1.5x, https:\/\/i0.wp.com\/geekosas.com\/wp-content\/uploads\/2016\/11\/RStudio-Ball.png?fit=1000%2C1000&ssl=1&resize=700%2C400 2x"},"classes":[]},{"id":3304,"url":"https:\/\/geekosas.com\/index.php\/2017\/05\/23\/r-vs-ms-open-r-2\/","url_meta":{"origin":3223,"position":4},"title":"R vs MS open R","author":"Daniel Fischer","date":"2017-05-23","format":false,"excerpt":"Microsoft has realized the power of R, so it has integrated it into its systems, including Power BI and SQL Server 2017. Microsoft released a free version of R with some improvements, including native SQL Server connection, package versioning in MRAN (Microsoft Cran), and optimizations in the linear algebra package.\u2026","rel":"","context":"In &quot;Sin categor\u00eda&quot;","block_context":{"text":"Sin categor\u00eda","link":"https:\/\/geekosas.com\/index.php\/category\/sin-categoria\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/www.geekosas.com\/wp-content\/uploads\/2017\/10\/lm-1024x932.png?resize=350%2C200","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/www.geekosas.com\/wp-content\/uploads\/2017\/10\/lm-1024x932.png?resize=350%2C200 1x, https:\/\/i0.wp.com\/www.geekosas.com\/wp-content\/uploads\/2017\/10\/lm-1024x932.png?resize=525%2C300 1.5x, https:\/\/i0.wp.com\/www.geekosas.com\/wp-content\/uploads\/2017\/10\/lm-1024x932.png?resize=700%2C400 2x"},"classes":[]},{"id":3314,"url":"https:\/\/geekosas.com\/index.php\/2017\/05\/23\/estimation-of-vote-distribution-in-the-second-round-of-the-chilean-presidential-elections-2017\/","url_meta":{"origin":3223,"position":5},"title":"Estimation of vote distribution in the second round of the Chilean presidential elections 2017","author":"Daniel Fischer","date":"2017-05-23","format":false,"excerpt":"This article was made in a hurry and with little data; the definitive one is at Who voted for each candidate? After the first round of the Chilean presidential elections, it is traditional for each candidate to begin \"auctioning off\" the votes of their followers to the candidate they feel\u2026","rel":"","context":"In &quot;Sin categor\u00eda&quot;","block_context":{"text":"Sin categor\u00eda","link":"https:\/\/geekosas.com\/index.php\/category\/sin-categoria\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/geekosas.com\/wp-content\/uploads\/2017\/12\/segunda-vuelta.jpg?fit=710%2C399&ssl=1&resize=350%2C200","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/geekosas.com\/wp-content\/uploads\/2017\/12\/segunda-vuelta.jpg?fit=710%2C399&ssl=1&resize=350%2C200 1x, https:\/\/i0.wp.com\/geekosas.com\/wp-content\/uploads\/2017\/12\/segunda-vuelta.jpg?fit=710%2C399&ssl=1&resize=525%2C300 1.5x, https:\/\/i0.wp.com\/geekosas.com\/wp-content\/uploads\/2017\/12\/segunda-vuelta.jpg?fit=710%2C399&ssl=1&resize=700%2C400 2x"},"classes":[]}],"_links":{"self":[{"href":"https:\/\/geekosas.com\/index.php\/wp-json\/wp\/v2\/posts\/3223","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/geekosas.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/geekosas.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/geekosas.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/geekosas.com\/index.php\/wp-json\/wp\/v2\/comments?post=3223"}],"version-history":[{"count":1,"href":"https:\/\/geekosas.com\/index.php\/wp-json\/wp\/v2\/posts\/3223\/revisions"}],"predecessor-version":[{"id":3226,"href":"https:\/\/geekosas.com\/index.php\/wp-json\/wp\/v2\/posts\/3223\/revisions\/3226"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/geekosas.com\/index.php\/wp-json\/wp\/v2\/media\/3082"}],"wp:attachment":[{"href":"https:\/\/geekosas.com\/index.php\/wp-json\/wp\/v2\/media?parent=3223"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/geekosas.com\/index.php\/wp-json\/wp\/v2\/categories?post=3223"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/geekosas.com\/index.php\/wp-json\/wp\/v2\/tags?post=3223"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}