Insert Records into the Database at Full Speed

ETL tools are very useful for performing automated and recurring data transformation processes; they are characterized by performing three tasks:
(E) Extract: Connect to one or more sources and extract data.
(T) Transform: Transform or manipulate the data.
(L) Load: Load the transformed data into the final repository.

That is where the name ETL comes from.

Generally, the final stage (Load) loads the data into a data lake or an SQL database, typically Microsoft SQL Server. This data-loading stage often takes up a significant portion of the ETL process because the insert operation—which inserts rows into the database—is relatively slow.

That is why there is a way to insert data in bulk called Bulk Insert, which allows the SQL server to read a data file and insert it directly into the database.

Today we are going to compare the performance of these two techniques (Insert and Bulk Insert) by writing from R to SQL; we will compare the time it takes to write databases or tables of different sizes.

The experiment will consist of inserting 3 datasets with 100,000, 1,000,000, and 10,000,000 rows, each with 10 columns of 20 random characters.

To perform these tests, we will use the official Microsoft ODBC driver on Linux with the odbc package for R. Additionally, we will generate the CSV file for bulk insert using the data.table library, which is the fastest one I know of (if you know of another one, let me know). In both cases, we’ll measure the total load time—that is, from when the data is ready until it’s inserted—including creating and uploading the data files to the server.

The system configuration is quite simple: the SQL server and R are in the same data center, and file transfer is handled via SAMBA (a Windows shared folder).

Bulk Insert vs. Insert

On the Y-axis we have time and on the X-axis the number of rows. We can see that the normal insert is considerably slower and that we don’t have data for 10 million rows—this is because the load failed.

The difference is staggering, but according to SQL Server documentation, bulk insert can run in parallel, so we’ll compare the speeds for 1, 4, and 8 threads

Parallel Insertion.

I want to point out that the programming required to achieve parallel insertion is somewhat complex, but here are the results:

We can see that for large datasets, using 8 threads is beneficial, while for small datasets it can be problematic. This is because splitting the dataset and creating the threads takes time (the bulk insert call was made in Python, because creating a thread in R consumes a lot of resources; the rericulate package was used to facilitate this interaction between R and Python).

Types of Storage

Finally, columnar databases are very popular these days; they are designed more for data processing than for transactional purposes.

We can see that the columnstore is slightly faster than the classic rowstore. This is relevant because the greatest benefit of the columnstore comes from data retrieval rather than insertion, so it is highly recommended.

Conclusions

Bulk Insert is approximately 10 times faster for inserting data into a database.

Parallel insertion offers significant benefits, but the programming is complex; if anyone needs guidance, please don’t hesitate to contact me.

Finally, the columnstore has some positive impact, but it is recommended because it improves read speed and is easy to configure.

Lastly: I’m not sharing the code this time because I developed it for the Anasac analytics environment where I currently work, so it’s not generic. However, if anyone needs help, please don’t hesitate to contact me.

Translated with DeepL.com (free version)

Geekosas

ASDFantastinc!