Write in AWS Redshift at Fill Speed!

As many know, Redshift is a fork of Postgres made by Amazon to provide a Data Warehouse service. The big difference between these two products is that the former is a columnar and compressed database, while the latter is not.

Columnar databases are very fast for aggregations and joins, but not so much for inserting records (do not expect more than 4 records per second with a normal insert in Redshift), so they generally have some non‑standard method. In the case of Redshift, the method is as follows:

  1. If the table does not exist in the database: create it.
  2. Save the table you want to load in S3, ideally in several CSV files compressed with gzip.
  3. From Redshift call the COPY command to read the files from S3 and load them into the database.

As you can see, it is not a standard process, so sicarul created the library redshiftTools that performs all this process sequentially, but the library was quite slow: loading a dataset of 1.5 million records and 800 columns takes 21 minutes. That is why I decided to review the code to see if I could make some optimizations, among which I found:

  1. The package used write.csv to generate the CSVs, so I migrated to fwrite.
  2. The compression of the CSVs was sequential, so I parallelized the compression of the CSVs using future.
  3. There was only one function to perform an update of a table, which was slow, so I created a function that only inserts records into a table.

By making all these changes, the complete loading process was reduced to less than 6 minutes, being 350% faster than the existing one (so I already submitted a pull request to the official repository).

While the pull request is accepted, you can enjoy the optimized library in my GitHub. You can install it with the following commands from R:

install.packages("aws.ec2metadata", repos = c(cloudyr = "http://cloudyr.github.io/drat", getOption("repos")))
packages = c("devtools", "httr", "package_n","aws.s3", "Rcpp","DBI","data.table","future","future.apply","R.utils","dplyr")
for(pack in packages){
  if(!pack %in% rownames(installed.packages())){
    print(paste("installing",pack))
    install.packages(pack)
  }
}
devtools::install_github("danielfm123/redshiftTools")

A small example of how to use it in the following code:

library(data.table)
library(RPostgreSQL)
library(toolkitEntel)
setDTthreads(0)
dataset = fread("Path to some file you want to read",data.table = F,sep = "|",encoding = "Latin-1",nrows = -1)
#Connect to SQL
con = dbConnect(
  dbDriver("PostgreSQL"),
  dbname = "database name",
  user = "user",
  password = "password",
  host = "server ip"
)
# Create a table, detects column types
rs_create_table(
  dataset,
  dbcon = con,
  table_name = 'table_name',
  bucket = "bucket_name",
  region = "us-east-2",
  keys = "AWS AIM KEY of S3",
  secret_key = "AWS AIM SECRET of S3"
)
# add data to an existing table
rs_append_table(
  dataset,
  dbcon = con,
  table_name = 'table_name',
  bucket = "bucket_name",
  region = "us-east-2",
  keys = "AWS AIM KEY of S3",
  secret_key = "AWS AIM SECRET of S3"
)
# disconnect from redshift
dbDisconnect(con)

Be the first to comment

Leave a Reply

Your email address will not be published.




This site uses Akismet to reduce spam. Learn how your comment data is processed.