As many know, Redshift is a fork of Postgres made by Amazon to provide a Data Warehouse service. The big difference between these two products is that the former is a columnar and compressed database, while the latter is not.
Columnar databases are very fast for aggregations and joins, but not so much for inserting records (do not expect more than 4 records per second with a normal insert in Redshift), so they generally have some non‑standard method. In the case of Redshift, the method is as follows:
- If the table does not exist in the database: create it.
- Save the table you want to load in S3, ideally in several CSV files compressed with gzip.
- From Redshift call the COPY command to read the files from S3 and load them into the database.
As you can see, it is not a standard process, so sicarul created the library redshiftTools that performs all this process sequentially, but the library was quite slow: loading a dataset of 1.5 million records and 800 columns takes 21 minutes. That is why I decided to review the code to see if I could make some optimizations, among which I found:
- The package used write.csv to generate the CSVs, so I migrated to fwrite.
- The compression of the CSVs was sequential, so I parallelized the compression of the CSVs using future.
- There was only one function to perform an update of a table, which was slow, so I created a function that only inserts records into a table.
By making all these changes, the complete loading process was reduced to less than 6 minutes, being 350% faster than the existing one (so I already submitted a pull request to the official repository).
While the pull request is accepted, you can enjoy the optimized library in my GitHub. You can install it with the following commands from R:
install.packages("aws.ec2metadata", repos = c(cloudyr = "http://cloudyr.github.io/drat", getOption("repos")))
packages = c("devtools", "httr", "package_n","aws.s3", "Rcpp","DBI","data.table","future","future.apply","R.utils","dplyr")
for(pack in packages){
if(!pack %in% rownames(installed.packages())){
print(paste("installing",pack))
install.packages(pack)
}
}
devtools::install_github("danielfm123/redshiftTools")
A small example of how to use it in the following code:
library(data.table)
library(RPostgreSQL)
library(toolkitEntel)
setDTthreads(0)
dataset = fread("Path to some file you want to read",data.table = F,sep = "|",encoding = "Latin-1",nrows = -1)
#Connect to SQL
con = dbConnect(
dbDriver("PostgreSQL"),
dbname = "database name",
user = "user",
password = "password",
host = "server ip"
)
# Create a table, detects column types
rs_create_table(
dataset,
dbcon = con,
table_name = 'table_name',
bucket = "bucket_name",
region = "us-east-2",
keys = "AWS AIM KEY of S3",
secret_key = "AWS AIM SECRET of S3"
)
# add data to an existing table
rs_append_table(
dataset,
dbcon = con,
table_name = 'table_name',
bucket = "bucket_name",
region = "us-east-2",
keys = "AWS AIM KEY of S3",
secret_key = "AWS AIM SECRET of S3"
)
# disconnect from redshift
dbDisconnect(con)

Leave a Reply