Multi-Core Performance in R

Introduction

A few days ago, while out for a walk, I came across a used HP ProLiant DL360 G6 for sale. For those who don’t know, it’s a high‑performance server from 2010.

Given my interest in data science and big data, this toy will come in handy for a diploma course I want to take, taught by the FCFM of the University of Chile.

Toy Specifications:

  • 2 x Xeon E5530 2.4 GHz, 4 cores each, with hyperthreading (disabled in this test)
  • 64 GB RAM
  • Nvidia Quadro M2000 4GB VRAM GDDR5
  • Smart Array RAID card with battery
  • 2 independent power supplies
  • 4 SAS hard drives

The Test

I will use the plyr, data.table, and reshape2 libraries for aggregations, and doSNOW for parallelization. The dataset will be a table of about 30 million records similar to a mobile operator’s database (as far as I remember, these records are called CDRs and are stored in the CRCE of the rating platform).

Basically, the table will have a user identifier, a traffic type identifier, and the traffic amount.

Generate the Data

We will generate 30 million CDR records in a table that will have:

  • User identifier
  • Traffic type
  • Traffic amount

options(stringsAsFactors = FALSE)
library(reshape2)
library(plyr)

# Parameters
set.seed(1986) # seed set to the birth date of my dear girlfriend
tipo_trafico = c("DATA","SMS","VOICE")
prob_tipo_trafico = c(0.6,0.1,0.3)
userids = 5e4 # simulate n users
registros = 3e7 # number of CDR records

# Random traffic amount values
datos_tipo_trafico =
  data.frame(
    DATA = round(runif(registros,1,1024^2)),  # data session in bytes, between 1 byte and 1 MB
    SMS = rep(1,registros),                    # each SMS generates one record
    VOICE = round(rexp(registros,1/120))       # call length in seconds
  )

# Generate CDR records
cdr = data.frame(
  id_cdr = 1:registros,
  userid = floor(runif(registros,1,userids+1)),
  tipo_trafico = sample(tipo_trafico,registros,replace = TRUE,prob = prob_tipo_trafico)
)

TI = proc.time()
cdr$trafico = datos_tipo_trafico$DATA*(cdr$tipo_trafico=="DATA") + 
              datos_tipo_trafico$SMS*(cdr$tipo_trafico=="SMS") + 
              datos_tipo_trafico$VOICE*(cdr$tipo_trafico=="VOICE")
print(proc.time()-TI)
rm(datos_tipo_trafico) # delete random data table
gc() # clean unused memory

The result is a table with this format:

   id_cdr userid tipo_trafico trafico
1       1   5579        VOICE      20
2       2  28374        VOICE      82
3       3  28526        VOICE      36
4       4  39179        VOICE      56
5       5  14244         DATA  629075
6       6  36779         DATA  690397
7       7  42774         DATA  175632
8       8   4276        VOICE     115
9       9   4445        VOICE      44
10     10  29458         DATA  946171
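
The trafico column is built by multiplying each candidate value by a logical mask, which works because R coerces TRUE and FALSE to 1 and 0 in arithmetic. Here is a minimal sketch of that idea on four toy records; the vectors tipo, bytes, and segundos are hypothetical values made up for illustration, not part of the benchmark.

```r
# Toy inputs (hypothetical values, only for illustration)
tipo     <- c("DATA", "SMS", "VOICE", "DATA")
bytes    <- c(500, 800, 120, 64)   # candidate DATA amounts
segundos <- c(30, 45, 60, 15)      # candidate VOICE amounts

# Each comparison yields TRUE/FALSE, which arithmetic treats as 1/0,
# so only the term that matches the record's traffic type survives.
trafico <- bytes * (tipo == "DATA") +
           1 * (tipo == "SMS") +
           segundos * (tipo == "VOICE")

print(trafico)  # 500 1 60 64
```

Only one of the three masks is TRUE for any given record, so the sum simply picks the matching value.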

Benchmark

We will create a table of aggregated traffic per user. This table will be generated with: reshape2, data.table, and plyr (single‑thread and multi‑thread).

Some comments before showing the results:

  • reshape2 allows aggregation but offers far fewer options than the other libraries.
  • data.table is an extension of data.frame that allows the use of indices.
  • plyr can run either single‑thread or in parallel.
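
To make the point about data.table indices concrete, here is a small sketch of what setting a key does. The table dt and its values are hypothetical and only serve as an illustration; the benchmark itself sets the key when it creates the table.

```r
library(data.table)

# Hypothetical four-row table, only for illustration
dt <- data.table(userid  = c(3L, 1L, 2L, 1L),
                 trafico = c(10, 20, 30, 40))

setkey(dt, userid)   # sorts the table by userid and marks it as the key
print(key(dt))       # "userid"

# Keyed lookup: rows whose key column equals 1
print(dt[.(1L)])

# Grouped aggregation by the key column
totales <- dt[, .(total = sum(trafico)), by = userid]
print(totales)
```

Once the key is set, the rows are physically ordered by userid, which is what lets data.table locate and group records without scanning the whole table each time.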

The code used:

# using reshape2
TI = proc.time()
tmp = reshape2::dcast(cdr, userid ~ tipo_trafico, fun.aggregate = sum, value.var = "trafico")
print(proc.time()-TI)

# using plyr 1 thread
TI = proc.time()
tmp = ddply(cdr, "userid", function(x) data.frame(DATA = sum(ifelse(x$tipo_trafico == "DATA", x$trafico, 0)),
                                                  SMS = sum(ifelse(x$tipo_trafico == "SMS", x$trafico, 0)),
                                                  VOICE = sum(ifelse(x$tipo_trafico == "VOICE", x$trafico, 0))
                                                  ), .progress = "text") # .progress takes the name of a progress bar
print(proc.time()-TI)

# using plyr 8 threads (one per CPU)
library("doSNOW")
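# NUMBER_OF_PROCESSORS is a Windows environment variable; parallel::detectCores() is a portable alternative
nCPU = as.numeric(Sys.getenv("NUMBER_OF_PROCESSORS")[1])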
cl = makeSOCKcluster(nCPU, outfile="cl.txt")
registerDoSNOW(cl)
TI = proc.time()
tmp = ddply(cdr, "userid", function(x) data.frame(DATA = sum(ifelse(x$tipo_trafico == "DATA", x$trafico, 0)),
                                                  SMS = sum(ifelse(x$tipo_trafico == "SMS", x$trafico, 0)),
                                                  VOICE = sum(ifelse(x$tipo_trafico == "VOICE", x$trafico, 0))
                                                  ), .parallel = TRUE)
print(proc.time()-TI)
stopCluster(cl)

# using plyr 2 threads (doSNOW is already loaded above)
cl = makeSOCKcluster(2, outfile="cl.txt")
registerDoSNOW(cl)
TI = proc.time()
tmp = ddply(cdr, "userid", function(x) data.frame(DATA = sum(ifelse(x$tipo_trafico == "DATA", x$trafico, 0)),
                                                  SMS = sum(ifelse(x$tipo_trafico == "SMS", x$trafico, 0)),
                                                  VOICE = sum(ifelse(x$tipo_trafico == "VOICE", x$trafico, 0))
                                                  ), .parallel = TRUE)
print(proc.time()-TI)
stopCluster(cl)

# using data.table
library(data.table)
cdr_dt = data.table(cdr, key = "userid")
TI = proc.time()
tmp = cdr_dt[, list(DATA = sum(ifelse(tipo_trafico == "DATA", trafico, 0)),
                    SMS = sum(ifelse(tipo_trafico == "SMS", trafico, 0)),
                    VOICE = sum(ifelse(tipo_trafico == "VOICE", trafico, 0))
                    ), by = "userid"]
print(proc.time()-TI)
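
Since every approach is supposed to produce the same per-user table, it helps to have a reference result to compare against. The following is a sketch using only base R (xtabs) on a small simulated sample; cdr_small, its size, and the 20 users are assumptions chosen so that it runs in an instant, and this is not one of the benchmarked methods.

```r
set.seed(1)
n <- 1000  # small hypothetical sample, not the 30 million records

cdr_small <- data.frame(
  userid       = sample.int(20, n, replace = TRUE),
  tipo_trafico = sample(c("DATA", "SMS", "VOICE"), n, replace = TRUE,
                        prob = c(0.6, 0.1, 0.3)),
  trafico      = round(runif(n, 1, 100))
)

# Sum of trafico for every userid / tipo_trafico combination
ref <- xtabs(trafico ~ userid + tipo_trafico, data = cdr_small)

# One row per user, one column per traffic type
ref_df <- as.data.frame.matrix(ref)
print(head(ref_df))

# Sanity checks: nothing lost, and one cell recomputed by hand
stopifnot(sum(ref) == sum(cdr_small$trafico))
stopifnot(ref["1", "DATA"] ==
          sum(cdr_small$trafico[cdr_small$userid == 1 &
                                cdr_small$tipo_trafico == "DATA"]))
```

Running dcast, ddply, or the data.table expression on the same cdr_small should give the same DATA, SMS, and VOICE totals per user.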

Results

library     function  threads  seconds
plyr        ddply     1           9.18
plyr        ddply     2          78.37
plyr        ddply     8         146.04
reshape2    dcast     1          17.31
data.table  [, by]    1          12.94
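
To put these timings in perspective, here is a short sketch that expresses each one relative to the single-thread plyr run. The numbers are copied from the results above; the data frame resultados and the column name relative are just illustrative choices.

```r
# Timings copied from the results table
resultados <- data.frame(
  library = c("plyr", "plyr", "plyr", "reshape2", "data.table"),
  threads = c(1, 2, 8, 1, 1),
  seconds = c(9.18, 78.37, 146.04, 17.31, 12.94)
)

# How many times slower each run is than plyr with 1 thread
resultados$relative <- round(resultados$seconds / resultados$seconds[1], 1)

print(resultados)
# relative column: 1.0, 8.5, 15.9, 1.9, 1.4
```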

Conclusions

reshape2, despite being a library specialized in this type of transformation, turned out to be slower than plyr.

plyr slowed down as more threads were added. For an operation this simple, the overhead of parallelization (starting the workers and sending the data to them) outweighs the gain from the extra CPUs.
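
The overhead can be observed in isolation with a toy task. This sketch uses the parallel package that ships with R rather than doSNOW, and the function f and the input size are assumptions picked only to make the work per element trivial. Actual timings depend on the machine, so none are shown here.

```r
library(parallel)

x <- 1:2000
f <- function(i) i * 2   # deliberately trivial work per element

# Serial version
t_serial <- system.time(r_serial <- lapply(x, f))

# Parallel version on two socket workers
cl <- makeCluster(2)
t_parallel <- system.time(r_parallel <- parLapply(cl, x, f))
stopCluster(cl)

print(t_serial["elapsed"])
print(t_parallel["elapsed"])

# Both versions compute exactly the same result
stopifnot(identical(r_serial, r_parallel))
```

When the work per element is this small, the time spent serializing inputs and collecting results typically dominates, which matches what happened with ddply in this test.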

data.table, despite using indices, turned out to be slower than plyr. Applying a more sophisticated mechanism to a simple problem can sometimes cost performance.

For the Future…

In the future, I will run these tests using larger datasets and different numbers of users to see what results are obtained.
