The discussion about which language is best for data analysis can get more heated than conversations about religion or politics. But as data scientists we must focus on empirical evidence, and there are many dimensions to compare: community, performance, editors, package manager, code encapsulation, etc. I have evaluated several of these dimensions and find it hard to say which language is best: some win in one area, others in another.
Of all the dimensions, I believe there are 3 that make a Data Analysis language effective, and the rest will come thanks to their communities:
1) Existence of a REPL console, to be able to run tests.
2) Performance, because if it is slow, we won’t be able to execute anything.
3) Number of lines to execute, because trial and error are the parents of science and each experiment cannot take too much time.
There are more than three languages that meet these conditions, such as Ruby or Kotlin, but we will focus on Python, R, and Julia.
Python is a multi-purpose object-oriented language on which packages for data analysis have been developed, such as pandas for tables, scikit-learn for Machine Learning, and NumPy for matrices, among others. Its main strength is that, since any kind of development can be done in Python, it is much easier to integrate models into production applications.
R is a functional language invented for data analysis. Its main data structure is DataFrames (in-memory tables) and the entire language revolves around them. Although R is focused on manipulating tables, thanks to its huge community, today you can do completely different and unexpected things with it, such as web pages (in Shiny) or REST APIs (Plumber).
Julia is another functional language, very new, similar to R but not so centered on DataFrames. Its focus is on being fast and distributed, achieving performance close to C++. The interesting thing is that while the efficient libraries of R and Python are programmed in another language, Julia is mainly written in Julia.
All three languages are high-level and have a REPL console, so the 4 experiments below focus on dimensions 2) Performance and 3) Number of lines. Since Julia is very new, results for Julia are reported both for the current version and for the version from 6 months ago.
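The post does not say how the timings were collected. One common approach, shown here only as a minimal sketch, is to run the same piece of code several times and keep the fastest run, which reduces noise from other processes. The names best_of and sample_workload are hypothetical and are not taken from the original experiments.

```python
import time

def best_of(func, repeats=5):
    """Run func() `repeats` times and return the fastest wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)

# Hypothetical workload, used only to show how the helper is called.
def sample_workload():
    return sum(i * i for i in range(100_000))

elapsed = best_of(sample_workload, repeats=3)
print(f"best of 3: {elapsed:.6f} s")
```

Any of the Python snippets in the experiments could be wrapped in a function and passed to a helper like this one.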
Experiment 1
Aggregation on a DataFrame of 10 million records. Some languages have more than one library to do this, so there will be more than one result for each.
In the following graph we see the times for different libraries to execute the same aggregation, where:
data.table is a library for manipulating tables programmed in C and focused mainly on being fast.
data.table + setkey – data.table allows creating indexes (keys); in this case, an index was created on the grouping column.
DataFrames is one of the libraries for manipulating tables in Julia.
JuliaDB is one of the libraries for manipulating tables in Julia.
pandas is the library for manipulating tables in Python, based on NumPy.
Queryverse is one of the libraries for manipulating tables in Julia and aims to imitate Tidyverse.
tidyverse is actually dplyr, which, due to its easy notation, is the most popular library for manipulating tables in R.
Bar Plot of Execution Times

Clearly Julia is not the fastest; that was Python, whose pandas library has recently been heavily optimized. Still, it is worth noting that Julia has improved a lot in the last 6 months, catching up with data.table. (data.table + setkey only counts the aggregation time, not the index creation time, so it cannot be considered the winner.) It should also be noted that, since Julia is written in Julia, these aggregations can be programmed in an explicit loop without losing speed. An example is shown at the end of the code section.
The code to perform each of these aggregations is as follows:
data.table
dt[,list(valor = mean(valor)),by='categoria']
data.table + setkey
setkey(dt,categoria)
dt[,list(valor = mean(valor)),by='categoria']
DataFrames
aggregate(datos,:categoria,[mean])
or
by(datos,:categoria,x->(promedio = mean(x.valor)))
JuliaDB
JuliaDB.groupby(mean, indexed, :categoria, select =:valor)
pandas
datos.groupby(['categoria']).aggregate(np.mean)
Queryverse
datos |> @groupby(.categoria) |> @map({categoria = key(),promedio = mean(_.valor)}) |> DataFrame
tidyverse
datos %>% group_by(categoria) %>% summarise(valor = mean(valor))
Julia using a Loop
results = []
for slice in groupby(datos, :categoria)
    # one named tuple per group: the category and the mean of its values
    push!(results, (categoria = slice.categoria[1], promedio = mean(slice.valor)))
end
DataFrame(results)
All are similar in length, but the syntax of Queryverse and tidyverse is the clearest and most flexible. The last example, the explicit Julia loop, matches the performance of DataFrames and offers maximum flexibility, at the cost of a few more lines.
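For readers less familiar with group-by aggregations, the following pure-Python sketch spells out what each of the snippets above computes: the mean of valor for every distinct categoria. It is an illustration only, not one of the benchmarked implementations; the function name group_mean and the three-row table are made up for the example.

```python
from collections import defaultdict

def group_mean(rows, key, value):
    """Return {group: mean of `value`} for an iterable of dict rows."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        totals[row[key]] += row[value]
        counts[row[key]] += 1
    return {group: totals[group] / counts[group] for group in totals}

# Tiny made-up table standing in for the 10-million-row DataFrame.
datos = [
    {"categoria": "a", "valor": 1.0},
    {"categoria": "b", "valor": 4.0},
    {"categoria": "a", "valor": 3.0},
]
print(group_mean(datos, "categoria", "valor"))  # {'a': 2.0, 'b': 4.0}
```

The libraries compared above do the same bookkeeping, but in compiled and vectorized code, which is where the time differences come from.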
Experiment 2
Recursion, specifically the Fibonacci series programmed as a recursive function. A recursive function is one that calls itself several times to obtain a result, so this experiment measures the efficiency of explicit code, without libraries.
In particular, we compute Fibonacci(100), which forces us to call the function roughly 100 times.
The code is very similar in all three languages, so I will only show the Python version:
def fico(n, contador=2, ant=1, antant=0):
    if n <= 1:
        return n
    if n == contador:
        return ant + antant
    else:
        return fico(n, contador + 1, ant + antant, ant)
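As a sanity check on the function above, the sketch below adds an optional call counter and compares the result against a plain loop. The stats argument and the fib_loop helper are additions made for this illustration and were not part of the benchmarked code.

```python
def fico(n, contador=2, ant=1, antant=0, stats=None):
    """Same recursion as above; `stats` is an added, optional call counter."""
    if stats is not None:
        stats["calls"] += 1
    if n <= 1:
        return n
    if n == contador:
        return ant + antant
    return fico(n, contador + 1, ant + antant, ant, stats)

def fib_loop(n):
    """Plain iterative Fibonacci, used only as a reference result."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

stats = {"calls": 0}
result = fico(100, stats=stats)
print(result == fib_loop(100), stats["calls"])
```

With this counter, computing fico(100) takes 99 calls, one for each value of contador from 2 to 100, so the recursion depth stays far below Python's default recursion limit.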
What is not so similar is the execution time. Below are the execution times for each language on two different dates. The clear loser is the Python run from 2020/08/01, which used Python 3.7.

Now if we look at the results on a logarithmic scale (log10), we see that all three languages have improved in the last 6 months and Julia is the clear winner by more than 2 orders of magnitude. On the other hand, Python 3.8 improved a lot compared to the previous version, managing to be on par with R.

Frankly, the times for each language were small, so I would only recommend Julia if your model requires calling a user-defined function many times, for example in Optimization using the gradient method.
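To illustrate the kind of workload meant here, this is a minimal sketch of gradient descent in Python. The objective f(x) = (x - 3)**2, the learning rate, and the step count are arbitrary choices for the example; the point is only that the user-defined gradient function is called once per iteration.

```python
def gradient_descent(grad, x0, learning_rate=0.1, steps=1000):
    """Minimise a 1-D function by repeatedly stepping against its gradient."""
    x = x0
    for _ in range(steps):
        x -= learning_rate * grad(x)
    return x

counter = {"calls": 0}

def grad_f(x):
    """Gradient of the made-up objective f(x) = (x - 3)**2."""
    counter["calls"] += 1
    return 2.0 * (x - 3.0)

x_min = gradient_descent(grad_f, x0=0.0)
print(round(x_min, 6), counter["calls"])
```

A real model would have many parameters and a far more expensive gradient, so the per-call overhead measured in this experiment is paid thousands or millions of times.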
Experiment 3
Performance on Matrices
All three languages are very powerful in matrix handling. Many operations can be performed on matrices, but I chose eigenvalue decomposition to run the test (the snippets below call svd, the closely related singular value decomposition). This is a relatively heavy operation.
Bar Plot of Times to Compute Eigenvalues

We see that Julia is the fastest, but very close to R. Python, on the other hand, lags behind.
The notation is very similar in all three languages, but I will show each of them anyway.
R
mat = matrix(runif(1000*1000),nrow=1000)
svd(mat)
Julia
mat = rand(1000,1000)
svd(mat)
Python is a bit different because the functions are accessed through the NumPy module:
import numpy as np
mat = np.random.rand(1000,1000)
np.linalg.svd(mat)
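The snippets above call svd, which returns singular values rather than eigenvalues. The two are related: the singular values of a matrix M are the square roots of the eigenvalues of M^T M. The sketch below shows this for a 2x2 matrix using only the standard library; the helper names and the sample matrix are made up for the illustration.

```python
import math

def eigenvalues_sym_2x2(a, b, c):
    """Eigenvalues of the symmetric matrix [[a, b], [b, c]], largest first."""
    mean = (a + c) / 2.0
    radius = math.hypot((a - c) / 2.0, b)
    return mean + radius, mean - radius

def singular_values_2x2(m):
    """Singular values of a 2x2 matrix m: square roots of the eigenvalues of m^T m."""
    (p, q), (r, s) = m
    # Entries of the symmetric product m^T m.
    a = p * p + r * r
    b = p * q + r * s
    c = q * q + s * s
    big, small = eigenvalues_sym_2x2(a, b, c)
    return math.sqrt(big), math.sqrt(max(small, 0.0))

mat = [[3.0, 0.0], [4.0, 5.0]]
s1, s2 = singular_values_2x2(mat)
print(s1, s2)
```

This sample matrix is triangular, so its eigenvalues are simply 3 and 5, while its singular values are about 6.71 and 2.24. The two decompositions generally give different numbers, although both are similarly heavy for a 1000 x 1000 matrix.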
Experiment 4
Time to modify elements of a matrix inside a loop.
In this case, in a loop we will fill the previous matrix with mat[i,j] = i + j. The times are shown in the following graph:

Python is by far the slowest, although it has improved in Python 3.8. On the other hand, since R objects behave as immutable values (modifying one can trigger a copy), I expected worse results from R.
Again the code is similar across languages, but Python's itertools.product deserves an honorable mention, because it allows iterating over two variables with a single for loop:
import itertools
for i, j in itertools.product(range(1000), range(1000)):
    mat[i, j] = i + j
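To make the equivalence concrete, the sketch below fills a small matrix in three ways and checks that the results match. It uses plain nested lists instead of a NumPy array and a 4 x 4 size instead of 1000 x 1000, both of which are simplifications made for the example.

```python
import itertools

n = 4  # small stand-in for the 1000 x 1000 matrix in the post

# Nested loops: the most direct translation of mat[i, j] = i + j.
nested = [[0] * n for _ in range(n)]
for i in range(n):
    for j in range(n):
        nested[i][j] = i + j

# Same fill with a single loop over all (i, j) pairs.
flat = [[0] * n for _ in range(n)]
for i, j in itertools.product(range(n), range(n)):
    flat[i][j] = i + j

# List comprehension: builds the matrix without explicit assignment.
built = [[i + j for j in range(n)] for i in range(n)]

print(nested == flat == built)  # True
```

itertools.product visits the pairs in the same order as the nested loops, so the choice between the two is about readability rather than behavior.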
Footnote
I did this analysis because I remembered that the last time I tried to use Julia, it was very annoying that importing libraries took a long time. This has improved, but they still take a while, especially for large libraries like Queryverse. Still, it is an interesting project.
On the other hand, in general the languages are similar, but I think the main difference is Python’s object orientation, which is very evident in how Pandas DataFrames are manipulated.
Finally, although the performance differences may seem large, when compared to Excel, all three alternatives are fast and efficient.
Greetings!
