Tabular data is very common in bioinformatics, and in R this data is usually stored as a data.frame. You may have heard that working column-wise on a data.frame is much faster than working row-wise. I want to explain why that is, and share some tips for getting efficient performance when working with data.frames.
Other R Tutorials:
- Factors in R - How do they work, and how did I break my data?
- Long versus wide data for plotting genomics data in R.
- Pipes in R: An introduction and some advanced tips and tricks.
data.frames are fancy lists
Lists are base objects in R
There were a few objects in R that were developed before R introduced formal classes. These objects are referred to as base objects, and encompass things such as characters, integers, and logicals.
Lists are a base object.
> lst <- list(1:2, 3:4)
> lst
[[1]]
[1] 1 2
[[2]]
[1] 3 4
> sloop::otype(lst)
[1] "base"
Base objects have a class.
> class(lst)
[1] "list"
Base objects do not store a class attribute, and this list has no attributes at all.
> attributes(lst)
NULL
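If you want to double-check this without installing sloop, here is a minimal sketch using only base R. The variable name plain_list is just for illustration. It shows that the class of a base object is implicit: class() reports it, but nothing is actually stored on the object.

```r
# A plain list, like the one in the example above.
plain_list <- list(1:2, 3:4)

# class() reports an implicit class derived from the object's type...
class(plain_list)          # "list"

# ...but no class attribute is stored on the object.
attr(plain_list, "class")  # NULL

# is.object() is FALSE for objects without a class attribute.
is.object(plain_list)      # FALSE
```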
data.frames are S3 objects
data.frames on the other hand are S3 objects. S3 objects are a type of formal object in R in the object oriented programming sense.
> df <- data.frame(A=1:2, B=3:4)
> df
  A B
1 1 3
2 2 4
> sloop::otype(df)
[1] "S3"
Unlike base objects, S3 objects carry a class attribute, so the class is defined explicitly in the object's attributes.
> attributes(df)
$names
[1] "A" "B"
$class
[1] "data.frame"
$row.names
[1] 1 2
data.frames are lists
We know that data.frames are S3 objects and have the attributes that define their class (among other things). But what does this all mean? data.frames are actually just lists with extra stuff tacked on. We can first see this by checking what typeof returns for a data.frame.
> typeof(df)
[1] "list"
Notice that it returns "list". We can see this better by peeking into the internals of R.
First, we’ll check the internals of the list we made earlier.
> lobstr::sxp(lst)
[1:0x560f09f86908] <VECSXP> (named:4)
  [2:0x560f0944a248] <INTSXP> (altrep named:65535)
  [3:0x560f09484de0] <INTSXP> (altrep named:65535)
The list object is of type VECSXP (a generic vector). It holds two integer vectors (INTSXP) of length two, corresponding to 1,2 and 3,4 in our two separate list elements.
Now let's check what our data.frame looks like.
> lobstr::sxp(df)
[1:0x560f076c5d78] <VECSXP> (object named:31)
  A [2:0x560f09bbc9c8] <INTSXP> (altrep named:65535)
  B [3:0x560f09bf61e8] <INTSXP> (altrep named:65535)
  _attrib [4:0x560f09c3d098] <LISTSXP> (named:1)
    names [5:0x560f076c6078] <STRSXP> (named:65535)
    class [6:0x560f09bb2678] <STRSXP> (named:65535)
    row.names [7:0x560f09c23f98] <INTSXP> (named:65535)
The data.frame object is of type VECSXP, like our list. Also like our list, it holds two integer vectors (INTSXP) of length two, corresponding to 1,2 in column 1 and 3,4 in column 2. Unlike our list, the data.frame has the attributes names, class, and row.names, which we also saw before when we used the attributes function.
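One practical consequence is that the usual list tools work directly on a data.frame. The sketch below uses only base R. The variable name underlying is my own, not something from the packages used above.

```r
df <- data.frame(A = 1:2, B = 3:4)

# List predicates and accessors work on a data.frame.
is.list(df)        # TRUE
length(df)         # 2, the number of columns (list elements)
df[["B"]]          # the second list element: 3 4

# unclass() drops the class attribute and leaves the underlying list,
# which still carries the names and row.names attributes.
underlying <- unclass(df)
class(underlying)  # "list"
names(underlying)  # "A" "B"
```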
How is a data.frame constructed?
Since data.frames are fancy lists, you can actually approximate the creation of a data.frame from a list with a few commands. It's a good way to get an intuitive feel about what makes up a data.frame.
We have our list from before.
> lst
[[1]]
[1] 1 2
[[2]]
[1] 3 4
How can we turn it into the data.frame we made before?
> df
  A B
1 1 3
2 2 4
First, we need to give the list elements names. The names of the list elements will correspond to the column names of the data.frame.
> names(lst) <- c("A", "B")
> lst
$A
[1] 1 2
$B
[1] 3 4
> attributes(lst)
$names
[1] "A" "B"
The next attribute we need to add is row.names, which holds the values printed to the left of the first column of a data.frame.
> attr(lst, "row.names") <- 1:2
> lst
$A
[1] 1 2
$B
[1] 3 4
attr(,"row.names")
[1] 1 2
> attributes(lst)
$names
[1] "A" "B"
$row.names
[1] 1 2
Finally, we need to set the class to data.frame.
> class(lst) <- "data.frame"
> lst
  A B
1 1 3
2 2 4
Our list is now a properly formatted data.frame!
> sloop::otype(lst); class(lst)
[1] "S3"
[1] "data.frame"
> attributes(lst)
$names
[1] "A" "B"
$row.names
[1] 1 2
$class
[1] "data.frame"
But remember, even though it’s a data.frame, it’s still that list we started off with at heart.
> typeof(lst)
[1] "list"
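The three steps above can be wrapped into a small function. This is only a sketch for building intuition: list_to_df is a hypothetical helper I made up for this post, and in real code you would use data.frame() or as.data.frame(), which perform many more checks.

```r
# Hypothetical helper: turn a named list of equal-length vectors into a
# data.frame by setting the same attributes used in the steps above.
list_to_df <- function(x) {
  stopifnot(is.list(x), !is.null(names(x)))
  n_rows <- unique(lengths(x))
  stopifnot(length(n_rows) == 1L)  # every column must have the same length
  attr(x, "row.names") <- seq_len(n_rows)
  class(x) <- "data.frame"
  x
}

manual  <- list_to_df(list(A = 1:2, B = 3:4))
builtin <- data.frame(A = 1:2, B = 3:4)

# Compare the hand-built data.frame with the one from data.frame().
all.equal(manual, builtin)
```

The names were already present on the input list, so the helper only has to add row.names and the class.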
Why are row-wise operations slow?
Although there is no universal definition of a vectorized function in R, you can broadly say that a vectorized function takes a vector (multiple values) as input, applies an operation to each value, and is often written in C for speed. Vectorized operations are generally faster than their non-vectorized equivalents because the "looping" is done internally, in compiled code, where speed tricks can be applied.
A simple example of a vectorized function in R is log, because you can provide it a vector of values and it will apply a log transformation to each value.
> log(1:5)
[1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379
If log were not vectorized, you would need a loop to apply the transformation to each value, which would be slower.
> vals <- numeric(5)
> for (n in 1:5) vals[n] <- log(n)
> print(vals)
[1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379
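With only five values you will not notice a difference, so here is a rough sketch of how you could time the two approaches yourself on a longer vector. The vector length is an arbitrary choice of mine, and the timings depend on your machine, so I am not quoting any numbers.

```r
x <- seq_len(1e5)

# Vectorized: one call, and the loop over values happens in compiled code.
vectorized_time <- system.time(vec_result <- log(x))

# Explicit R-level loop over the same values.
loop_result <- numeric(length(x))
loop_time <- system.time(
  for (i in seq_along(x)) loop_result[i] <- log(x[i])
)

# Both approaches give the same values; only the elapsed time differs.
all.equal(vec_result, loop_result)
vectorized_time["elapsed"]
loop_time["elapsed"]
```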
Vectorized functions and lists.
Let’s recreate that list we made earlier.
> lst <- list(A=1:2, B=3:4)
> lst
$A
[1] 1 2
$B
[1] 3 4
Each list element is a vector of values, so working within each list element can allow us to use vectorized functions. For example, we could sum all the values in a list element using the vectorized function sum.
> sum(lst$A)
[1] 3
Or you can use lapply to iterate over every list element and apply the vectorized function sum to them all.
> lapply(lst, sum)
$A
[1] 3
$B
[1] 7
However, if we wanted to add the first value in list element A to the first value in list element B, then the second to the second, and so on, we would need to loop over positions and pull one value out of every list element at each step. This is a lot slower because you revisit each list element once per position.
> vals <- numeric(2)
> for (n in 1:2) vals[n] <- sum(lst$A[n], lst$B[n])
> vals
[1] 4 6
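For this particular task there happens to be a vectorized way out, because the + operator is itself vectorized. The sketch below is just one option and only helps when the per-position operation can be expressed with vectorized functions. The names pairwise and pairwise_any are mine.

```r
lst <- list(A = 1:2, B = 3:4)

# `+` adds the two vectors position by position in a single call.
pairwise <- lst$A + lst$B
pairwise      # 4 6

# Reduce() applies `+` across every list element, however many there are.
pairwise_any <- Reduce(`+`, lst)
pairwise_any  # 4 6
```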
Vectorized functions and data.frames.
Remember that data.frames are lists, and each column is a list element. The data.frame df we made before is the data.frame equivalent of the list we made above.
> df
  A B
1 1 3
2 2 4
Because columns are list elements, we can use vectorized functions over columns. For example, we could use the same lapply command we used on the list to get the sum for each column.
> lapply(df, sum)
$A
[1] 3
$B
[1] 7
Because data.frames are column-oriented lists, you run into the same problem as you did with lists if you want to perform row-wise operations. Let’s get the row sums this time instead of the column sums. (Ignore that we have functions like rowSums for now, as you won’t always have a nice vectorized row-wise function available.)
> vals <- numeric(2)
> for (n in 1:2) vals[n] <- sum(df[n, 1], df[n, 2])
> vals
[1] 4 6
We again had to visit every column (list element) for each row, which is significantly slower than applying the vectorized sum function over each column.
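A two-row data.frame is too small to feel the slowdown, so here is a hedged sketch for measuring it on something larger. The data.frame big, its size, and the random values are all made up for illustration, and the elapsed times will vary by machine.

```r
set.seed(1)
n <- 5000
big <- data.frame(A = runif(n), B = runif(n))

# Row-wise: index into the data.frame once per column for every row.
row_wise <- numeric(n)
row_time <- system.time(
  for (i in seq_len(n)) row_wise[i] <- sum(big[i, 1], big[i, 2])
)

# Column-wise: one vectorized addition over whole columns.
col_time <- system.time(col_wise <- big$A + big$B)

# Same answer either way; compare the elapsed times on your own machine.
all.equal(row_wise, col_wise)
row_time["elapsed"]
col_time["elapsed"]
```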
So how do we work with data.frames?
When working with data.frames you want to maximize the use of vectorized functions.
Use vectorized functions on columns when available.
Many common functions are vectorized, paste for example. This means you can provide an entire column as input, and the function will be applied to each value and return the transformed values.
> df$A <- paste("number", df$A)
> df
         A B
1 number 1 3
2 number 2 4
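The same idea extends to arithmetic, comparisons, and conditionals. As a small sketch (the new column names total, B_is_large, and label are invented for this example), each line below operates on a whole column at once.

```r
df <- data.frame(A = 1:2, B = 3:4)

# Arithmetic and comparison operators work on whole columns.
df$total <- df$A + df$B
df$B_is_large <- df$B > 3

# ifelse() is a vectorized conditional: one test per value, no explicit loop.
df$label <- ifelse(df$B_is_large, "high", "low")

df
```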
There are some functions that are vectorized to work row-wise.
You sometimes have vectorized row-wise functions. They tend to exist only for common tasks, so for anything complex you likely won’t have one available. An example is the rowSums function.
> df <- data.frame(A=1:2, B=3:4)
> rowSums(df)
[1] 4 6
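A couple of other base R functions work across rows in the same way, and there is a fallback when nothing built-in fits. The sketch below is illustrative only. The spread calculation is an invented example of a "custom" row-wise function.

```r
df <- data.frame(A = 1:2, B = 3:4)

# Built-in row-wise helpers that run in compiled code.
rowMeans(df)        # mean of each row: 2 3
pmax(df$A, df$B)    # larger value in each row: 3 4

# Fallback for a custom function: mapply() walks the columns in parallel.
# It is still an R-level loop, so expect it to be slower on large data.
spread <- mapply(function(a, b) max(a, b) - min(a, b), df$A, df$B)
spread              # 2 2
```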
It’s generally good practice to keep your data.frame in long format. I cover this in depth in my tutorial on long versus wide data, but I will give a brief example here. This is also how data is usually stored in a relational database such as MySQL.
Let’s say we have a matrix-like object stored in a data.frame, such as a count matrix for two samples "A" and "B". For the sake of this example, ignore that plain matrices have powerful vectorization options of their own.
> cts <- df
> cts$gene_id <- sprintf("ENSG%06d", seq_len(2))
> cts
  A B    gene_id
1 1 3 ENSG000001
2 2 4 ENSG000002
Your data.frame has three "variables": sample, count, and gene_id. Each of these variables should be in its own column.
> library("tidyverse")
> cts <- pivot_longer(cts, !gene_id, names_to="sample", values_to="counts")
> cts
# A tibble: 4 x 3
  gene_id    sample counts
  <chr>      <chr>   <int>
1 ENSG000001 A           1
2 ENSG000001 B           3
3 ENSG000002 A           2
4 ENSG000002 B           4
Long formatted data is advantageous for a few reasons. For example, the counts are all in one column, which corresponds to one element of a list, or one vector. This means you can use a vectorized function on all counts at once, such as our trusty log function.
> cts$counts <- log(cts$counts)
> cts
# A tibble: 4 x 3
  gene_id    sample counts
  <chr>      <chr>   <dbl>
1 ENSG000001 A       0
2 ENSG000001 B       1.10
3 ENSG000002 A       0.693
4 ENSG000002 B       1.39
Also, now that samples are spread across rows rather than columns, operations that would once have been row-wise become column-wise. For example, we could get the mean expression for each gene.
> cts %>% group_by(gene_id) %>% summarize(mean_counts=mean(counts))
# A tibble: 2 x 2
  gene_id    mean_counts
  <chr>            <dbl>
1 ENSG000001       0.549
2 ENSG000002       1.04
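If you would rather not load the tidyverse, the same grouped mean can be sketched in base R. Here I rebuild the long-format table by hand instead of using pivot_longer. The object names long and gene_means are my own.

```r
# Long-format log counts, built by hand to mirror the example above.
long <- data.frame(
  gene_id = rep(c("ENSG000001", "ENSG000002"), each = 2),
  sample  = rep(c("A", "B"), times = 2),
  counts  = log(c(1, 3, 2, 4))
)

# tapply() splits the counts column by gene and applies mean() to each group.
gene_means <- tapply(long$counts, long$gene_id, mean)
round(gene_means, 3)
```

The grouping works on a single column (one vector), which is exactly what the long format buys you.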
As genomics datasets get larger, it becomes more important to take advantage of function vectorization when working with data.frames. Remember that data.frame columns are list elements, and because of this, working column-wise will almost always be faster than working row-wise. Take advantage of vectorized functions whenever possible, seek out vectorized row-wise functions if needed, and consider holding your data in long format.
> sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux
Matrix products: default
BLAS/LAPACK: /geode2/home/u070/rpolicas/Carbonate/.conda/envs/R/lib/libopenblasp-r0.3.10.so
locale:
 [1] LC_CTYPE=en_US.UTF-8  LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8  LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8  LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8  LC_NAME=C
 [9] LC_ADDRESS=C  LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8  LC_IDENTIFICATION=C
attached base packages:
[1] stats  graphics  grDevices  utils  datasets  methods  base
other attached packages:
 [1] sloop_1.0.1  lobstr_1.1.1  forcats_0.5.1  stringr_1.4.0
 [5] dplyr_1.0.5  purrr_0.3.4  readr_1.4.0  tidyr_1.1.3
 [9] tibble_3.1.0  ggplot2_3.3.3  tidyverse_1.3.0
loaded via a namespace (and not attached):
 [1] Rcpp_1.0.6  cellranger_1.1.0  pillar_1.5.1  compiler_4.0.3
 [5] dbplyr_2.1.0  tools_4.0.3  jsonlite_1.7.2  lubridate_1.7.10
 [9] lifecycle_1.0.0  gtable_0.3.0  pkgconfig_2.0.3  rlang_0.4.10
[13] reprex_1.0.0  cli_2.3.1  rstudioapi_0.13  DBI_1.1.1
[17] haven_2.3.1  withr_2.4.1  xml2_1.3.2  httr_1.4.2
[21] fs_1.5.0  generics_0.1.0  vctrs_0.3.6  hms_1.0.0
[25] grid_4.0.3  tidyselect_1.1.0  glue_1.4.2  R6_2.5.0
[29] fansi_0.4.2  readxl_1.3.1  modelr_0.1.8  magrittr_2.0.1
[33] ps_1.6.0  backports_1.2.1  scales_1.1.1  ellipsis_0.3.1
[37] rvest_1.0.0  assertthat_0.2.1  colorspace_2.0-0  utf8_1.2.1
[41] stringi_1.5.3  munsell_0.5.0  broom_0.7.5  crayon_1.4.1