data.frames are fancy lists

Question

Tutorial:data.frames in R: How vectorization can save you time and heartache

6

Entering edit mode

3.1 years ago

rpolicastro 13k

Tubular data is very common in bioinformatics, and in R this data is usually stored as a data.frame. You may have heard that working column-wise on a data.frame is much faster than working row-wise. I want to explain both why this concept exists, and some tips to get efficient performance when working with data.frames.

data.frames are fancy lists

Lists are base objects in R

There were a few objects in R that were developed before R introduced formal classes. These objects are referred to as base objects, and encompass things such as characters, integers, and logicals.

Lists are a base object.

lst <- list(1:2, 3:4)

> lst
[[1]]
[1] 1 2

[[2]]
[1] 3 4

> sloop::otype(lst)
[1] "base"

Base objects have a class.

> class(lst)
[1] "list"

Base objects do not have attributes.

> attributes(lst)
NULL

data.frames are S3 objects

data.frames on the other hand are S3 objects. S3 objects are a type of formal object in R in the object oriented programming sense.

df <- data.frame(A=1:2, B=3:4)

> df
  A B
1 1 3
2 2 4

> sloop::otype(df)
[1] "S3"

Unlike base objects, S3 objects have attributes, and the class is formally defined in the attributes.

> attributes(df)
$names
[1] "A" "B"

$class
[1] "data.frame"

$row.names
[1] 1 2

data.frames are lists

We know that data.frames are S3 objects and have the attributes that define their class (among other things). But what does this all mean? data.frames are actually just lists with extra stuff tacked on. We can first see this by checking what typeof the object for a data.frame is.

> typeof(df)
[1] "list"

Notice how it returns a list. We can see this better by peaking into the internals of R.

First, we’ll check the internals of the list we made earlier.

> lobstr::sxp(lst)
[1:0x560f09f86908] <VECSXP[2]> (named:4)
  [2:0x560f0944a248] <INTSXP[2]> (altrep named:65535)
  [3:0x560f09484de0] <INTSXP[2]> (altrep named:65535)

The list object is of type vector VECSXP. This object is holding two integers (INTSXP) of length two, corresponding to 1,2 and 3,4 in our two separate list elements.

Now lets check what our data.frame looks like.

> lobstr::sxp(df)
[1:0x560f076c5d78] <VECSXP[2]> (object named:31)
  A [2:0x560f09bbc9c8] <INTSXP[2]> (altrep named:65535)
  B [3:0x560f09bf61e8] <INTSXP[2]> (altrep named:65535)
  _attrib [4:0x560f09c3d098] <LISTSXP> (named:1)
    names [5:0x560f076c6078] <STRSXP[2]> (named:65535)
    class [6:0x560f09bb2678] <STRSXP[1]> (named:65535)
    row.names [7:0x560f09c23f98] <INTSXP[2]> (named:65535)

The data.frame object is of type VECSXP like our lists. Also like our list this object is holding two integers (INTSXP) of length two, corresponding to 1,2 in column 1, and 3,4 in column 2. Unlike a list, the data.frame has attributes names, class, and row.names, which we also saw before when we used the attributes function.

How is a data.frame constructed?

Since data.frames are fancy lists, you can actually approximate the creation of a data.frame from a list with a few commands. It's a good way to get an intuitive feel about what makes up a data.frame.

We have our list from before.

> lst
[[1]]
[1] 1 2

[[2]]
[1] 3 4

How can we turn it into the data.frame we made before?

> df
  A B
1 1 3
2 2 4

First, we need to give the list elements names. The names of the list elements will correspond to the column names of the data.frame.

names(lst) <- c("A", "B")

> lst
$A
[1] 1 2

$B
[1] 3 4

> attributes(lst)
$names
[1] "A" "B"

The next attribute we need to add are row.names, which are the values next to the first column in a data.frame.

attr(lst, "row.names") <- 1:2

> lst
$A
[1] 1 2

$B
[1] 3 4

attr(,"row.names")
[1] 1 2

> attributes(lst)
$names
[1] "A" "B"

$row.names
[1] 1 2

Finally, we need to set the class to data.frame.

class(lst) <- "data.frame"

> lst
  A B
1 1 3
2 2 4

Our list is now a properly formatted data.frame!

> sloop::otype(lst); class(lst)
[1] "S3"
[1] "data.frame"

> attributes(lst)
$names
[1] "A" "B"

$row.names
[1] 1 2

$class
[1] "data.frame"

But remember, even though it’s a data.frame, it’s still that list we started off with at heart.

> typeof(lst)
[1] "list"

Why are row-wise operations slow?

Vectorized functions.

Although there is no universal definition to a vectorized function in R, you can broadly state that a vectorized function will take a vector (multiple values) as input, apply a function to each value, and will be often written in C for maximum speed. Vectorized operations are generally faster and more efficient at a particular task than non-vectorized operations since the "looping" is done internally where speed tricks can be applied.

A simple example of a vectorized function in R is log, because you can provide it a vector of values and it will apply a log transformation to each value.

> log(1:5)
[1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379

If log was theoretically not vectorized, you would need a loop to apply the transformation to all values, which would be slower.

> vals <- numeric(5)
> for (n in 1:5) vals[n] <- log(n)
> print(vals)
[1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379

Vectorized functions and lists.

Let’s recreate that list we made earlier.

lst <- list(A=1:2, B=3:4)

> lst
$A
[1] 1 2

$B
[1] 3 4

Each list element is a vector of values, so working within each list element can allow us to use vectorized functions. For example, we could sum all the values in a list element using the vectorized function sum.

> sum(lst$A)
[1] 3

Or you can use lapply to iterate over every list element and apply the vectorized function sum to them all.

> lapply(lst, sum)
$A
[1] 3

$B
[1] 7

However, if we wanted to sum the first value in list element A with the first value in list element B, and so on and so forth, we would need to iterate over the first element of every list, then iterate over the second element, etc., which would be a lot slower since you are revisiting each list element multiple times.

vals <- numeric(2)
for (n in 1:2) vals[n] <- sum(lst$A[n], lst$B[n])

> vals
[1] 4 6

Vectorized functions and data.frames.

Remember that data.frames are lists, and each column is a list element. The data.frame df we made before is the data.frame equivalent of the list we made above.

> df
  A B
1 1 3
2 2 4

Because columns are list elements, we can use vectorized functions over columns. For example, we could use the same lapply command we used in the list to get the sum for each column.

> lapply(df, sum)
$A
[1] 3

$B
[1] 7

Because data.frames are column oriented lists, you run into the same problem as you did with lists if you want to perform row-wise operations. Let’s get the row sums this time instead of the column sums. (Ignore that we have functions like rowSums for now as you won’t always have a nice vectorized row-wise function available).

vals <- numeric(2)
for (n in 1:2) vals[n] <- sum(df[n, 1], df[n, 2])

> vals
[1] 4 6

We again had to iterate over the columns (list elements) for each row, which is significantly slower than when we applied the vectorized sum function over each column.

So how do we work with data.frames?

When working with data.frames you want to maximize the use of vectorized functions.

Use vectorized functions on columns when available.

Many common functions are vectorized, such as paste for example. This means you can provide the entire column as input to the function, it will apply the function to each value, and then return the transformed value.

df$A <- paste("number", df$A)

> df
         A B
1 number 1 3
2 number 2 4

There are some functions that are vectorized to work row-wise.

You sometimes have vectorized row-wise functions. They tend to be only for common tasks, so for anything complex you likely won’t have access to it. An example is the rowSums argument.

df <- data.frame(A=1:2, B=3:4)

> rowSums(df)
[1] 4 6

Loooooooong data.

It’s generally good practice to have your data.frame in a long format. I detail this in great depth in my tutorial here, but I will give a brief example here. This is how data is usually stored in a relational database as well, such as MySQL.

Let’s say we have a matrix-like object stored in a data.frame, such as a count matrix for two samples "A" and "B". Ignore for now that simple matrices have powerful vectorization options for the sake of this example.

cts <- df
cts$gene_id <- sprintf("ENSG%06d", seq_len(2))

  A B    gene_id
1 1 3 ENSG000001
2 2 4 ENSG000002

Your data.data frame has three "variables": sample, count, and gene_id. These variables should each be in their own column.

library("tidyverse")

cts <- pivot_longer(cts, !gene_id, names_to="sample", values_to="counts")

> cts
# A tibble: 4 x 3
  gene_id    sample counts
  <chr>      <chr>   <int>
1 ENSG000001 A           1
2 ENSG000001 B           3
3 ENSG000002 A           2
4 ENSG000002 B           4

Long formatted data is advantageous for a few reasons. For example, the counts are all in one column, which corresponds to one element of a list, or one vector. This means you can use a vectorized function on all counts at once, such as our trusty log function.

cts$counts <- log(cts$counts)

> cts
# A tibble: 4 x 3
  gene_id    sample counts
  <chr>      <chr>   <dbl>
1 ENSG000001 A       0    
2 ENSG000001 B       1.10 
3 ENSG000002 A       0.693
4 ENSG000002 B       1.39

Also, now that samples are represented in one row, you can perform column-wise operations that would have once been row-wise. For example, we could get the mean expression for each gene.

cts %>%
  group_by(gene_id) %>%
  summarize(mean_counts=mean(counts))

# A tibble: 2 x 2
  gene_id    mean_counts
  <chr>         <dbl>
1 ENSG000001    0.549
2 ENSG000002    1.04

Concluding remarks.

As genomics datasets get larger, it becomes more important to take advantage of function vectorization when working with data.frames. Remember that data.frame columns are list elements, and because of this working column-wise will almost always be faster than row-wise. Always take advantage of vectorized functions when possible, seek out vectorized row-wise functions if needed, and consider holding your data in long format.

> sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux

Matrix products: default
BLAS/LAPACK: /geode2/home/u070/rpolicas/Carbonate/.conda/envs/R/lib/libopenblasp-r0.3.10.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] sloop_1.0.1     lobstr_1.1.1    forcats_0.5.1   stringr_1.4.0  
 [5] dplyr_1.0.5     purrr_0.3.4     readr_1.4.0     tidyr_1.1.3    
 [9] tibble_3.1.0    ggplot2_3.3.3   tidyverse_1.3.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.6       cellranger_1.1.0 pillar_1.5.1     compiler_4.0.3  
 [5] dbplyr_2.1.0     tools_4.0.3      jsonlite_1.7.2   lubridate_1.7.10
 [9] lifecycle_1.0.0  gtable_0.3.0     pkgconfig_2.0.3  rlang_0.4.10    
[13] reprex_1.0.0     cli_2.3.1        rstudioapi_0.13  DBI_1.1.1       
[17] haven_2.3.1      withr_2.4.1      xml2_1.3.2       httr_1.4.2      
[21] fs_1.5.0         generics_0.1.0   vctrs_0.3.6      hms_1.0.0       
[25] grid_4.0.3       tidyselect_1.1.0 glue_1.4.2       R6_2.5.0        
[29] fansi_0.4.2      readxl_1.3.1     modelr_0.1.8     magrittr_2.0.1  
[33] ps_1.6.0         backports_1.2.1  scales_1.1.1     ellipsis_0.3.1  
[37] rvest_1.0.0      assertthat_0.2.1 colorspace_2.0-0 utf8_1.2.1      
[41] stringi_1.5.3    munsell_0.5.0    broom_0.7.5      crayon_1.4.1

R • 2.3k views

ADD COMMENT • link 3.1 years ago by rpolicastro 13k