Question

Tutorial:Long versus wide data for plotting genomics data in R.

12

Entering edit mode

4.5 years ago

rpolicastro 13k

I've found that the data structure required for plotting can sometimes be confusing for people first learning R. More specifically, the concept of long versus wide data. In order to illustrate this concept, I will show an example of making a simple box plot of expression from a toy RNA-seq count matrix.

Other R tutorials:

I first generate some example data to work with.

mat <- cbind(
  t(replicate(4, rnorm(3, runif(1, 10, 500), runif(1, 1, 25)))),
  t(replicate(4, rnorm(3, runif(1, 10, 500), runif(1, 1, 25))))
)
dimnames(mat) <- list(
  sprintf("ENS%06d", seq(100, 103)),
  c(sprintf("EV_%02d", seq_len(3)), sprintf("KD_%02d", seq_len(3))) 
)

> mat
             EV_01    EV_02    EV_03    KD_01    KD_02    KD_03
ENS000100 273.1218 274.6152 264.7982 216.8490 198.8390 220.9580
ENS000101 514.5558 456.3855 487.5547 342.2199 329.2495 339.9095
ENS000102 332.7703 342.6765 338.4658 291.3258 291.4720 299.3414
ENS000103 476.9946 479.9996 515.0222 200.8888 150.8901 147.1827

When thinking about long versus wide data, I like to think about how many "variables" my data has. In this count matrix we have genes (row names), samples (column names), and expression (matrix elements). To plot this data with ggplot2, each one of these variables should be their own column, or in long format. Right now the samples variable (and accompanying expression data) is spread among multiple columns, making this data wide format.

There are multiple ways to convert this data from wide to long format. First though, we should turn this count matrix into a data.frame, making sure that the gene names are their own variable column (instead of rownames). I provide base R, data.table, and tidyverse solutions, but will be using the tidyverse functions as display.

# base R
df <- as.data.frame(mat)
df$genes <- rownames(df)
rownames(df) <- NULL

# data.table
library("data.table")
DT <- as.data.table(mat, keep.rownames="genes")

# Tidyverse
library("dplyr")
library("tidyr")
df <- as_tibble(mat, rownames="genes")

> df
# A tibble: 4 x 7
  genes     EV_01 EV_02 EV_03 KD_01 KD_02 KD_03
  <chr>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ENS000100  273.  275.  265.  217.  199.  221.
2 ENS000101  515.  456.  488.  342.  329.  340.
3 ENS000102  333.  343.  338.  291.  291.  299.
4 ENS000103  477.  480.  515.  201.  151.  147.

Now that you have a properly formatted data.frame, you can change your data from wide to long format. For all methods described below, there will be a few important arguments. First, you need to specify the columns you want to collapse, in this case the columns with the expression values. Second, the column names (samples) will be placed into a single column, so you must name this column. Third, the elements of the collapsed columns (expression) will also be placed into a single column, so you must name that as well. I use grep to get all of the column names I want to collapse, but they can also be specified manually in a vector.

# base R
df <- reshape(
  df, direction="long", idvar="genes", v.names="expression", timevar="sample",
  varying=grep("EV|KD", colnames(df), value=TRUE), 
  times=grep("EV|KD", colnames(df), value=TRUE)
) 

# data.table
DT <- melt(
  DT, measure.vars=grep("EV|KD", colnames(DT), value=TRUE),
  variable.name="sample", value.name="expression"
)

# Tidyverse
df <- pivot_longer(df, -genes, names_to="sample", values_to="expression")

> head(df)
# A tibble: 6 x 3
  genes     sample expression
  <chr>     <chr>       <dbl>
1 ENS000100 EV_01        273.
2 ENS000100 EV_02        275.
3 ENS000100 EV_03        265.
4 ENS000100 KD_01        217.
5 ENS000100 KD_02        199.
6 ENS000100 KD_03        221.

You can now notice that our variables genes, samples, and expression are all represented in columns. This means our data is now in long format. Before we plot we are going to separate the sample type from the replicate number, so that the sample types can be plotted together. This can actually be done simultaneously with the pivoting functions with this example, but is separated here for clarity.

# base R
df <- cbind(df, data.frame(do.call('rbind', strsplit(as.character(df$sample),'_'))))
df$sample <- NULL
colnames(df)[colnames(df) %in% c("X1", "X2")] <- c("sample", "replicate")

# data.table
DT[, c("sample", "replicate") := tstrsplit(sample, "_")]

# Tidyverse
df <- separate(df, sample, into=c("sample", "replicate"), sep="_")

> head(df)
# A tibble: 6 x 4
  genes     sample replicate expression
  <chr>     <chr>  <chr>          <dbl>
1 ENS000100 EV     01              273.
2 ENS000100 EV     02              275.
3 ENS000100 EV     03              265.
4 ENS000100 KD     01              217.
5 ENS000100 KD     02              199.
6 ENS000100 KD     03              221.

Finally, we can use out newly formatted long data for plotting using ggplot.

library("ggplot2")

ggplot(df, aes(x=genes, y=log2(expression), fill=sample)) +
  geom_boxplot(position="dodge", width=0.25) +
  theme(text=element_text(size=16), axis.text.x=element_text(angle=45, hjust=1))

box plot

For more on pivoting data see vignette("datatable-reshape", package="data.table") for data.table, and vignette("pivot", package="tidyr") for tidyverse.

> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux

Matrix products: default
BLAS/LAPACK: /geode2/home/u070/rpolicas/Carbonate/.conda/envs/bioconductor/lib/libopenblasp-r0.3.10.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] ggplot2_3.3.2     tidyr_1.1.1       dplyr_1.0.1       data.table_1.12.8

loaded via a namespace (and not attached):
 [1] withr_2.2.0      crayon_1.3.4     grid_4.0.2       R6_2.4.1        
 [5] gtable_0.3.0     lifecycle_0.2.0  magrittr_1.5     scales_1.1.1    
 [9] pillar_1.4.6     rlang_0.4.7      vctrs_0.3.2      generics_0.0.2  
[13] ellipsis_0.3.1   glue_1.4.1       munsell_0.5.0    purrr_0.3.4     
[17] compiler_4.0.2   colorspace_1.4-1 pkgconfig_2.0.3  tidyselect_1.1.0
[21] tibble_3.0.3

ggplot2 tidyverse data.table R • 3.2k views

ADD COMMENT • link updated 23 months ago by Ram 44k • written 4.5 years ago by rpolicastro 13k