How to split a data.frame and calculate a row-wise average

Assa Yeroslaviz ★ 1.8k · 3.6 years ago

I have a data.frame with 48 columns. The columns are samples in triplicates, the rows are genes, and the values are expression values.

For each group of triplicates I would like to calculate the average expression per gene, resulting in a new data.frame with 16 columns, where each row holds the mean of the three replicates for each group.

Can this be done with the tidyverse tools?

Thanks.

A small 6x9 example table:

dput(head(normCounts[,1:9]))
structure(c(0, 4.89997034943019, 2.4499851747151, 0, 46.5497183195869, 
14.6999110482906, 0.998187766715749, 1.9963755334315, 0, 0.998187766715749, 
55.898514936082, 7.98550213372599, 0, 1.57112407949228, 0, 1.57112407949228, 
53.4182187027374, 4.71337223847683, 0, 1.25548317693578, 0, 0, 
52.7302934313026, 10.0438654154862, 0, 0, 0, 0, 66.3962127189125, 
23.2386744516194, 2.18533123780511, 3.27799685670766, 0, 0, 65.5599371341532, 
9.83399057012298, 0, 0, 0, 0, 74.1086143860152, 18.9580176336318, 
0, 0, 0, 0, 66.8826789069951, 13.376535781399, 0, 0, 0, 0, 50.7776960416371, 
13.0791035258762), .Dim = c(6L, 9L), .Dimnames = list(c("ENSMUSG00000103147", 
"ENSMUSG00000102269", "ENSMUSG00000096126", "ENSMUSG00000102735", 
"ENSMUSG00000098104", "ENSMUSG00000102175"), c("Sample_1", "Sample_2", 
"Sample_3", "Sample_4", "Sample_5", "Sample_6", "Sample_7", "Sample_8", 
"Sample_9")))
data.frame tidyverse

Are the triplicate groups sequential? As in, are Sample_1, Sample_2, and Sample_3 part of the first group?


Yes, from Sample_1 to Sample_48.

3.6 years ago

It will be a lot simpler to first make a group key table.

library("tidyverse")

# key table mapping each sample to its replicate group (1, 1, 1, 2, 2, 2, ...)
groups <- tibble(
  sample = colnames(normCounts),
  group  = rep(seq(1, ncol(normCounts)/3), each = 3)
)

> groups
# A tibble: 9 x 2
  sample   group
  <chr>    <int>
1 Sample_1     1
2 Sample_2     1
3 Sample_3     1
4 Sample_4     2
5 Sample_5     2
6 Sample_6     2
7 Sample_7     3
8 Sample_8     3
9 Sample_9     3

Then you can pivot your count data to long format, join the groups, and group_by/summarize to get the means.

mean_exp <- normCounts %>%
  as_tibble(rownames="gene") %>%
  pivot_longer(starts_with("Sample"), names_to="sample", values_to="counts") %>%
  left_join(groups, by="sample") %>%
  group_by(gene, group) %>%
  summarize(mean_count=mean(counts))

> head(mean_exp)
# A tibble: 6 x 3
# Groups:   gene [2]
  gene               group mean_count
  <chr>              <int>      <dbl>
1 ENSMUSG00000096126     1      0.817
2 ENSMUSG00000096126     2      0    
3 ENSMUSG00000096126     3      0    
4 ENSMUSG00000098104     1     52.0  
5 ENSMUSG00000098104     2     61.6  
6 ENSMUSG00000098104     3     63.9

You can also pivot back to a wider format if you want.

mean_exp_wider <- pivot_wider(mean_exp, names_from=group, values_from=mean_count)

> mean_exp_wider
# A tibble: 6 x 4
# Groups:   gene [6]
  gene                  `1`    `2`   `3`
  <chr>               <dbl>  <dbl> <dbl>
1 ENSMUSG00000096126  0.817  0       0  
2 ENSMUSG00000098104 52.0   61.6    63.9
3 ENSMUSG00000102175  9.13  14.4    15.1
4 ENSMUSG00000102269  2.82   1.51    0  
5 ENSMUSG00000102735  0.856  0       0  
6 ENSMUSG00000103147  0.333  0.728   0
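
If you prefer the original layout with genes as row names rather than a gene column, one possible follow-up (just a sketch, not part of the original answer; it uses tibble's column_to_rownames) would be:

# back to a plain data.frame with genes as row names
mean_exp_df <- mean_exp_wider %>%
  ungroup() %>%
  as.data.frame() %>%
  column_to_rownames("gene")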

Thank you very much, it looks very good and straightforward.

3.6 years ago

Solution (works for triplicates only):

> data.frame(apply(array(as.matrix(df), c(nrow(df),3, ncol(df)/3)),3, rowMeans), row.names = row.names(df))
                           X1         X2       X3
ENSMUSG00000103147  0.3327293  0.7284437  0.00000
ENSMUSG00000102269  2.8224900  1.5111600  0.00000
ENSMUSG00000096126  0.8166617  0.0000000  0.00000
ENSMUSG00000102735  0.8564373  0.0000000  0.00000
ENSMUSG00000098104 51.9554840 61.5621478 63.92300
ENSMUSG00000102175  9.1329285 14.3721768 15.13789
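
The hard-coded 3 is the number of replicates per group; a generalized sketch of the same array-reshape idea (assuming the replicate columns sit side by side and the replicate count divides the number of columns evenly) could look like:

n_rep <- 3  # number of replicates per group; change this to generalize

# reshape to a genes x replicates x groups array, then average over the replicate slices
avg <- apply(array(as.matrix(df), c(nrow(df), n_rep, ncol(df) / n_rep)), 3, rowMeans)
avg <- data.frame(avg, row.names = rownames(df))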

input:

> df
                    Sample_1   Sample_2  Sample_3  Sample_4 Sample_5  Sample_6 Sample_7 Sample_8 Sample_9
ENSMUSG00000103147  0.000000  0.9981878  0.000000  0.000000  0.00000  2.185331  0.00000  0.00000   0.0000
ENSMUSG00000102269  4.899970  1.9963755  1.571124  1.255483  0.00000  3.277997  0.00000  0.00000   0.0000
ENSMUSG00000096126  2.449985  0.0000000  0.000000  0.000000  0.00000  0.000000  0.00000  0.00000   0.0000
ENSMUSG00000102735  0.000000  0.9981878  1.571124  0.000000  0.00000  0.000000  0.00000  0.00000   0.0000
ENSMUSG00000098104 46.549718 55.8985149 53.418219 52.730293 66.39621 65.559937 74.10861 66.88268  50.7777
ENSMUSG00000102175 14.699911  7.9855021  4.713372 10.043865 23.23867  9.833991 18.95802 13.37654  13.0791

You should post this as an answer; since it uses array reshaping, it is probably the fastest.


cpad0112 takes the lead:

Unit: microseconds
        expr       min        lq       mean    median        uq       max neval
     ATpoint  1263.523  1385.396  1734.0353  1459.678  1806.964  4183.692   100
 rpolicastro 19126.280 25852.912 28991.6633 27842.674 31070.054 45964.087   100
     cpad112   188.652   206.134   280.1408   246.997   282.590   850.775   100

Another way of doing the same (replace 3 with the number of replicates to generalize):

> t(aggregate(t(df), list(rep(1:(ncol(df)/3),each=3)), mean)[,-1])

                         [,1]       [,2]     [,3]
ENSMUSG00000103147  0.3327293  0.7284437  0.00000
ENSMUSG00000102269  2.8224900  1.5111600  0.00000
ENSMUSG00000096126  0.8166617  0.0000000  0.00000
ENSMUSG00000102735  0.8564373  0.0000000  0.00000
ENSMUSG00000098104 51.9554840 61.5621478 63.92300
ENSMUSG00000102175  9.1329285 14.3721768 15.13789
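
For anyone wanting to reproduce such timings, a minimal sketch with the microbenchmark package, comparing only the two base-R variants from this comment thread on the small example df (the wrapper names are just labels made up here), could look like:

library(microbenchmark)

# wrappers around the two base-R approaches shown above
f_array <- function(df) {
  data.frame(apply(array(as.matrix(df), c(nrow(df), 3, ncol(df)/3)), 3, rowMeans),
             row.names = rownames(df))
}
f_aggregate <- function(df) {
  t(aggregate(t(df), list(rep(1:(ncol(df)/3), each = 3)), mean)[, -1])
}

microbenchmark(
  array_reshape = f_array(df),
  aggregate     = f_aggregate(df),
  times         = 100
)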

ATpoint 82k · 3.6 years ago

Not exactly tidyverse (rpolicastro probably knows a one-liner for this) but mostly base R. It assumes that you define a sample2group data.frame beforehand that indicates the sample-to-group membership. I generally recommend this approach, as it stays generic if you ever end up with groups whose samples are not side by side.

library(dplyr)

# key table mapping every sample to its group (group1, group1, group1, group2, ...)
sample2group <- data.frame(Sample = colnames(dat),
                           Group  = unlist(lapply(paste0("group", seq(1, ncol(dat)/3)),
                                                  function(x) rep(x, 3))))

# for each group, take the row-wise mean (per gene) over that group's samples
averaged.byGroup <- lapply(unique(sample2group$Group), function(x){

  df <- data.frame(rowMeans(dat[, sample2group[sample2group$Group == x, ]$Sample]))
  colnames(df) <- x
  return(df)

}) %>% do.call(cbind, .)

                       group1     group2   group3
ENSMUSG00000103147  0.3327293  0.7284437  0.00000
ENSMUSG00000102269  2.8224900  1.5111600  0.00000
ENSMUSG00000096126  0.8166617  0.0000000  0.00000
ENSMUSG00000102735  0.8564373  0.0000000  0.00000
ENSMUSG00000098104 51.9554840 61.5621478 63.92300
ENSMUSG00000102175  9.1329285 14.3721768 15.13789

Thanks, by the way, for providing dput output; that is how it should be done. Thumbs up!


Awesome solution as well, thanks.


And much faster, as is often the case when performing low-level operations where the actual power of the tidyverse is not required. This is simply because of all the code the tidyverse has to go through internally (most of which is not needed here, as this is very low-level, just a simple averaging).

Unit: milliseconds
        expr       min        lq      mean    median        uq      max neval
     ATpoint  1.255104  1.415695  2.164281  1.885048  2.641237 11.39588   100
 rpolicastro 19.213340 27.020071 31.849586 31.094625 34.623525 51.28777   100

I figured I would post a tidyverse solution since they asked for one, but indeed base R and data.table allow for much faster solutions.


Sure, yours is the requested one. I just said this because people (including myself) often think that the tidyverse is faster per se, but that is really only true in complex situations where a lot of sorting and filtering is required. For low-level operations, base R (which probably hands off to C/C++ internally) is often the fastest.


I definitely agree on that point. The tidyverse isn't faster than base R or data.table for most things, since it's designed more for convenience. I recommend that people get comfortable with base R before learning the tidyverse.


This is a nice observation. It doesn't really affect my data, but for bigger data sets it might be of great value. Thanks.
