Question: How to split a data.frame and calculate a row-wise average
Assa Yeroslaviz (Munich) wrote 5 weeks ago:

I have a data.frame with 48 columns. The columns are samples in triplicates, the rows are genes, and the values are expression values.

For each group of triplicates I would like to calculate the average expression per gene, resulting in a new data.frame with 16 columns, where each row holds the mean of the three replicates for each group.

Can this be done with tidyverse tools?

Thanks.

A small 6×9 example table (via dput):

dput(head(normCounts[,1:9]))
structure(c(0, 4.89997034943019, 2.4499851747151, 0, 46.5497183195869, 
14.6999110482906, 0.998187766715749, 1.9963755334315, 0, 0.998187766715749, 
55.898514936082, 7.98550213372599, 0, 1.57112407949228, 0, 1.57112407949228, 
53.4182187027374, 4.71337223847683, 0, 1.25548317693578, 0, 0, 
52.7302934313026, 10.0438654154862, 0, 0, 0, 0, 66.3962127189125, 
23.2386744516194, 2.18533123780511, 3.27799685670766, 0, 0, 65.5599371341532, 
9.83399057012298, 0, 0, 0, 0, 74.1086143860152, 18.9580176336318, 
0, 0, 0, 0, 66.8826789069951, 13.376535781399, 0, 0, 0, 0, 50.7776960416371, 
13.0791035258762), .Dim = c(6L, 9L), .Dimnames = list(c("ENSMUSG00000103147", 
"ENSMUSG00000102269", "ENSMUSG00000096126", "ENSMUSG00000102735", 
"ENSMUSG00000098104", "ENSMUSG00000102175"), c("Sample_1", "Sample_2", 
"Sample_3", "Sample_4", "Sample_5", "Sample_6", "Sample_7", "Sample_8", 
"Sample_9")))
Tags: tidyverse, data.frame

Are the triplicate groups sequential? As in, are Sample_1, Sample_2, and Sample_3 part of the first group?

— rpolicastro

Yes, from Sample_1 to Sample_48.

— Assa Yeroslaviz
rpolicastro wrote 5 weeks ago:

It will be a lot simpler to first make a group key table.

library("tidyverse")

groups <- tibble(
  sample=colnames(normCounts),
  group=rep(seq(1, ncol(normCounts)/3), each=3)
)

> groups
# A tibble: 9 x 2
  sample   group
  <chr>    <int>
1 Sample_1     1
2 Sample_2     1
3 Sample_3     1
4 Sample_4     2
5 Sample_5     2
6 Sample_6     2
7 Sample_7     3
8 Sample_8     3
9 Sample_9     3

Then you can pivot your count data to long format, join the groups, and group_by/summarize to get the means.

mean_exp <- normCounts %>%
  as_tibble(rownames="gene") %>%
  pivot_longer(starts_with("Sample"), names_to="sample", values_to="counts") %>%
  left_join(groups, by="sample") %>%
  group_by(gene, group) %>%
  summarize(mean_count=mean(counts))

> head(mean_exp)
# A tibble: 6 x 3
# Groups:   gene [2]
  gene               group mean_count
  <chr>              <int>      <dbl>
1 ENSMUSG00000096126     1      0.817
2 ENSMUSG00000096126     2      0    
3 ENSMUSG00000096126     3      0    
4 ENSMUSG00000098104     1     52.0  
5 ENSMUSG00000098104     2     61.6  
6 ENSMUSG00000098104     3     63.9

You can also pivot back to a wider format if you want.

mean_exp_wider <- pivot_wider(mean_exp, names_from=group, values_from=mean_count)

> mean_exp_wider
# A tibble: 6 x 4
# Groups:   gene [6]
  gene                  `1`    `2`   `3`
  <chr>               <dbl>  <dbl> <dbl>
1 ENSMUSG00000096126  0.817  0       0  
2 ENSMUSG00000098104 52.0   61.6    63.9
3 ENSMUSG00000102175  9.13  14.4    15.1
4 ENSMUSG00000102269  2.82   1.51    0  
5 ENSMUSG00000102735  0.856  0       0  
6 ENSMUSG00000103147  0.333  0.728   0
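As a possible variant (a sketch, with a hypothetical toy matrix standing in for normCounts): the group can also be derived from the column position instead of joining a key table, assuming sequential triplicates as confirmed above.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical 2-gene x 6-sample toy matrix in place of normCounts.
normCounts <- matrix(1:12, nrow = 2,
                     dimnames = list(c("geneA", "geneB"),
                                     paste0("Sample_", 1:6)))

mean_exp <- normCounts %>%
  as_tibble(rownames = "gene") %>%
  pivot_longer(-gene, names_to = "sample", values_to = "counts") %>%
  # derive the group index from the column position (sequential triplicates)
  mutate(group = (match(sample, colnames(normCounts)) - 1) %/% 3 + 1) %>%
  group_by(gene, group) %>%
  summarize(mean_count = mean(counts), .groups = "drop")
```

The match-based index avoids building a key table, at the cost of assuming the replicates are adjacent.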

Thank you very much, it looks very good and straightforward.

— Assa Yeroslaviz
cpad0112 (Hyderabad, India) wrote 5 weeks ago:

Solution (works for triplicates only):

> data.frame(apply(array(as.matrix(df), c(nrow(df),3, ncol(df)/3)),3, rowMeans), row.names = row.names(df))
                           X1         X2       X3
ENSMUSG00000103147  0.3327293  0.7284437  0.00000
ENSMUSG00000102269  2.8224900  1.5111600  0.00000
ENSMUSG00000096126  0.8166617  0.0000000  0.00000
ENSMUSG00000102735  0.8564373  0.0000000  0.00000
ENSMUSG00000098104 51.9554840 61.5621478 63.92300
ENSMUSG00000102175  9.1329285 14.3721768 15.13789

Input:

> df
                    Sample_1   Sample_2  Sample_3  Sample_4 Sample_5  Sample_6 Sample_7 Sample_8 Sample_9
ENSMUSG00000103147  0.000000  0.9981878  0.000000  0.000000  0.00000  2.185331  0.00000  0.00000   0.0000
ENSMUSG00000102269  4.899970  1.9963755  1.571124  1.255483  0.00000  3.277997  0.00000  0.00000   0.0000
ENSMUSG00000096126  2.449985  0.0000000  0.000000  0.000000  0.00000  0.000000  0.00000  0.00000   0.0000
ENSMUSG00000102735  0.000000  0.9981878  1.571124  0.000000  0.00000  0.000000  0.00000  0.00000   0.0000
ENSMUSG00000098104 46.549718 55.8985149 53.418219 52.730293 66.39621 65.559937 74.10861 66.88268  50.7777
ENSMUSG00000102175 14.699911  7.9855021  4.713372 10.043865 23.23867  9.833991 18.95802 13.37654  13.0791
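The same reshaping can be parameterized on the replicate count. A minimal sketch, with a hypothetical 2×6 toy matrix in place of df and k as the number of replicates per group:

```r
# Generalized version of the array trick: k replicates per group.
# A toy 2-gene x 6-sample matrix stands in for the real df.
df2 <- matrix(1:12, nrow = 2,
              dimnames = list(c("geneA", "geneB"), paste0("Sample_", 1:6)))
k <- 3
means <- data.frame(
  apply(array(as.matrix(df2), c(nrow(df2), k, ncol(df2) / k)), 3, rowMeans),
  row.names = rownames(df2)
)
# group means: geneA -> 3, 9; geneB -> 4, 10
```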

You should post this as an answer; since it uses array reshaping, it is likely the fastest.

— rpolicastro

cpad0112 takes the lead:

Unit: microseconds
        expr       min        lq       mean    median        uq       max neval
     ATpoint  1263.523  1385.396  1734.0353  1459.678  1806.964  4183.692   100
 rpolicastro 19126.280 25852.912 28991.6633 27842.674 31070.054 45964.087   100
     cpad112   188.652   206.134   280.1408   246.997   282.590   850.775   100
— ATpoint

It was posted as an answer here: How to average replicate data using R (Column and Raw).

— cpad0112

Another way to do the same (replace 3 with the number of replicates to generalize):

> t(aggregate(t(df), list(rep(1:(ncol(df)/3),each=3)), mean)[,-1])

                         [,1]       [,2]     [,3]
ENSMUSG00000103147  0.3327293  0.7284437  0.00000
ENSMUSG00000102269  2.8224900  1.5111600  0.00000
ENSMUSG00000096126  0.8166617  0.0000000  0.00000
ENSMUSG00000102735  0.8564373  0.0000000  0.00000
ENSMUSG00000098104 51.9554840 61.5621478 63.92300
ENSMUSG00000102175  9.1329285 14.3721768 15.13789
— cpad0112
ATpoint (Germany) wrote 5 weeks ago:

Not exactly tidyverse (rpolicastro probably knows a one-liner for this), but mostly base R. It assumes you define a sample2group data.frame beforehand that records the sample-to-group membership. I generally recommend this, as it stays generic if you ever end up with groups whose samples are not side by side.

library(dplyr)
sample2group <- data.frame(Sample = colnames(dat),
                           Group = rep(paste0("group", seq_len(ncol(dat)/3)), each = 3))

averaged.byGroup <- lapply(unique(sample2group$Group), function(x){

  df <- data.frame(rowMeans(dat[,sample2group[sample2group$Group == x,]$Sample]))
  colnames(df) <- x
  return(df)

}) %>% do.call(cbind, .)

                       group1     group2   group3
ENSMUSG00000103147  0.3327293  0.7284437  0.00000
ENSMUSG00000102269  2.8224900  1.5111600  0.00000
ENSMUSG00000096126  0.8166617  0.0000000  0.00000
ENSMUSG00000102735  0.8564373  0.0000000  0.00000
ENSMUSG00000098104 51.9554840 61.5621478 63.92300
ENSMUSG00000102175  9.1329285 14.3721768 15.13789
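The dplyr pipe is only used here for the final cbind; the same per-group rowMeans idea can be sketched in pure base R (toy data below stands in for dat and sample2group):

```r
# Toy 2-gene x 6-sample matrix and a sample-to-group key table.
dat <- matrix(1:12, nrow = 2,
              dimnames = list(c("geneA", "geneB"), paste0("Sample_", 1:6)))
sample2group <- data.frame(Sample = colnames(dat),
                           Group  = rep(paste0("group", 1:2), each = 3))

averaged <- do.call(cbind, lapply(
  split(sample2group$Sample, sample2group$Group),  # sample names per group
  function(s) rowMeans(dat[, s, drop = FALSE])     # mean across the group
))
```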

Thanks by the way for providing dput, that is how it should be, thumbs up!


Awesome solution as well, thanks!

— Assa Yeroslaviz

And much faster, as is often the case for low-level operations where the real power of the tidyverse is not needed. This is simply because of all the code the tidyverse has to go through internally, most of which is unnecessary here since this is very low-level, just a simple averaging.

Unit: milliseconds
        expr       min        lq      mean    median        uq      max neval
     ATpoint  1.255104  1.415695  2.164281  1.885048  2.641237 11.39588   100
 rpolicastro 19.213340 27.020071 31.849586 31.094625 34.623525 51.28777   100
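Timings like these come from the microbenchmark package; a minimal sketch of how such a comparison might be set up (the wrapper functions here are stand-ins, not the exact code from the answers, and the benchmark only runs if the package is installed):

```r
set.seed(1)
# Toy 6-gene x 9-sample matrix in place of the real count data.
m <- matrix(runif(6 * 9), nrow = 6,
            dimnames = list(paste0("gene", 1:6), paste0("Sample_", 1:9)))

# Two equivalent implementations of per-triplicate row means.
f_array  <- function(x) apply(array(x, c(nrow(x), 3, ncol(x) / 3)), 3, rowMeans)
f_sapply <- function(x) sapply(seq(1, ncol(x), by = 3),
                               function(i) rowMeans(x[, i:(i + 2)]))

# Sanity check: both give the same values before timing them.
stopifnot(all.equal(f_array(m), f_sapply(m), check.attributes = FALSE))

if (requireNamespace("microbenchmark", quietly = TRUE)) {
  print(microbenchmark::microbenchmark(f_array(m), f_sapply(m), times = 100))
}
```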
— ATpoint

I figured I would post a tidyverse solution since they asked for one, but indeed base R and data.table allow for much faster solutions.

— rpolicastro

Sure, yours is the requested one. I just said this because people (myself included) often assume the tidyverse is faster per se, but that is mostly true in complex situations where a lot of sorting and filtering is required. For low-level operations, base R (which probably links to C/C++ bindings underneath) is often the fastest.

— ATpoint

I definitely agree on that point. The tidyverse isn't faster than base R or data.table for most things, since it's designed more for convenience. I recommend that people get comfortable with base R before learning the tidyverse.

— rpolicastro

This is a nice observation. It doesn't really affect my data, but for bigger data sets it might be of great value. Thanks.

— Assa Yeroslaviz
Powered by Biostar version 2.3.0