Question: How to split a data.frame and calculate a row-wise average
1
5 weeks ago by
Assa Yeroslaviz1.4k
Munich
Assa Yeroslaviz1.4k wrote:

I have a `data.frame` with 48 columns. the columns are samples in triplicates, the rows are genes. The values are expression values.

For each group of triplicates I would like to calculate the averaged expression per gene, resulting in a new `data.frame` with 16 columns and for each row the average of the three triplicates for each group.

can this be done with the `tidyverse` tools?

thanks

a small example table of 6x9 is here

``````dput(head(normCounts[,1:9]))
structure(c(0, 4.89997034943019, 2.4499851747151, 0, 46.5497183195869,
14.6999110482906, 0.998187766715749, 1.9963755334315, 0, 0.998187766715749,
55.898514936082, 7.98550213372599, 0, 1.57112407949228, 0, 1.57112407949228,
53.4182187027374, 4.71337223847683, 0, 1.25548317693578, 0, 0,
52.7302934313026, 10.0438654154862, 0, 0, 0, 0, 66.3962127189125,
23.2386744516194, 2.18533123780511, 3.27799685670766, 0, 0, 65.5599371341532,
9.83399057012298, 0, 0, 0, 0, 74.1086143860152, 18.9580176336318,
0, 0, 0, 0, 66.8826789069951, 13.376535781399, 0, 0, 0, 0, 50.7776960416371,
13.0791035258762), .Dim = c(6L, 9L), .Dimnames = list(c("ENSMUSG00000103147",
"ENSMUSG00000102269", "ENSMUSG00000096126", "ENSMUSG00000102735",
"ENSMUSG00000098104", "ENSMUSG00000102175"), c("Sample_1", "Sample_2",
"Sample_3", "Sample_4", "Sample_5", "Sample_6", "Sample_7", "Sample_8",
"Sample_9")))
``````
tidyverse data.frame • 159 views
modified 5 weeks ago by ATpoint42k • written 5 weeks ago by Assa Yeroslaviz1.4k

Are the triplicate groups sequential? As in, are Sample_1, Sample_2, and Sample_3 part of the first group?

yes from `Sample_1` to `Sample_48`

3
5 weeks ago by
rpolicastro2.3k
rpolicastro2.3k wrote:

It will be a lot simpler to first make a group key table.

``````library("tidyverse")

groups <- tibble(
sample=colnames(normCounts),
group=rep(seq(1, ncol(normCounts)/3), each=3)
)

> groups
# A tibble: 9 x 2
sample   group
<chr>    <int>
1 Sample_1     1
2 Sample_2     1
3 Sample_3     1
4 Sample_4     2
5 Sample_5     2
6 Sample_6     2
7 Sample_7     3
8 Sample_8     3
9 Sample_9     3
``````

Then you can pivot your count data to long format, join the groups, and group_by/summarize to get the means.

``````mean_exp <- normCounts %>%
as_tibble(rownames="gene") %>%
pivot_longer(starts_with("Sample"), names_to="sample", values_to="counts") %>%
left_join(groups, by="sample") %>%
group_by(gene, group) %>%
summarize(mean_count=mean(counts))

# A tibble: 6 x 3
# Groups:   gene [2]
gene               group mean_count
<chr>              <int>      <dbl>
1 ENSMUSG00000096126     1      0.817
2 ENSMUSG00000096126     2      0
3 ENSMUSG00000096126     3      0
4 ENSMUSG00000098104     1     52.0
5 ENSMUSG00000098104     2     61.6
6 ENSMUSG00000098104     3     63.9
``````

You can pivot back to a wider format if you want too.

``````mean_exp_wider <- pivot_wider(mean_exp, names_from=group, values_from=mean_count)

> mean_exp_wider
# A tibble: 6 x 4
# Groups:   gene [6]
gene                  `1`    `2`   `3`
<chr>               <dbl>  <dbl> <dbl>
1 ENSMUSG00000096126  0.817  0       0
2 ENSMUSG00000098104 52.0   61.6    63.9
3 ENSMUSG00000102175  9.13  14.4    15.1
4 ENSMUSG00000102269  2.82   1.51    0
5 ENSMUSG00000102735  0.856  0       0
6 ENSMUSG00000103147  0.333  0.728   0
``````

thank you very much, it looks very good and straight forward.

3
5 weeks ago by

Solution (works for triplicates only):

``````> data.frame(apply(array(as.matrix(df), c(nrow(df),3, ncol(df)/3)),3, rowMeans), row.names = row.names(df))
X1         X2       X3
ENSMUSG00000103147  0.3327293  0.7284437  0.00000
ENSMUSG00000102269  2.8224900  1.5111600  0.00000
ENSMUSG00000096126  0.8166617  0.0000000  0.00000
ENSMUSG00000102735  0.8564373  0.0000000  0.00000
ENSMUSG00000098104 51.9554840 61.5621478 63.92300
ENSMUSG00000102175  9.1329285 14.3721768 15.13789
``````

input:

``````> df
Sample_1   Sample_2  Sample_3  Sample_4 Sample_5  Sample_6 Sample_7 Sample_8 Sample_9
ENSMUSG00000103147  0.000000  0.9981878  0.000000  0.000000  0.00000  2.185331  0.00000  0.00000   0.0000
ENSMUSG00000102269  4.899970  1.9963755  1.571124  1.255483  0.00000  3.277997  0.00000  0.00000   0.0000
ENSMUSG00000096126  2.449985  0.0000000  0.000000  0.000000  0.00000  0.000000  0.00000  0.00000   0.0000
ENSMUSG00000102735  0.000000  0.9981878  1.571124  0.000000  0.00000  0.000000  0.00000  0.00000   0.0000
ENSMUSG00000098104 46.549718 55.8985149 53.418219 52.730293 66.39621 65.559937 74.10861 66.88268  50.7777
ENSMUSG00000102175 14.699911  7.9855021  4.713372 10.043865 23.23867  9.833991 18.95802 13.37654  13.0791
``````

You should post this as an answer, since it uses array reshaping it would be the fastest.

``````Unit: microseconds
expr       min        lq       mean    median        uq       max neval
ATpoint  1263.523  1385.396  1734.0353  1459.678  1806.964  4183.692   100
rpolicastro 19126.280 25852.912 28991.6633 27842.674 31070.054 45964.087   100
cpad112   188.652   206.134   280.1408   246.997   282.590   850.775   100
``````

it was posted as answer here: How to average replicate data using R (Column and Raw).

another way of the same (replace 3 with number of replicates, for generalization):

``````> t(aggregate(t(df), list(rep(1:(ncol(df)/3),each=3)), mean)[,-1])

[,1]       [,2]     [,3]
ENSMUSG00000103147  0.3327293  0.7284437  0.00000
ENSMUSG00000102269  2.8224900  1.5111600  0.00000
ENSMUSG00000096126  0.8166617  0.0000000  0.00000
ENSMUSG00000102735  0.8564373  0.0000000  0.00000
ENSMUSG00000098104 51.9554840 61.5621478 63.92300
ENSMUSG00000102175  9.1329285 14.3721768 15.13789
``````
2
5 weeks ago by
ATpoint42k
Germany
ATpoint42k wrote:

Not exactly tidyverse (rpolicastro probably knows a one-liner for this) but mostly base R. It assumes that you define a `sample2group` data.frame beforehand that indicates the sample to group membership. I generally recommend this as it is generic if you ever end up with groups where the individual samples are not side-by-side.

``````library(dplyr)
sample2group <- data.frame(Sample = colnames(dat),
Group = unlist(lapply(paste0("group", seq(1,ncol(dat)/3)), function(x) rep(x, 3))))

averaged.byGroup <- lapply(unique(sample2group\$Group), function(x){

df <- data.frame(rowMeans(dat[,sample2group[sample2group\$Group == x,]\$Sample]))
colnames(df) <- x
return(df)

}) %>% do.call(cbind, .)

group1     group2   group3
ENSMUSG00000103147  0.3327293  0.7284437  0.00000
ENSMUSG00000102269  2.8224900  1.5111600  0.00000
ENSMUSG00000096126  0.8166617  0.0000000  0.00000
ENSMUSG00000102735  0.8564373  0.0000000  0.00000
ENSMUSG00000098104  51.9554840 61.5621478 63.92300
ENSMUSG00000102175  9.1329285  14.3721768 15.13789
``````

Thanks by the way for providing `dput`, that is how it should be, thumbs up!

awesome solution as well, thanks

1

And much faster, as often when performing low-level operations where the actual power of tidyverse is not required. This is simply because of all the code tidyverse has to go through internally (most of which is not required as this is ver low-level here, just an easy averaging).

``````Unit: milliseconds
expr       min        lq      mean    median        uq      max neval
ATpoint  1.255104  1.415695  2.164281  1.885048  2.641237 11.39588   100
rpolicastro 19.213340 27.020071 31.849586 31.094625 34.623525 51.28777   100
``````

I figured I would post a tidyverse solution since they asked for one, but indeed base R and data.table allows for much faster solutions.

1

Sure, your is the requested one. I just said this because people (including myself) often thought that tidyverse is per se faster, but this is rather true in complex situations where a lot of sorting and filtering is required. For low-level operations base R (which then probably links to some C/Cpp bindings) is often the fastest.

2

I definitely agree on that point. Tidyverse isn't faster than base R or data.table for most things since it's designed more so for convenience. I definitely recommend people to get comfortable with base R before learning tidyverse.