How do i calculate the mean of triplicates in a data.frame based on pattern?
3
1
Entering edit mode
26 days ago
Assa Yeroslaviz ★ 1.9k

I have a data.frame with six samples in two triplicates. I would like to calculates for each group the mean of the this group into a new column. What is the best way to do this? My data looks like that

head(p767)
sgrna p767.AM_1 p767.AM_2 p767.AM_3 p767.MM_1 p767.MM_2 p767.MM_3
1  Control-104-KO-64-F       758       682       713      1867      1840      1251
2  Control-110-KO-29-F       674       561       642      1706      1737      1141
3  Control-112-KO-65-F       584       518       495      1328      1448      1012
4  Control-125-KO-31-F       565       462       569      1345      1258       856
5 Control-128-KO-136-F       463       452       424      1253      1258       721
6  Control-132-KO-57-F       854       698       755      1974      1977      1326


I would like to a new data.frame with the mean values calculated as such:

head(p767.m)
p767.AM p767.MM
Control-104-KO-64-F  717.6667 1652.667
Control-110-KO-29-F  625.6667 1528.000
Control-112-KO-65-F  532.3333 1262.667
Control-125-KO-31-F  532.0000 1153.000
Control-128-KO-136-F 446.3333 1077.333
Control-132-KO-57-F  769.0000 1759.000


Below is the way to read it into R. I would appreciate the help.

Thanks

structure(list(sgrna = c("Control-104-KO-64-F", "Control-110-KO-29-F",
"Control-112-KO-65-F", "Control-125-KO-31-F", "Control-128-KO-136-F",
"Control-132-KO-57-F"), p767.AM_1 = c(758L, 674L, 584L, 565L,
463L, 854L), p767.AM_2 = c(682L, 561L, 518L, 462L, 452L, 698L
), p767.AM_3 = c(713L, 642L, 495L, 569L, 424L, 755L), p767.MM_1 = c(1867L,
1706L, 1328L, 1345L, 1253L, 1974L), p767.MM_2 = c(1840L, 1737L,
1448L, 1258L, 1258L, 1977L), p767.MM_3 = c(1251L, 1141L, 1012L,
856L, 721L, 1326L)), row.names = c(NA, 6L), class = "data.frame")

tidyverse r • 679 views
0
Entering edit mode

I know it can be done with something like that:

cbind(p767.AM = (rowMeans(p767[,2:4])),
p767.MM = (rowMeans(p767[,5:7])))


but it would be nice to have a tidyer way of doing it.

4
Entering edit mode
26 days ago

I'm a happy user of data.table:

library(data.table)

dat <- structure(list(sgrna = c("Control-104-KO-64-F", "Control-110-KO-29-F",
"Control-112-KO-65-F", "Control-125-KO-31-F", "Control-128-KO-136-F",
"Control-132-KO-57-F"), p767.AM_1 = c(758L, 674L, 584L, 565L,
463L, 854L), p767.AM_2 = c(682L, 561L, 518L, 462L, 452L, 698L
), p767.AM_3 = c(713L, 642L, 495L, 569L, 424L, 755L), p767.MM_1 = c(1867L,
1706L, 1328L, 1345L, 1253L, 1974L), p767.MM_2 = c(1840L, 1737L,
1448L, 1258L, 1258L, 1977L), p767.MM_3 = c(1251L, 1141L, 1012L,
856L, 721L, 1326L)), row.names = c(NA, 6L), class = "data.frame")

setDT(dat)

ldat <- melt(dat, id.vars='sgrna', variable.name='library_id', value.name='count')
ldat[, sample_id := sub('_\\d+\$', '', library_id)]

# Averages in long format:
cntAvg <- ldat[, list(avg=mean(count)), by=list(sgrna, sample_id)]

# If you want to get back to wide format
dcast(cntAvg, sgrna ~ sample_id)


A couple of notes:

• It is almost always better to work with data in long format (or tidy or normalised)

• I would prefer to get the sample names from a sample sheet rather than parsing the library names with some dodgy regex.

1
Entering edit mode

I would prefer to get the sample names from a sample sheet rather than parsing the library names with some dodgy regex.

+1 !!!

Anecdote: I analyzed data last week where they indeed tried to put metadata into the filename, resulting in a 160 character string that needed deparsing (demessification). Please, people, unless it's a really short and simple encoding, please use sample/annotation sheets.

0
Entering edit mode

I would prefer to get the sample names from a sample sheet rather than parsing the library names with some dodgy regex.

Not really sure, what you mean here. I read the whole count matrix from a data file. Do you mean a separate file with a list of sample names( one name per row)?

0
Entering edit mode

Thanks to your help I managed to do it also in tidyverse, by using first pivot_longer creating a new column with group, calculating the mean by grouping the sgrna and group. Then using pivot_wider to convert it to the format i need removing all unnecessary rows.

melted_df <- pivot_longer(dat, cols = starts_with("p"), names_to = "project", values_to = "counts") |>
mutate(group = gsub(pattern = "\\_[1|2|3]", replacement = "", x = project)) |>
group_by(sgrna, group) |> mutate(avg = mean(counts)) |>
distinct(sgrna,group,avg) |>
pivot_wider(id_cols = sgrna, names_from = group, values_from = avg) |> as.data.frame()


thanks again

Still not sure, how you meant it, when you mentioned reading the sample names from a text file.

1
Entering edit mode
24 days ago
gglim ▴ 160
library(dplyr)
df %>% rowwise() %>% mutate(p767.AM = mean(c_across(contains("AM"))))


Maybe this is what you're looking for?

0
Entering edit mode

I guess the OP doesn't want to hard-code p767.AM and "AM"?

0
Entering edit mode

I guess I can live with that. :-)

Can one create a vectors of patterns, e.g. c("AM, "MM") and use it as input to the mutate command?

1
Entering edit mode
meanofcol <- function(df, col) {
df %>% rowwise() %>% mutate("Mean of {{col}}" := mean(c_across(contains({{col}}))))
}
meanofcol(df, col = "AM")


This seems good for me, reference here

0
Entering edit mode
24 days ago
MolGeek ▴ 80

Or just use rowMeans and provide the desired columns (for dataframes with low number of columns):

p767.m <- data.frame(p767.AM = rowMeans(p767[,1:3]) , p767.MM = rowMeans(p767[,4:6])   )

1
Entering edit mode

Works here, but it's not generic or tidy since you have to manually enter the correct numbers and enter the colnames of the output.

0
Entering edit mode

You are right, thats a quick and dirty answer to get the means.

0
Entering edit mode

exactly, this would be needed to be monitored closely, as the structure changes often.