Averaging duplicates of a gene in RNA-Seq dataset
2.8 years ago

Hi,

I am working with an RNA-Seq dataset and have the raw counts file. I notice that there are 58785 genes in the "Gene Symbol" column and that some gene symbols appear twice (shown below). In this scenario, what is the best practice for handling these genes? Do we simply average them, or sum them, before using them in downstream analysis?

dput(head(Counts, 5))
structure(list(symbol = c("BM", "A2GGG", "A2GGG", "P1P", 
"P1P"), Sample_A = c(0L, 0L, 82L, 46L, 6L), Sample_B = c(1L, 
0L, 64L, 49L, 5L), Sample_C = c(2L, 0L, 96L, 44L, 6L), Sample_D = c(5L, 
0L, 85L, 38L, 3L), Sample_E = c(1L, 0L, 80L, 48L, 6L), Sample_F = c(1L, 
0L, 77L, 49L, 4L)), row.names = c(NA, 5L), class = "data.frame")

Average

(A2GGG + A2GGG)/2 = A2GGG

Sum

A2GGG + A2GGG = A2GGG

Thank you,

Toufiq

expression differential average R rna-seq • 2.7k views

How did you generate the counts and how was the raw data processed?


And what genome version did you use?


I would prefer to take the median instead of the mean.


The median and the mean are the same when there are only two values.


ATpoint, thank you very much.

In past analyses, I have used the mean, as follows (the symbol column is excluded from the values being averaged, since taking the mean of a character column only yields NA):

Counts <- aggregate(Counts[, -1], by = list(symbol = Counts$symbol), FUN = mean)

So, I understand it is OK to use either the mean or the median, right? Any input on using the sum to aggregate the counts?

Are there specific scenarios in which the mean, median, or sum should be used?
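To make the difference between the options concrete, here is a minimal sketch that collapses the example data from the question by mean and by sum. The names `sample_cols`, `by_mean`, and `by_sum` are introduced only for illustration. Whether collapsing is appropriate at all depends on where the duplicates come from, so treat this as a demonstration of the mechanics rather than a recommendation.

```r
# Example data copied from the dput() output in the question.
Counts <- data.frame(
  symbol   = c("BM", "A2GGG", "A2GGG", "P1P", "P1P"),
  Sample_A = c(0L, 0L, 82L, 46L, 6L),
  Sample_B = c(1L, 0L, 64L, 49L, 5L),
  Sample_C = c(2L, 0L, 96L, 44L, 6L),
  Sample_D = c(5L, 0L, 85L, 38L, 3L),
  Sample_E = c(1L, 0L, 80L, 48L, 6L),
  Sample_F = c(1L, 0L, 77L, 49L, 4L)
)

# Aggregate only the numeric sample columns, grouped by gene symbol.
sample_cols <- setdiff(names(Counts), "symbol")

by_mean <- aggregate(Counts[sample_cols],
                     by = list(symbol = Counts$symbol), FUN = mean)
by_sum  <- aggregate(Counts[sample_cols],
                     by = list(symbol = Counts$symbol), FUN = sum)

print(by_mean)  # averaging can give fractional values
print(by_sum)   # summing keeps the values as whole counts
```

One practical difference: the sum stays an integer, whereas the mean can be fractional (P1P in Sample_D becomes 20.5 here). Some count-based tools, DESeq2 for example, expect integer counts, so averaged values would need rounding before being passed to them.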


I am not sure whether this makes sense. I know that duplicated gene names are a pain, but these genes have unique Ensembl gene IDs and come from different genomic coordinates, so averaging is suboptimal. Why not use something like EnsemblGeneID_GeneName as the identifier, i.e. a concatenation of the Ensembl ID and the gene name? Then you can simply keep all genes. Or make the names unique, such as Gene1 and Gene1a, and only care about them if they end up being differentially expressed. If not, simply forget about them. Just thinking aloud.
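Both ideas can be sketched in a few lines of base R. The example data in the question has no Ensembl column, so the `ensembl_id` values below are hypothetical placeholders, not real identifiers. Note also that `make.unique()` appends numeric suffixes (`.1`, `.2`, ...) rather than letters.

```r
# Gene symbols from the example data in the question.
symbol <- c("BM", "A2GGG", "A2GGG", "P1P", "P1P")

# Hypothetical placeholder IDs standing in for real Ensembl gene IDs,
# which are assumed to be unique per row.
ensembl_id <- paste0("ENSG_PLACEHOLDER_", seq_along(symbol))

# Option 1: concatenate the Ensembl ID and the gene name.
combined_id <- paste(ensembl_id, symbol, sep = "_")

# Option 2: make the symbols unique by adding a suffix to repeats.
unique_symbol <- make.unique(symbol)

# Flag every row whose symbol occurs more than once, so these rows
# can be inspected later if they turn out to be differential.
is_duplicated <- symbol %in% symbol[duplicated(symbol)]

print(data.frame(symbol, combined_id, unique_symbol, is_duplicated))
```

Either identifier keeps every row, so no counts are merged across what may be distinct loci.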


ATpoint, thank you.

The reason I am trying to collapse the data into a single value per gene is that I will be mapping/merging this gene-level matrix to a third-party gene annotation database. If I use make.unique(), the genes renamed with a suffix (for instance Gene1a, Gene1b, ...) will be lost during mapping, since the third-party database contains only the plain gene name (Gene1, but not Gene1a or Gene1b). So it is important for me to include an averaged gene value when mapping.
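The mapping concern can be illustrated with a small sketch. The `annotation` table and its `category` column are hypothetical stand-ins for the third-party database, and only one sample column from the example data is used to keep it short. The sum is used for collapsing here purely as an example; the mean could be substituted.

```r
# One sample column from the example data in the question.
counts <- data.frame(
  symbol   = c("BM", "A2GGG", "A2GGG", "P1P", "P1P"),
  Sample_A = c(0L, 0L, 82L, 46L, 6L)
)

# Hypothetical third-party annotation keyed on the plain gene symbol.
annotation <- data.frame(
  symbol   = c("BM", "A2GGG", "P1P"),
  category = c("cat1", "cat2", "cat3")
)

# Renaming with make.unique(): suffixed rows (A2GGG.1, P1P.1) have no
# match in the annotation and are dropped by the default inner join.
renamed <- counts
renamed$symbol <- make.unique(renamed$symbol)
merged_renamed <- merge(renamed, annotation, by = "symbol")

# Collapsing first: every symbol matches, and no counts are dropped.
collapsed <- aggregate(Sample_A ~ symbol, data = counts, FUN = sum)
merged_collapsed <- merge(collapsed, annotation, by = "symbol")

print(merged_renamed)
print(merged_collapsed)
```

In the renamed version, the row that happens to come first keeps the plain name, so which duplicate survives the merge depends only on row order.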
