3.3 years ago
mohammedtoufiq91
Hi,
I am working with an RNA-Seq dataset and have the raw counts file. I noticed that there are 58785 genes in the "Gene Symbol" column and that some gene symbols appear twice (shown below). In this scenario, what is the best practice for handling such genes? Do we simply average them or sum them before using them in downstream analysis?
dput(head(Counts, 5))
structure(list(symbol = c("BM", "A2GGG", "A2GGG", "P1P",
"P1P"), Sample_A = c(0L, 0L, 82L, 46L, 6L), Sample_B = c(1L,
0L, 64L, 49L, 5L), Sample_C = c(2L, 0L, 96L, 44L, 6L), Sample_D = c(5L,
0L, 85L, 38L, 3L), Sample_E = c(1L, 0L, 80L, 48L, 6L), Sample_F = c(1L,
0L, 77L, 49L, 4L)), row.names = c(NA, 5L), class = "data.frame")
Average: (A2GGG + A2GGG)/2 = A2GGG
Sum: A2GGG + A2GGG = A2GGG
Thank you,
Toufiq
How did you generate the counts and how was the raw data processed?
And what genome version did you use?
I would prefer to take the median instead of the mean.
The median and the mean are the same when there are only two values.
ATpoint, thank you very much.
For the past data analysis experiments, I have used mean using the following
So, I understand it is OK to use either mean or median, right? Any input on using sum for aggregating the counts? Are there specific scenarios in which mean, median, or sum should be used?
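To make the comparison concrete, here is a minimal sketch that collapses the example rows from the question with sum, mean, and median. The data frame is rebuilt from the posted `dput()` output; the helper name `collapse_by` is made up for illustration, and `aggregate()` is just one of several ways to do this in base R.

```r
# Example data copied from the dput() output in the question.
Counts <- data.frame(
  symbol   = c("BM", "A2GGG", "A2GGG", "P1P", "P1P"),
  Sample_A = c(0L, 0L, 82L, 46L, 6L),
  Sample_B = c(1L, 0L, 64L, 49L, 5L),
  Sample_C = c(2L, 0L, 96L, 44L, 6L),
  Sample_D = c(5L, 0L, 85L, 38L, 3L),
  Sample_E = c(1L, 0L, 80L, 48L, 6L),
  Sample_F = c(1L, 0L, 77L, 49L, 4L)
)

# Hypothetical helper: collapse rows that share a symbol using `fun`.
collapse_by <- function(fun) {
  aggregate(. ~ symbol, data = Counts, FUN = fun)
}

by_sum    <- collapse_by(sum)     # stays integer-valued
by_mean   <- collapse_by(mean)    # can produce fractions, e.g. (38 + 3) / 2
by_median <- collapse_by(median)  # equals the mean when a symbol has <= 2 rows

print(by_sum)
print(by_mean)
```

With at most two rows per symbol, mean and median give identical tables, so the real choice is between sum and mean. Sum keeps the values as whole numbers, which may matter if the downstream tool expects integer counts; mean does not guarantee that.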
I am not sure whether this makes sense. I know that duplicated gene names are a pain, but these have unique Ensembl Gene IDs and come from different genomic coordinates, so averaging is suboptimal. Why not just use something like EnsemblGeneID_GeneName as an identifier, i.e. a concatenation of the Ensembl ID and the gene name? Then you can simply keep all genes. Or make them unique, like Gene1 and Gene1a, and then only care about them if they end up being differential. If not, simply forget about them. Just thinking aloud.
ATpoint, thank you.
The reason I am trying to collapse the data into a single value per gene is that I will be mapping/merging this gene-level matrix to a third-party gene annotation database. If I use make.unique(), the genes renamed with a suffix (for instance Gene1, Gene1a ...) will be lost during mapping, since the third-party database only contains the original symbol (Gene1, but not Gene1a or Gene1b, etc.). So it is important for me to have one averaged value per gene when mapping.
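The mapping concern can be illustrated with a small sketch. The `counts` and `annotation` tables below are toy inputs and the `module` column is made up; the real third-party database will look different. Note that base R's `make.unique()` appends numeric suffixes such as `.1` rather than letters, but the effect on the merge is the same.

```r
# Toy inputs (assumed for illustration): counts with a duplicated symbol and
# an annotation table that lists each symbol only once.
counts <- data.frame(
  symbol   = c("BM", "P1P", "P1P"),
  Sample_A = c(0L, 46L, 6L),
  Sample_B = c(1L, 49L, 5L)
)
annotation <- data.frame(
  symbol = c("BM", "P1P"),
  module = c("M1", "M2")  # made-up annotation column
)

# Option 1: make.unique() renames the second "P1P" to "P1P.1". That name has
# no match in the annotation, so the row is dropped by the default inner join.
renamed <- counts
renamed$symbol <- make.unique(renamed$symbol)
merged_renamed <- merge(renamed, annotation, by = "symbol")

# Option 2: collapse duplicates first (sum shown; mean works the same way),
# so each symbol occurs once and matches the annotation.
collapsed <- aggregate(. ~ symbol, data = counts, FUN = sum)
merged_collapsed <- merge(collapsed, annotation, by = "symbol")

print(merged_renamed)    # P1P row holds only the first duplicate's counts
print(merged_collapsed)  # P1P row holds the combined counts
```

In the first option the merge silently keeps only one of the two P1P rows, whereas collapsing beforehand carries the information from both rows into the merged table.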