How to get rid of duplicated genes when we also have duplicated Ensembl IDs in the expression profile?
3.6 years ago
Raheleh ▴ 260

Hi all,

I have a mouse expression profile annotated with gene symbols, many of which are duplicated. I usually use the collapseRows function with the MaxMean method from the WGCNA package to get rid of duplicated genes. However, this time I realized that there are also some duplicated Ensembl IDs. Can anyone help me with how to deal with this situation? Should I simply remove the duplicated Ensembl IDs and then use collapseRows for the duplicated genes? This is part of my data:

ENSMUSG00000019864  Rtn4ip1 3.33471 2.18619 3.52304 4.13997 2.91682 3.17805
ENSMUSG00000019864  Rtn4ip1 0.141481    0   0.126809    0.140919    0   0.159667
ENSMUSG00000019865  Nmbr    0.0325972   0   0.056908    0.0324288   0.305734    0
ENSMUSG00000019866  Crybg1  8.79001 6.82754 13.9235 15.1803 9.54965 11.3725
ENSMUSG00000019867  Gje1    0   0   0   0   0   0

As you can see, the ID ENSMUSG00000019864, for example, is duplicated with different expression values.

I really appreciate any help or suggestion!

RNA-Seq duplicated ENSEMBLE ID collapseRows • 1.9k views

It looks like you have transcript-level expression reported. I would add the values for the same condition within the same gene, so Rtn4ip1 should be:

ENSMUSG00000019864  Rtn4ip1 3.33471+0.141481 2.18619+0 3.52304+0.126809  4.13997+0.140919  2.91682+0 3.17805+0.159667

What about taking the average instead? Is there any R package that can do this?


If they are transcripts it would make more sense to add them to get gene expression values, since all of those sequencing reads aligned to a transcript from the same gene.


Thanks rpolicastro! Oh yes, that makes more sense. Is there any package for doing this in R?


You can use dplyr. Make sure you have dplyr v1.0.0 or higher.

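library("dplyr")

# Sum every expression column within each (Ensembl ID, gene symbol) pair,
# assuming columns 1 and 2 hold the ID and the symbol.
df <- df %>%
  group_by(across(c(1, 2))) %>%
  summarize(across(everything(), sum), .groups = "drop")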
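
If you would rather avoid an extra dependency, roughly the same summing can be sketched in base R with rowsum(). This is only an illustration: the toy table copies a few rows from the question, and the column names (ensembl_id, symbol, s1 to s3) are made up, so adjust them to your own data.

```r
# Toy data copied from rows in the question (first three samples only);
# the column names are hypothetical and chosen for illustration.
df <- data.frame(
  ensembl_id = c("ENSMUSG00000019864", "ENSMUSG00000019864",
                 "ENSMUSG00000019865"),
  symbol     = c("Rtn4ip1", "Rtn4ip1", "Nmbr"),
  s1 = c(3.33471, 0.141481, 0.0325972),
  s2 = c(2.18619, 0, 0),
  s3 = c(3.52304, 0.126809, 0.056908),
  stringsAsFactors = FALSE
)

# Sum the numeric columns over rows that share the same Ensembl ID.
expr_cols <- c("s1", "s2", "s3")
gene_sums <- rowsum(df[, expr_cols], group = df$ensembl_id)

# Re-attach one gene symbol per ID (assumes each ID maps to a single symbol).
gene_sums$symbol <- df$symbol[match(rownames(gene_sums), df$ensembl_id)]
print(gene_sums)
```

The result has one row per Ensembl ID, with the IDs as row names.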

Where did you get the expression profile from, and/or how was it generated? It would be good to first figure out how it ended up with duplicated values.
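
As a possible first check, you could list which IDs occur more than once and look at their rows side by side. The snippet below is a minimal base R sketch on a made-up toy table; the data frame and its column names (ensembl_id, symbol) are assumptions, not your actual file layout.

```r
# Hypothetical toy table; in practice `df` would be the real expression profile.
df <- data.frame(
  ensembl_id = c("ENSMUSG00000019864", "ENSMUSG00000019864",
                 "ENSMUSG00000019865", "ENSMUSG00000019866"),
  symbol     = c("Rtn4ip1", "Rtn4ip1", "Nmbr", "Crybg1"),
  stringsAsFactors = FALSE
)

# Count how many rows each Ensembl ID has and keep the IDs seen more than once.
id_counts <- table(df$ensembl_id)
dup_ids <- names(id_counts[id_counts > 1])

# Pull out all rows for the duplicated IDs so they can be inspected together.
dup_rows <- df[df$ensembl_id %in% dup_ids, ]
print(dup_rows)
```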


I got it from someone; she said it is FPKM data from the Cufflinks pipeline.


Hi, I've got the same problem. Did you figure out how it ended up with duplicated values? I am not sure if this is transcript expression.
