Question

How to deal with one gene symbol matching multiple Ensembl IDs?

2

5.4 years ago

dz2353 ▴ 120

Dear friends,

I am a rookie in RNA-Seq analysis, so I am really confused about Ensembl IDs and gene symbols. I have checked some related posts on Biostars, but I did not find what I was looking for. The problem is that I noticed one Ensembl ID matches multiple gene symbols, and I do not know how to deal with this when I want to do analysis based on gene symbols. Should I add the counts of the same symbol together? Thanks in advance for the reply!

RNA-Seq ensembl • 8.1k views

ADD COMMENT • link updated 3.4 years ago by Biostar 20 • written 5.4 years ago by dz2353 ▴ 120

0

Hello dz2353,

Could you please post an example?

fin swimmer

ADD REPLY • link 5.4 years ago by finswimmer 16k

0

Are you sure it's "multiple Ensembl IDs for a gene symbol" or "multiple gene symbols for an Ensembl ID"?

ADD REPLY • link 5.4 years ago by Arup Ghosh 3.2k

0

Sorry, I made a mistake. Thanks for the comment, @arup. I have corrected my question to: an Ensembl ID matches multiple gene symbols. I have posted a picture of my data, where you can see that there are many rows labelled RF00019. Should I add all of them together? https://ibb.co/9T1xkBD

ADD REPLY • link 5.4 years ago by dz2353 ▴ 120

0

The issue is the version suffix after the decimal point in your ENSG IDs. Remove it.

ADD REPLY • link 5.4 years ago by 1769mkc ★ 1.2k

0

I know that the part after the decimal point represents the version. So what I need to do is delete the part after the decimal point and then convert the Ensembl IDs to gene symbols? Thanks in advance for the reply!

ADD REPLY • link 5.4 years ago by dz2353 ▴ 120

0

Yes, that's what you have to do.
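For readers following along, here is a minimal sketch of the version-stripping step discussed above. The function name `strip_version` and the example ID are made up for illustration; the only assumption is that a versioned ID looks like `<stable id>.<number>`.

```python
def strip_version(ensembl_id):
    """Return the stable ID without a trailing '.<number>' version suffix."""
    base, sep, version = ensembl_id.rpartition(".")
    # Only strip when the part after the last dot is purely numeric;
    # anything else is returned unchanged rather than guessed at.
    if sep and version.isdigit():
        return base
    return ensembl_id


# Placeholder IDs, not taken from the poster's data
versioned = "ENSG90000000001.12"
unversioned = "ENSG90000000001"
print(strip_version(versioned))
print(strip_version(unversioned))
```

IDs that do not end in a numeric suffix are left untouched, so the function can be applied to a column that mixes versioned and unversioned IDs.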

ADD REPLY • link 5.4 years ago by 1769mkc ★ 1.2k

0

Thank you! I will try it.

ADD REPLY • link 5.4 years ago by dz2353 ▴ 120

score 8 · Accepted Answer · 2018-12-03

8

5.4 years ago

Kristoffer Vitting-Seerup ★ 4.0k

You have run into the problem that in the human genome there are gene_names that are associated with multiple genomic loci (RF00019 in the link you posted). Since they are associated with different loci, they also have different gene_ids. Last time I checked there were ~100 such gene_names in the human genome, many of which are located on different chromosomes.
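To check how many gene_names are affected in a given annotation, something along these lines could be used. It is only a sketch: the `annotation` pairs and the helper `names_with_multiple_ids` are hypothetical, and in practice the pairs would come from the GTF or annotation table used for quantification.

```python
from collections import defaultdict

# Hypothetical (gene_id, gene_name) pairs; the IDs are placeholders
annotation = [
    ("ENSG90000000001", "RF00019"),
    ("ENSG90000000002", "RF00019"),
    ("ENSG90000000003", "GENE_A"),
]


def names_with_multiple_ids(pairs):
    """Map each gene_name to its gene_ids, keeping names with more than one ID."""
    ids_by_name = defaultdict(set)
    for gene_id, gene_name in pairs:
        ids_by_name[gene_name].add(gene_id)
    return {
        name: sorted(ids)
        for name, ids in ids_by_name.items()
        if len(ids) > 1
    }


ambiguous = names_with_multiple_ids(annotation)
for name, ids in ambiguous.items():
    print(name, ids)
```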

I would always analyze the data with gene_ids (!) simply because otherwise you assume that the different loci produce identical products, which may or may not be the case. Furthermore, analysis at the gene_id level lets you study other things, such as regulation and isoform switches. Lastly, if you want to do any downstream analysis (GO terms, gene-set enrichment analysis, etc.), you should NEVER use gene_names. The problem is that in many cases gene_names are too unspecific: many different gene names point to the same gene, and a single gene name can point to multiple genes.

Cheers Kristoffer

ADD COMMENT • link 5.4 years ago by Kristoffer Vitting-Seerup ★ 4.0k

0

Well, thanks for your answer, Kristoffer. So what you mean is that I should use gene IDs instead of gene names or symbols? Some points are still unclear to me. You mentioned that you always analyze the data with gene IDs (here, the Ensembl IDs), so what kind of analysis do you do? I want to perform a PCA on these data to examine the structure of the samples, that is, to see whether samples of the same kind cluster together. And could you please explain in more detail why gene names (symbols) should not be used?

ADD REPLY • link 5.4 years ago by dz2353 ▴ 120

4

Yes, just use gene_ids (such as Ensembl IDs). With those you get an expression matrix just like normal (only slightly larger), which you can use for all downstream analysis. Which "name" you assign to a gene (gene name, symbol or ID) does not matter for the downstream analysis.
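As a rough illustration of that last point for the PCA case, the sketch below computes first-principal-component sample scores from a small matrix keyed by gene_id and shows that swapping the row labels leaves the result unchanged. Everything here is assumed for the example: the counts, sample names and placeholder IDs are invented, and the pure-Python power iteration stands in for whatever PCA routine you would normally use.

```python
import math

# Invented counts keyed by placeholder gene IDs; columns follow `samples`
samples = ["ctrl_1", "ctrl_2", "treat_1", "treat_2"]
counts_by_id = {
    "ENSG90000000001": [100, 110, 400, 420],
    "ENSG90000000002": [50, 45, 10, 12],
    "ENSG90000000003": [200, 210, 205, 190],
    "ENSG90000000004": [5, 8, 60, 55],
}


def first_pc_scores(matrix, n_iter=200):
    """Return the PC1 score of each sample (sign is arbitrary)."""
    # log2(count + 1), then centre each gene across samples
    rows = []
    for values in matrix.values():
        logged = [math.log2(v + 1) for v in values]
        mean = sum(logged) / len(logged)
        rows.append([x - mean for x in logged])
    n = len(rows[0])
    # Sample-by-sample Gram matrix of the centred data
    gram = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    # Power iteration for the leading eigenvector
    vec = [float(i + 1) for i in range(n)]
    for _ in range(n_iter):
        new = [sum(gram[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in new))
        vec = [x / norm for x in new]
    eigenvalue = sum(
        vec[i] * sum(gram[i][j] * vec[j] for j in range(n)) for i in range(n)
    )
    return [math.sqrt(eigenvalue) * x for x in vec]


scores = first_pc_scores(counts_by_id)

# Same values under arbitrary row labels: the scores are identical
relabelled = {f"row_{k}": v for k, v in enumerate(counts_by_id.values())}
scores_relabelled = first_pc_scores(relabelled)

for sample, score in zip(samples, scores):
    print(sample, round(score, 3))
```

The row labels are never read by the computation, which is why keeping one row per gene_id costs nothing for sample-level analyses such as PCA.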

As for why not to use gene symbols: the main reason is that if you do not use gene IDs, you will simply sum the expression of all the loci and thereby assume that the different loci produce proteins with exactly the same function. This might be true in some cases, but in general that is too bold an assumption. Furthermore, it might reduce your power for some downstream analyses, since you lose the genomic context of the gene. Lastly, as explained above, for any systems biology analysis gene names are very problematic, since for historical reasons they often refer to multiple non-identical protein products.
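A toy example of what summing by symbol can hide, using invented numbers and placeholder IDs (the helper `sum_by_name` is hypothetical): two loci that share a name change in opposite directions between two conditions, and the summed row looks unchanged.

```python
from collections import defaultdict

# Placeholder IDs and invented counts for two conditions: [control, treated]
id_to_name = {
    "ENSG90000000001": "RF00019",
    "ENSG90000000002": "RF00019",
    "ENSG90000000003": "GENE_A",
}
counts_by_id = {
    "ENSG90000000001": [100, 10],  # goes down
    "ENSG90000000002": [10, 100],  # goes up
    "ENSG90000000003": [50, 55],
}


def sum_by_name(counts, names):
    """Collapse a gene_id-keyed count table to gene names by summing rows."""
    summed = defaultdict(lambda: [0] * len(next(iter(counts.values()))))
    for gene_id, values in counts.items():
        row = summed[names[gene_id]]
        for k, value in enumerate(values):
            row[k] += value
    return dict(summed)


counts_by_name = sum_by_name(counts_by_id, id_to_name)
print(counts_by_name["RF00019"])
```

In the gene_id table both loci show a tenfold change; after summing, the RF00019 row is identical in both conditions, so the signal is lost.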