Question: How to deal with one gene symbol matching multiple Ensembl IDs?
0
gravatar for dz2353
6 weeks ago by
dz235350
dz235350 wrote:

Dear friends,

I am a rookie in RNA-Seq analysis, so I am really confused by Ensembl IDs and gene symbols. I have checked some related posts on Biostars but did not find what I was looking for. The problem is that I noticed one Ensembl ID matches multiple gene symbols, and I do not know how to deal with this when I want to do an analysis based on gene symbols. Should I add the counts of the same symbol together? Thanks in advance for the reply!

rna-seq ensembl • 225 views
modified 6 weeks ago by kristoffer.vittingseerup • written 6 weeks ago by dz2353

Hello dz2353,

Could you please post an example?

fin swimmer

written 6 weeks ago by finswimmer

Are you sure it's "multiple Ensembl IDs for a gene symbol" and not "multiple gene symbols for an Ensembl ID"?

written 6 weeks ago by arup

Sorry, I made a mistake, and thanks to @arup for the comment. I have corrected my question: an Ensembl ID matches multiple gene symbols. I have posted a picture of my data, where you can see that there are many RF00019 entries. Should I add all of them together? https://ibb.co/9T1xkBD

modified 6 weeks ago • written 6 weeks ago by dz2353

The issue is the part after the decimal point in your ENSG IDs. Remove it.

written 6 weeks ago by krushnach80440

I know that the part after the decimal point represents the version. So what I need to do is delete the part after the decimal point and then convert the Ensembl IDs to gene symbols? Thanks in advance for the reply!

written 6 weeks ago by dz2353

Yes, that's what you have to do.
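
A minimal sketch of that step, assuming the IDs are plain strings with an optional trailing `.<number>` version suffix. The IDs below are made up for illustration, and `strip_version` is a hypothetical helper name, not part of any library:

```python
import re

# Matches a trailing ".<digits>" version suffix, e.g. ".5" in "ENSG00000000001.5".
_VERSION_SUFFIX = re.compile(r"\.\d+$")


def strip_version(ensembl_id: str) -> str:
    """Return the Ensembl ID without its trailing version suffix, if any."""
    return _VERSION_SUFFIX.sub("", ensembl_id)


# Made-up IDs for illustration only.
versioned_ids = ["ENSG00000000001.5", "ENSG00000000002.12", "ENSG00000000003"]
stable_ids = [strip_version(i) for i in versioned_ids]
print(stable_ids)
```

Note that this only changes the format of the IDs. Two IDs that differ before the suffix are still two different genes, even if they end up with the same symbol after conversion.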

written 6 weeks ago by krushnach80440

Thank you! I will try it.

written 6 weeks ago by dz2353
kristoffer.vittingseerup1.2k wrote:

You have run into the problem that in the human genome there are gene_names that are associated with multiple genomic loci (RF00019 in the link you posted). Since they are associated with different loci, they also have different gene_ids. Last time I checked there were ~100 such gene_names in the human genome, many of which are located on different chromosomes.
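
To see which gene_names are affected in your own annotation, something like the following could work. This is only a sketch: the `(gene_id, gene_name)` pairs are made up, and in practice they would come from your GTF file or mapping table. `names_with_multiple_ids` is a hypothetical helper name.

```python
from collections import defaultdict

# Hypothetical (gene_id, gene_name) pairs; real ones would come from the annotation.
annotation = [
    ("ENSG00000000001", "RF00019"),
    ("ENSG00000000002", "RF00019"),
    ("ENSG00000000003", "GENE_A"),
    ("ENSG00000000004", "GENE_B"),
]


def names_with_multiple_ids(pairs):
    """Map each gene_name that has more than one gene_id to its sorted gene_ids."""
    ids_by_name = defaultdict(set)
    for gene_id, gene_name in pairs:
        ids_by_name[gene_name].add(gene_id)
    return {name: sorted(ids) for name, ids in ids_by_name.items() if len(ids) > 1}


ambiguous = names_with_multiple_ids(annotation)
print(ambiguous)
```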

I would always analyze the data with gene_ids (!), simply because otherwise you assume the different loci produce identical products, which might or might not be the case. Furthermore, a gene_id-based analysis lets you analyze additional things such as regulation and isoform switches. Lastly, if you want to do any downstream analysis (GO terms, gene-set enrichment analysis, etc.) you should NEVER use gene_names. The problem is that in many cases gene_names are too unspecific, with many different gene names pointing to the same gene and multiple genes all pointed to by a single gene name.

Cheers, Kristoffer

written 6 weeks ago by kristoffer.vittingseerup

Thanks for your answer, Kristoffer. So what you mean is that I should use gene IDs instead of gene names or symbols? I still have some points that are unclear. You mentioned that you always analyze the data with gene IDs (here, the Ensembl IDs), so what kind of analysis do you do? I want to perform a PCA on these data to examine the structure of the samples, that is, to see whether samples of the same kind cluster together. And could you please give me a more detailed explanation of why not to use gene names (symbols)?

written 6 weeks ago by dz2353

Yes, just use gene_ids (such as Ensembl IDs). That way you get an expression matrix just like normal (except slightly larger), which you can use for all downstream analysis. Which "name" you assign to a gene (gene name, symbol or ID) does not matter for the downstream analysis.

With regard to why not to use gene symbols: the main reason is that if you do not use gene IDs you will just sum up the expression of all the loci and thereby assume the different loci produce proteins with the exact same function. This might be true in some cases, but in general that is too bold an assumption. Furthermore, it might reduce your power in some downstream analyses, since you lose the genomic context of the gene. Lastly, as explained above, for any systems biology analysis gene names are very problematic, since they often refer to multiple non-identical protein products for historical reasons.
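
To make the first point concrete, here is a small sketch of what collapsing by gene name does to a count table. All IDs, names and counts are invented for illustration, and `collapse_by_name` is a hypothetical helper, not a function from any package:

```python
from collections import defaultdict

# Hypothetical counts for one sample, keyed by gene_id (all values invented).
counts_by_id = {
    "ENSG00000000001": 10,
    "ENSG00000000002": 250,
    "ENSG00000000003": 40,
}

# Hypothetical gene_id -> gene_name mapping; two loci share the name RF00019.
id_to_name = {
    "ENSG00000000001": "RF00019",
    "ENSG00000000002": "RF00019",
    "ENSG00000000003": "GENE_A",
}


def collapse_by_name(counts, names):
    """Sum counts over all gene_ids that share a gene_name."""
    totals = defaultdict(int)
    for gene_id, count in counts.items():
        totals[names[gene_id]] += count
    return dict(totals)


counts_by_name = collapse_by_name(counts_by_id, id_to_name)
print(len(counts_by_id), "rows keyed by gene_id")
print(len(counts_by_name), "rows keyed by gene name")
print(counts_by_name["RF00019"])
```

After collapsing, the two loci can no longer be told apart: a lowly expressed locus and a highly expressed one become a single row. Keeping gene_id as the row key and carrying the gene name along as an annotation column avoids that.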

written 6 weeks ago by kristoffer.vittingseerup

Wow, that is much clearer. I really appreciate it!

written 6 weeks ago by dz2353