Question: How to deal with the case that one gene symbol matches multiple ensembl ids?
dz235380 wrote:

Dear friends,

I am a rookie in RNA-Seq analysis and I am confused about Ensembl IDs and gene symbols. I have checked some related posts on Biostars but did not find what I was looking for. The problem is that I noticed one Ensembl ID matches multiple gene symbols, and I do not know how to deal with this when I want to do an analysis based on gene symbols. Should I add the counts for the same symbol together? Thanks in advance for the reply!

rna-seq ensembl • 826 views
ADD COMMENTlink modified 11 months ago by kristoffer.vittingseerup2.5k • written 11 months ago by dz235380

Hello dz2353 ,

could you please post an example?

fin swimmer

ADD REPLYlink written 11 months ago by finswimmer12k

Are you sure? Is it "multiple Ensembl IDs for a gene symbol" or "multiple gene symbols for an Ensembl ID"?

ADD REPLYlink written 11 months ago by arup1.9k

Sorry, I made a mistake. Thanks to @arup for the comment. I have corrected my question: an Ensembl ID matches multiple gene symbols. I have posted a picture of my data, and you can see that there are many entries for RF00019. Should I add all of them together? https://ibb.co/9T1xkBD

ADD REPLYlink modified 11 months ago • written 11 months ago by dz235380

The issue is the decimal point and the digits after it in your ENSG IDs. Remove them.

ADD REPLYlink written 11 months ago by krushnach80610

I know that the part after the decimal point represents the version. So what I need to do is delete the part after the decimal point and then convert the Ensembl IDs to gene symbols? Thanks in advance for the reply!

ADD REPLYlink written 11 months ago by dz235380

Yes, that's what you have to do.
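
As a rough illustration of that step, here is a minimal Python sketch that drops the version suffix from Ensembl gene IDs. The function name `strip_version` and the example IDs are only placeholders for illustration, not something taken from the poster's data.

```python
def strip_version(ensembl_id: str) -> str:
    """Return the Ensembl ID without a trailing ".<version>" suffix."""
    # Everything before the first "." is kept; IDs without a version
    # pass through unchanged.
    return ensembl_id.split(".", 1)[0]


# Illustrative IDs only (assumed, not from the original post).
versioned_ids = ["ENSG00000223972.5", "ENSG00000227232.5", "ENSG00000278267"]
stable_ids = [strip_version(i) for i in versioned_ids]
print(stable_ids)
```

The unversioned IDs can then be looked up in whatever ID-to-symbol table you use. Note that this only removes the version; it does not by itself resolve symbols that are shared by several IDs.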

ADD REPLYlink written 11 months ago by krushnach80610

Thank you! I will try it.

ADD REPLYlink written 11 months ago by dz235380
kristoffer.vittingseerup2.5k wrote:

You have run into the problem that in the human genome there are gene_names that are associated with multiple genomic loci (RF00019 in the link you posted). Since they are associated with different loci, they also have different gene_ids. Last time I checked there were ~100 such gene_names in the human genome, many of which are located on different chromosomes.

I would always analyze the data with gene_ids (!), simply because otherwise you assume the different loci produce identical products, which might or might not be the case. Furthermore, a gene_id-based analysis lets you analyze additional things such as regulation and isoform switches. Lastly, if you want to do any downstream analysis (GO terms, gene-set enrichment analysis, etc.) you should NEVER use gene_names. The problem is that in many cases gene_names are too unspecific: many different gene names can point to the same gene, and a single gene name can point to multiple genes.
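
To check how many gene_names in your own annotation are affected, something like the following Python sketch could be used. The `annotation` list is a made-up stand-in (the `ENSG_A`-style IDs and `GENE1` are hypothetical); in practice the (gene_id, gene_name) pairs would come from the annotation file used for quantification.

```python
from collections import defaultdict

# Hypothetical (gene_id, gene_name) pairs standing in for a real annotation.
annotation = [
    ("ENSG_A", "RF00019"),
    ("ENSG_B", "RF00019"),
    ("ENSG_C", "RF00019"),
    ("ENSG_D", "GENE1"),
]

# Collect every gene_id seen for each gene_name.
ids_by_name = defaultdict(set)
for gene_id, gene_name in annotation:
    ids_by_name[gene_name].add(gene_id)

# Keep only the names that point to more than one gene_id.
ambiguous = {
    name: sorted(ids) for name, ids in ids_by_name.items() if len(ids) > 1
}
print(ambiguous)
```

Any name that shows up in `ambiguous` is one where a symbol-based analysis would silently merge several loci.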

Cheers Kristoffer

ADD COMMENTlink written 11 months ago by kristoffer.vittingseerup2.5k

Thanks for your answer, Kristoffer. So you mean that I should use gene IDs instead of gene names or symbols? Some points are still unclear to me. You mentioned that you always analyze the data with gene IDs (here, the Ensembl IDs), so what kind of analysis do you do? I want to perform a PCA on these data to look at the structure of the samples, that is, to see whether samples of the same kind cluster together. Could you also give me a more detailed explanation of why not to use gene names (symbols)?

ADD REPLYlink written 11 months ago by dz235380

Yes, just use gene_ids (such as Ensembl IDs). With those you get an expression matrix just like normal (except slightly larger), which you can use for all downstream analysis. Which "name" you assign to a gene (gene name, symbol or ID) does not matter for the downstream analysis.

With regard to why not to use gene symbols, the main reason is that if you do not use gene IDs you will just sum up the expression of all the loci and thereby assume the different loci produce proteins with the exact same function. This might be true in some cases, but in general that is too bold an assumption. Furthermore, it might reduce your power in some downstream analyses since you lose the genomic context of the gene. Lastly, as explained above, for any systems biology analysis gene names are very problematic since, for historical reasons, they often refer to multiple non-identical protein products.
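
A toy Python example of what summing by symbol can hide. All IDs and counts here are invented for illustration: two hypothetical loci share the symbol RF00019 and have opposite expression patterns across two samples.

```python
from collections import defaultdict

# Hypothetical counts per gene_id; each list is [sample_1, sample_2].
counts_by_id = {
    "ENSG_A": [100, 0],
    "ENSG_B": [0, 100],
}
# Both hypothetical loci carry the same gene symbol.
id_to_symbol = {"ENSG_A": "RF00019", "ENSG_B": "RF00019"}

n_samples = 2
counts_by_symbol = defaultdict(lambda: [0] * n_samples)
for gene_id, counts in counts_by_id.items():
    symbol = id_to_symbol[gene_id]
    # Add this locus' counts to the running total for its symbol.
    counts_by_symbol[symbol] = [
        total + c for total, c in zip(counts_by_symbol[symbol], counts)
    ]

print(dict(counts_by_symbol))
```

At the gene_id level both loci differ strongly between the samples, but the summed symbol-level row is identical in both samples, so that difference is no longer visible.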

ADD REPLYlink written 11 months ago by kristoffer.vittingseerup2.5k

Wow, that is much clearer. I really appreciate it!

ADD REPLYlink written 11 months ago by dz235380