Question: Alternative sequence gene Ensembl ID
3.3 years ago by
bharata1803 wrote:

Hello all,


I want to cluster a group of genes with the dist and hclust functions in R. The data matrix contains raw counts, and I know the gene symbols for the genes I want to cluster. Because the matrix uses Ensembl IDs, I used BioMart to translate the gene symbols into Ensembl IDs. At that point I realized that several genes are duplicated, with different Ensembl IDs for the same symbol. I checked those duplicated genes one by one and decided to remove the alternative sequence genes, so I now have a gene list of Ensembl IDs with no duplicates and no alternative sequences. The problem appeared when I compared the hierarchical clustering results for the list before and after removing the alternative sequence genes: the results are so different that they would probably change the meaning of my analysis. My question is: should I filter out the alternative sequences or not? When I check the Entrez IDs, only one Entrez ID corresponds to each gene symbol, so the problem is on the Ensembl ID side. Thank you all.

3.3 years ago by
EMBL Heidelberg, Germany
Jean-Karim Heriche wrote:

In Ensembl, one gene symbol can map to multiple gene IDs for different reasons, the most common being haplotype variants and duplicated genes. You have to decide whether or not you're interested in these. The question to ask yourself is: what is a gene in your context? For example, is a gene a genomic locus, or a set of loci producing the same or similar products? Depending on the answer, you may want to filter your data or merge the data related to the same gene.
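Before deciding whether to filter or merge, it can help to list which symbols actually map to more than one Ensembl ID. The following is only a minimal sketch in Python, not a BioMart recipe: the `mapping` list, the gene names, the placeholder IDs and the function names are all made up for illustration, and it assumes the symbol-to-ID table has already been exported as (symbol, Ensembl ID) pairs.

```python
from collections import defaultdict

# Hypothetical symbol-to-Ensembl table, e.g. exported from BioMart.
# Gene names and IDs are placeholders, not real Ensembl identifiers.
mapping = [
    ("GENE_A", "ENSG_PLACEHOLDER_1"),
    ("GENE_A", "ENSG_PLACEHOLDER_2"),  # e.g. a copy on an alternative haplotype
    ("GENE_B", "ENSG_PLACEHOLDER_3"),
]


def ids_per_symbol(pairs):
    """Group Ensembl gene IDs by gene symbol."""
    grouped = defaultdict(set)
    for symbol, ensembl_id in pairs:
        grouped[symbol].add(ensembl_id)
    return grouped


def multi_mapped(pairs):
    """Return only the symbols that map to more than one Ensembl gene ID."""
    return {
        symbol: sorted(ids)
        for symbol, ids in ids_per_symbol(pairs).items()
        if len(ids) > 1
    }


print(multi_mapped(mapping))
```

The symbols reported here are the ones for which a choice between filtering and merging has to be made; all other symbols are unaffected by that choice.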


Well, what I want to do is get the genes from a certain GO term or pathway, as in KEGG. So I start with the Entrez IDs from KEGG and try to map them to Ensembl IDs. What I don't know is which definition of a gene GO or KEGG uses. What is the usual way to do this? I think mapping genes to GO terms or pathways is a common step in RNA-seq analysis, right?

Reply written 3.3 years ago by bharata1803

In this case, the genes stand for their products (most often proteins), so you don't care about variants. You should then summarize the data for each gene, i.e. collapse all the variants of a gene into the same gene. For example, all Ensembl IDs that map to the same Entrez ID/gene symbol could be considered the same gene.
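One possible way to do this collapsing is to sum the raw counts of all Ensembl IDs that share a gene symbol. This is a minimal sketch in Python under that assumption: summing is only one choice of summary (it is not specified above), and the toy matrix, the placeholder IDs and the function name `collapse_by_gene` are hypothetical.

```python
# Toy raw-count matrix: Ensembl ID -> counts per sample.
# IDs and values are placeholders for illustration only.
counts = {
    "ENSG_PLACEHOLDER_1": [10, 0, 5],
    "ENSG_PLACEHOLDER_2": [2, 3, 0],
    "ENSG_PLACEHOLDER_3": [7, 7, 7],
}

# Hypothetical Ensembl ID -> gene symbol (or Entrez ID) lookup.
id_to_symbol = {
    "ENSG_PLACEHOLDER_1": "GENE_A",
    "ENSG_PLACEHOLDER_2": "GENE_A",
    "ENSG_PLACEHOLDER_3": "GENE_B",
}


def collapse_by_gene(counts, id_to_symbol):
    """Sum the count rows of all Ensembl IDs that map to the same gene."""
    collapsed = {}
    for ensembl_id, row in counts.items():
        gene = id_to_symbol.get(ensembl_id)
        if gene is None:
            continue  # assumption: IDs without a mapping are dropped
        if gene not in collapsed:
            collapsed[gene] = list(row)
        else:
            collapsed[gene] = [a + b for a, b in zip(collapsed[gene], row)]
    return collapsed


print(collapse_by_gene(counts, id_to_symbol))
```

The collapsed matrix has one row per gene and can then be passed to the clustering step. In R, the base function `rowsum()` computes the same kind of grouped row sums on a matrix.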

Reply written 3.3 years ago by Jean-Karim Heriche