Alternative sequence gene Ensembl ID
8.3 years ago
bharata1803 ▴ 560

Hello all,

I want to cluster a group of genes with the dist and hclust functions in R. The data matrix contains raw counts. I know the gene symbols for the genes I want to cluster. Because the matrix is indexed by Ensembl ID, I used BioMart to translate the gene symbols into Ensembl IDs. At this point, I realized that several genes are duplicated, with different Ensembl IDs for the same symbol. I checked those duplicated genes one by one and decided to remove the alternative sequence genes, so I now have a list of Ensembl IDs with no duplicates and no alternative sequences.

The problem came when I compared the hierarchical clustering results for the list before and after removing the alternative sequence genes. The results are so different that they will probably change the meaning of my analysis. My question is: should I filter out the alternative sequences or not? If I check the Entrez ID, only one Entrez ID corresponds to each gene symbol, so the problem is in the Ensembl IDs. Thank you all.

8.3 years ago

In Ensembl, one gene symbol can map to multiple gene IDs for different reasons, the most common being haplotype variants and duplicated genes. You have to decide whether or not you're interested in these. The question to ask yourself is: what is a gene in your context? For example, is a gene a genomic locus, or a set of loci producing the same or similar products? Depending on the answer, you may want to filter your data or merge the data related to the same gene.
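Before choosing between filtering and merging, it can help to list exactly which symbols are affected. Below is a minimal sketch in Python (standard library only), assuming the symbol-to-Ensembl mapping has already been exported from BioMart as pairs. The symbols and IDs are made-up placeholders, and the function names `ids_per_symbol` and `ambiguous_symbols` are hypothetical.

```python
from collections import defaultdict

# Hypothetical (gene symbol, Ensembl ID) pairs, e.g. read from a BioMart export.
# These IDs are placeholders for illustration, not real Ensembl records.
mapping = [
    ("GENE_A", "ENSG_PLACEHOLDER_1"),
    ("GENE_A", "ENSG_PLACEHOLDER_2"),
    ("GENE_B", "ENSG_PLACEHOLDER_3"),
]


def ids_per_symbol(pairs):
    """Group Ensembl IDs by gene symbol, keeping first-seen order."""
    grouped = defaultdict(list)
    for symbol, ensembl_id in pairs:
        if ensembl_id not in grouped[symbol]:
            grouped[symbol].append(ensembl_id)
    return dict(grouped)


def ambiguous_symbols(pairs):
    """Return only the symbols that map to more than one Ensembl ID."""
    grouped = ids_per_symbol(pairs)
    return {symbol: ids for symbol, ids in grouped.items() if len(ids) > 1}


print(ambiguous_symbols(mapping))
```

The symbols returned here are the ones where a decision is needed; everything else maps one-to-one and is unaffected by either choice.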


Well, what I want to do is get the genes from a certain GO term or a pathway, like in KEGG. So I start with Entrez IDs from KEGG and try to map them to Ensembl IDs. What I don't know is what definition of a gene GO or KEGG uses. What is the usual way to do this? I think mapping genes to GO terms or pathways is a common step in RNA-seq analysis, right?


In this case, the genes stand for their products (most often proteins), so you don't care about variants. You should then summarize the data for each gene, i.e. collapse all the variants of a gene into a single gene. For example, all Ensembl IDs that map to the same Entrez ID or gene symbol could be considered the same gene.
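One possible way to do the collapsing is sketched below in Python (standard library only). It assumes the raw counts are held per Ensembl ID and that summing counts is an acceptable way to merge rows; summing is an assumption here, one option among several, and not something the thread prescribes. All IDs, the sample values, and the function name `collapse_counts` are hypothetical.

```python
# Hypothetical raw counts: one row per Ensembl ID, one value per sample.
counts = {
    "ENSG_PLACEHOLDER_1": [10, 0, 5],
    "ENSG_PLACEHOLDER_2": [2, 3, 0],
    "ENSG_PLACEHOLDER_3": [7, 7, 7],
}

# Hypothetical Ensembl ID -> Entrez ID mapping (placeholder identifiers).
ensembl_to_entrez = {
    "ENSG_PLACEHOLDER_1": "entrez_1",
    "ENSG_PLACEHOLDER_2": "entrez_1",
    "ENSG_PLACEHOLDER_3": "entrez_2",
}


def collapse_counts(counts, id_map):
    """Sum the count rows of all Ensembl IDs that share the same target ID.

    Ensembl IDs missing from id_map are skipped.
    """
    collapsed = {}
    for ensembl_id, row in counts.items():
        gene = id_map.get(ensembl_id)
        if gene is None:
            continue  # no mapping available for this row
        if gene not in collapsed:
            collapsed[gene] = list(row)  # copy so the input is not modified
        else:
            collapsed[gene] = [a + b for a, b in zip(collapsed[gene], row)]
    return collapsed


gene_counts = collapse_counts(counts, ensembl_to_entrez)
print(gene_counts)
```

The collapsed table has one row per gene, so it can be passed on to the distance and clustering steps without any symbol appearing twice.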
