Why am I getting different Ensembl gene IDs for a given gene symbol?
3
7
7.1 years ago
ravihansa82 ▴ 100

Dear friends,

I have a set of gene symbols. When I convert them to Ensembl gene IDs, I get several different gene IDs for a given gene symbol instead of one. Why does this happen?

genome gene sequence ensembl • 15k views
1

Can you give us an example please?

0

Thank you for the reply. Here is an example.

I used the gene symbol "AGPAT1" and converted it to an Ensembl Gene ID using the online BioMart tool. It gave me seven different Ensembl Gene IDs, as follows.

HGNC symbol     Ensembl Gene ID
AGPAT1  ENSG00000228892
AGPAT1  ENSG00000235758
AGPAT1  ENSG00000227642
AGPAT1  ENSG00000204310
AGPAT1  ENSG00000236873
AGPAT1  ENSG00000226467
AGPAT1  ENSG00000206324
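
For anyone who wants to check a longer symbol list for the same behaviour, here is a minimal Python sketch. It assumes the BioMart result has been saved as two whitespace-separated columns (symbol, Ensembl Gene ID); the function name `group_ids_by_symbol` is made up for this illustration and is not part of BioMart.

```python
from collections import defaultdict

# Rows copied from the BioMart result above: HGNC symbol, Ensembl Gene ID.
biomart_rows = """\
AGPAT1  ENSG00000228892
AGPAT1  ENSG00000235758
AGPAT1  ENSG00000227642
AGPAT1  ENSG00000204310
AGPAT1  ENSG00000236873
AGPAT1  ENSG00000226467
AGPAT1  ENSG00000206324
"""

def group_ids_by_symbol(text):
    """Map each gene symbol to the list of Ensembl gene IDs reported for it."""
    ids_by_symbol = defaultdict(list)
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        symbol, gene_id = fields
        ids_by_symbol[symbol].append(gene_id)
    return dict(ids_by_symbol)

ids_by_symbol = group_ids_by_symbol(biomart_rows)

# Keep only the symbols that map to more than one gene ID.
multi_mapped = {s: ids for s, ids in ids_by_symbol.items() if len(ids) > 1}

for symbol, ids in multi_mapped.items():
    print(symbol, len(ids), ", ".join(ids))
```

This only flags the symbols that need a closer look; it does not say which of the IDs to use.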

0

Yes friend, I selected the taxon carefully. The problem is that I need to extract intron sequences from a set of genes. When I convert the gene symbols to Ensembl Gene IDs, some symbols return several different Ensembl Gene IDs.

18
7.1 years ago
Emily 23k

The gene you're looking at, AGPAT1, is found on a haplotypic region. Haplotypes are regions of the genome which have two or more versions, which we find in full in different individuals. These may have the same genes in a different order, or even different genes. We have a help video explaining this here.

AGPAT1 is found in the haplotypic MHC region, for which the genome has nine possible versions, and the gene is found in seven of those nine. You can see all the possible Ensembl IDs for the different versions of AGPAT1 here.

3

This could be a problem when quantifying RNA-Seq data with htseq/featureCounts if there are multiple "gene_id" values for the same "gene_name": the reads will fall into the ambiguous category, i.e. they overlap multiple genes.
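
One way to see which gene names are affected before quantifying is to scan the annotation's attribute column. The sketch below is only an illustration: it assumes Ensembl/GENCODE-style GTF attributes of the form `gene_id "..."; gene_name "...";`, the helper name `names_with_multiple_ids` is made up, and the last example entry is a placeholder, not a real gene.

```python
import re
from collections import defaultdict

# Matches key "value"; pairs as written in the GTF attribute column.
ATTR_RE = re.compile(r'(\S+) "([^"]*)";')

def names_with_multiple_ids(attribute_strings):
    """Return {gene_name: sorted gene_ids} for names that have more than one gene_id."""
    ids_by_name = defaultdict(set)
    for attrs in attribute_strings:
        fields = dict(ATTR_RE.findall(attrs))
        if "gene_id" in fields and "gene_name" in fields:
            ids_by_name[fields["gene_name"]].add(fields["gene_id"])
    return {name: sorted(ids) for name, ids in ids_by_name.items() if len(ids) > 1}

# Attribute strings only; the two AGPAT1 IDs come from this thread.
example_attributes = [
    'gene_id "ENSG00000204310"; gene_name "AGPAT1";',
    'gene_id "ENSG00000228892"; gene_name "AGPAT1";',
    'gene_id "PLACEHOLDER_ID"; gene_name "PLACEHOLDER_GENE";',  # made-up entry
]

duplicated = names_with_multiple_ids(example_attributes)
print(duplicated)
```

With a real GTF you would pass in the ninth tab-separated field of each non-comment line instead of the hand-written list.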

0

I have this problem too and don't know how to handle it.

0

Nice video - so approximately how many proteins are multiplexed in this way?

2

In the current database, 661. Some will only have two members, others like AGPAT1 have lots. One haplotype set has 36 different versions on chromosome 19.

At the moment the current human genome, GRCh38, only has haplotypes, but the GRC has already started making patches to repair misassembled or gapped genomic regions. We will bring these in and annotate them, so we'll be looking at more duplicate genes. However, in the case of patches, the gene on the patch is good and the gene on the primary is dodgy. This is different to haplotypes, where all genes are equally valid.

0

Thank you for the explanation. So genes that fall in a haplotypic region have different versions, and each version is given a different Ensembl gene ID. Am I correct?

1

That is correct.

0

Dear Emily,

I need one more explanation. I extracted the first intron of a gene which falls in a haplotype region. Suppose it has seven haplotypes, so I got seven first-intron sequences. Four of the seven had the same length, but the rest had different lengths. Can I consider the latter sequences in such a haplotype region to be different genes?

1

The haplotypes are different to each other. Expansion/contraction of an intron between haplotypes is unsurprising. I would consider them to be the same gene if the cDNA is the same, not the introns.

0

@Emily, can you please suggest whether the two or more Ensembl IDs (and ultimately their sequences) can be used interchangeably?

0

It depends how different they are and what you need to do with them. I'd put the cDNA sequences into CLUSTAL and see if they are interchangeable or not.

0

If we have genes with the same HUGO IDs but different Ensembl IDs, does it make sense to add up their raw counts (for RNA expression or single-cell analysis)? Does it make sense to treat them as isoforms?

0

You should make a new post on BioStars for this question – you'll get a lot more answers.

2
7.1 years ago
Brice Sarver ★ 3.7k

You may be receiving IDs from other species, like NCBI's BRCA1 example. Impossible to tell without more information.

0

I used HGNC gene symbols. For example, I converted the gene symbol AGPAT1 to an Ensembl Gene ID using the online BioMart tool. As a result, it gave me seven different Ensembl Gene IDs, as follows.

HGNC symbol     Ensembl Gene ID
AGPAT1  ENSG00000228892
AGPAT1  ENSG00000235758
AGPAT1  ENSG00000227642
AGPAT1  ENSG00000204310
AGPAT1  ENSG00000236873
AGPAT1  ENSG00000226467
AGPAT1  ENSG00000206324

0

It is quite possible. For example, the Y_RNA gene name has several different ENSG IDs, each located on a different chromosome (chr1, 3, 4, 12, 14, 20, X).

That's why, whenever you start an analysis, you should pick one transcript/gene annotation, for example Gencode or Ensembl. Also treat the ENSG IDs as your reference IDs until the end of the analysis, to avoid redundant identifiers such as gene names/symbols.

0

Actually, it is possible to say which species since these are Ensembl IDs - these are all human. Other species would have a different prefix ("ENSRNOG" for rat genes, "ENSMUSG" for mouse, etc.).
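
As a rough illustration of the prefix check, here is a small Python sketch. The lookup table covers only the three prefixes named in this thread (Ensembl uses many more), and the function name `species_from_gene_id` is made up for the example. It also tolerates a trailing version suffix such as `.12`, which is an assumption about how your IDs are written.

```python
import re

# Only the prefixes mentioned in this thread; not a complete list.
SPECIES_BY_PREFIX = {
    "ENSG": "human",
    "ENSMUSG": "mouse",
    "ENSRNOG": "rat",
}

# Prefix ("ENS" + optional species letters + "G"), digits, optional ".version".
GENE_ID_RE = re.compile(r"^(ENS[A-Z]*G)\d+(\.\d+)?$")

def species_from_gene_id(gene_id):
    """Return the species for a known gene ID prefix, or None if unrecognised."""
    match = GENE_ID_RE.match(gene_id)
    if match is None:
        return None
    return SPECIES_BY_PREFIX.get(match.group(1))

agpat1_ids = [
    "ENSG00000228892", "ENSG00000235758", "ENSG00000227642",
    "ENSG00000204310", "ENSG00000236873", "ENSG00000226467",
    "ENSG00000206324",
]

# All seven IDs from the question carry the human prefix.
species_found = {species_from_gene_id(gene_id) for gene_id in agpat1_ids}
print(species_found)
```

An ID with an unknown prefix returns `None` rather than a guess, so unexpected species show up clearly in a mixed list.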

2
7.1 years ago
EagleEye 7.2k
When you map IDs, always be careful to choose the right taxon. Example: Homo sapiens 9606, Mus musculus 10090.
0

Mouse IDs have a different prefix, it's "ENSMUSG.." not "ENSG.."