Ensembl API - pseudoautosomal regions (PAR)
7.4 years ago
JS • 0

I access Ensembl data through the Perl API to retrieve information on genes, transcripts, etc. I noticed that when I pull data from the gene table, some genes occur twice: once on the X chromosome and once on the Y chromosome. This affects 45 human genes; for 34 of the 45, the start and end positions on X and Y are identical.

Two examples:

geneID           biotype         chromosome  start      end
ENSG00000002586  protein_coding  X           2691179    2741309
ENSG00000002586  protein_coding  Y           2691179    2741309
ENSG00000124333  protein_coding  X           155881293  155943769
ENSG00000124333  protein_coding  Y           57067813   57130289

When I looked up some of these genes on the Ensembl website, it turned out that they map to pseudoautosomal regions (identical sequence on X and Y).

Some more information on how I retrieve the data:

To speed things up, I iterate over the chromosomes in parallel and retrieve all genes on each one as follows:

my $slice = $slice_adaptor->fetch_by_region('chromosome', $chr_name);
my @genes = @{ $slice->get_all_Genes() };

So ENSG00000002586 is in @genes both when I query X and when I query Y. If I instead fetch a gene by its stable ID, however, I only get the X chromosome:

my $gene_adaptor = $registry->get_adaptor( 'Human', 'Core', 'Gene' );

my $gene = $gene_adaptor->fetch_by_stable_id( 'ENSG00000124333');

print $gene->seq_region_name(); # => X

According to http://lists.ensembl.org/pipermail/dev/2010-October/000214.html, a gene might extend beyond a pseudoautosomal region into a region unique to the Y chromosome, which could explain why a gene shows up on both X and Y. However, I checked this, and there is no overlap between the unique regions of Y and the gene coordinates.

Questions

  • How come the positions are identical for some of the genes?
  • Has anyone observed this as well and figured out why one gets these duplicate gene entries?
7.4 years ago

As you have found, part of the Y chromosome is identical to the X, and these parts are designated pseudoautosomal regions (PARs). Genes in the PARs are stored only on X because, in Ensembl, we do not duplicate them in the identical region of Y (internally there is no Y in that coordinate range). The original and unique annotation is on the X chromosome, but it also appears on Y when you look at it from Y.
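Because the same stable ID comes back from both the X and the Y slice, per-chromosome results can be collapsed to one row per gene after they are merged. The following is only a minimal sketch in plain Perl, not part of the Ensembl API: the hash layout and the sub name `dedupe_par_genes` are hypothetical, and the rows are the two examples from the question. It keeps the X copy, since that is where the annotation actually lives.

```perl
use strict;
use warnings;

# Rows as they might look after merging per-chromosome queries.
# The hash layout is an assumption; values are the examples from the question.
my @rows = (
    { id => 'ENSG00000002586', chr => 'X', start => 2691179,   end => 2741309 },
    { id => 'ENSG00000002586', chr => 'Y', start => 2691179,   end => 2741309 },
    { id => 'ENSG00000124333', chr => 'X', start => 155881293, end => 155943769 },
    { id => 'ENSG00000124333', chr => 'Y', start => 57067813,  end => 57130289 },
);

# Hypothetical helper: keep one row per stable ID, preferring X over Y.
sub dedupe_par_genes {
    my @input = @_;
    my %best;
    for my $row (@input) {
        my $seen = $best{ $row->{id} };
        # Take the row if the ID is new, or if it replaces a Y copy with an X copy.
        if ( !$seen || ( $seen->{chr} eq 'Y' && $row->{chr} eq 'X' ) ) {
            $best{ $row->{id} } = $row;
        }
    }
    # Return the surviving rows sorted by stable ID.
    return map { $best{$_} } sort keys %best;
}

my @unique = dedupe_par_genes(@rows);
printf "%s\t%s\t%d\t%d\n", @{$_}{qw(id chr start end)} for @unique;
```

Genes that occur only on Y are left untouched by this, because a Y row is replaced only when an X row with the same stable ID exists.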

The PARs are mapped between X and Y as: Y:10001-2781479 to X:10001-2781479 and Y:56887903-57217415 to X:155701383-156030895

So any gene from X on the first block will have the same coordinates on Y, but as you are seeing in your second example (ENSG00000124333), this isn’t true for the second block.
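The mapping between the two blocks is a constant offset per block, which can be checked with a few lines of plain Perl. This is just an illustrative sketch: the block coordinates are the ones quoted above, while the array layout and the sub name `y_to_x` are made up for the example and are not Ensembl API calls.

```perl
use strict;
use warnings;

# PAR blocks as quoted above, stored as [Y start, Y end, X start].
my @par_blocks = (
    [ 10001,    2781479,  10001 ],
    [ 56887903, 57217415, 155701383 ],
);

# Hypothetical helper: convert a Y position to the matching X position.
# Returns nothing (undef in scalar context) for positions outside the PARs.
sub y_to_x {
    my ($pos) = @_;
    for my $block (@par_blocks) {
        my ( $y_start, $y_end, $x_start ) = @$block;
        return $pos - $y_start + $x_start
            if $pos >= $y_start && $pos <= $y_end;
    }
    return;
}

# Second example from the question: ENSG00000124333 on Y.
printf "Y:%d-%d maps to X:%d-%d\n",
    57067813, 57130289, y_to_x(57067813), y_to_x(57130289);
# prints "Y:57067813-57130289 maps to X:155881293-155943769"
```

The result matches the X coordinates reported for ENSG00000124333, while a position in the first block, such as the start of ENSG00000002586, maps to itself.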

We import the assembly from the GRC, and this is the same as their representation: https://www.ncbi.nlm.nih.gov/grc/human

Hopefully this helps.
