News: Ensembl Release 104 and newer GTF files no longer have genes sorted by position
10 weeks ago
dlaehnemann ▴ 10

Following up on my previous post, I dug deeper and want to describe my "problem" more precisely. Up to and including Ensembl Release 103, the GTF files provided had all the gene entries sorted by ascending start position within each chromosome (with all the transcript, exon, etc. entries pertaining to a gene listed right after its gene entry, though not necessarily in strict position sort order themselves). You can double-check the gene sort order with:

wget https://ftp.ensembl.org/pub/release-103/gtf/homo_sapiens/Homo_sapiens.GRCh38.103.gtf.gz
zgrep -v "^#" Homo_sapiens.GRCh38.103.gtf.gz | awk 'BEGIN { prev_start = 0; prev_chr = "nothing"; jumps = 0 } { if ($3 == "gene") { if ( $4 < prev_start && $1 == prev_chr ) {jumps += 1}; prev_chr = $1; prev_start = $4 } } END {print jumps}'

This returns 0, meaning that you never have a gene start position jump back within a chromosome, so they are in ascending position sort order within each chromosome.

However, starting with Ensembl release 104, there are varying numbers of genes (somewhere in the range of 17,000 - 21,000) that appear later in the GTF than a strictly ascending gene start position sort would have them. For example, the following returns 20455:

wget https://ftp.ensembl.org/pub/release-104/gtf/homo_sapiens/Homo_sapiens.GRCh38.104.gtf.gz
zgrep -v "^#" Homo_sapiens.GRCh38.104.gtf.gz | awk 'BEGIN { prev_start = 0; prev_chr = "nothing"; jumps = 0 } { if ($3 == "gene") { if ( $4 < prev_start && $1 == prev_chr ) {jumps += 1}; prev_chr = $1; prev_start = $4 } } END {print jumps}'

Ensembl release 105 has 19869 genes out of order, and the latest release, 108, has 17591.
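
For anyone whose tooling depends on ascending gene order, the same check can be turned into a guard that fails at the first offending gene instead of counting all of them. This is only a sketch: the function name `assert_gtf_genes_sorted` is made up for illustration, and it assumes an uncompressed GTF on stdin with the feature type in column 3 and the start position in column 4.

```shell
# Hypothetical helper: exit non-zero at the first gene whose start
# position is smaller than that of the previous gene on the same chromosome.
assert_gtf_genes_sorted() {
  awk -F'\t' '
    /^#/ { next }                       # skip header comment lines
    $3 == "gene" {
      if ($1 == prev_chr && $4 + 0 < prev_start) {
        printf "gene out of order at line %d: %s:%d follows %s:%d\n", NR, $1, $4, prev_chr, prev_start > "/dev/stderr"
        bad = 1
        exit                            # jumps to END, which sets the exit status
      }
      prev_chr = $1
      prev_start = $4 + 0
    }
    END { exit bad }
  '
}

# Possible usage (file name taken from the download above):
#   zcat Homo_sapiens.GRCh38.104.gtf.gz | assert_gtf_genes_sorted || echo "genes are not sorted"
```

Given the counts above, this should pass for release 103 and fail for releases 104 and newer.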

To me, it seems likely that this came with updates to the gene set and the transcript annotations in Release 104, although I was not able to track down what might have changed in the GTF pipeline or if this was somehow done manually in the background.

I just wanted to notify others of this change, as this tripped me up and took very long to track down.

Background: A tool I am using was assuming both the gene sort order (which, alas, it did not assert) and the Ensembl order of records within a gene to be stable. To my knowledge, there's no simple solution with an existing tool for getting the genes back into sorted order while also preserving the standard Ensembl record order within a gene, so we'll have to change how this tool parses GTF records...
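
As a possible stopgap, the gene blocks can be re-sorted with standard Unix tools using a decorate-sort-undecorate approach. The following is a rough sketch rather than a vetted solution. It assumes that every gene record is directly followed by all of its transcript, exon, etc. records (as described above), it keeps chromosomes in order of first appearance (my own choice, not an Ensembl convention), and the function name `sort_gtf_genes` is made up for illustration.

```shell
# Hypothetical helper: reorder gene blocks by gene start position within
# each chromosome while keeping the record order inside each block.
sort_gtf_genes() {
  awk -F'\t' -v OFS='\t' '
    # Header comments get key 0,0 so they stay at the top.
    /^#/ { print 0, 0, NR, $0; next }
    # A gene record opens a new block and defines the sort key of that block.
    $3 == "gene" {
      if (!($1 in chr_rank)) chr_rank[$1] = ++n_chr
      rank = chr_rank[$1]
      start = $4
    }
    # Decorate every record with: chromosome rank, gene start, input line number.
    { print rank + 0, start + 0, NR, $0 }
  ' |
    sort -t "$(printf '\t')" -k1,1n -k2,2n -k3,3n |   # line number keeps blocks intact
    cut -f4-                                            # undecorate
}

# Possible usage (file name taken from the download above):
#   zcat Homo_sapiens.GRCh38.104.gtf.gz | sort_gtf_genes > Homo_sapiens.GRCh38.104.sorted.gtf
```

Because the original line number is the last sort key, records within a gene keep their Ensembl order, and genes that share a start position are not interleaved. Running the jump-counting check on the output should then report 0.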

Ensembl GTF • 374 views

Ben_Ensembl can you comment?


Hi GenoMax and dlaehnemann,

Thank you for your patience - I've been discussing this issue with my colleagues. The ordering (or lack thereof) in the Ensembl GTF/GFF3 files may have recently changed as a consequence of changes we made to the pipelines that dump the GTF and GFF3 files on the FTP site. However, there are no requirements for GTF or GFF3 files to be ordered: http://gmod.org/wiki/GFF3

Therefore, we plan to continue dumping the GTF and GFF3 files as unordered files.

I hope this helps.

Ben


It's a clarification, so that's appreciated.

One suggestion, as this is now a conscious decision: maybe you could add a quick sentence to the Ensembl GTF/GFF description at https://www.ensembl.org/info/website/upload/gff.html stating that this file format has no standardized sort order and that the Ensembl dumps don't enforce any. I definitely went checking there, and had I found such a statement, I would have known to look no further.

This info could also go into the Ensembl FTP download README files, e.g. at https://ftp.ensembl.org/pub/release-108/gtf/homo_sapiens/README (I also looked there). However, this probably applies to all species across all newer releases, so it would probably require a larger change.


No problem, dlaehnemann, and thank you for the feedback. We plan to add this information to the relevant READMEs on the Ensembl FTP site as soon as possible.
