Forum: GTF files from Ensembl Releases 105 and 106 unsorted
11 weeks ago
dlaehnemann ▴ 10

Update 2: Digging deeper showed that, starting with Ensembl Release 104, gene records in the GTF files no longer appear in ascending order (in Releases 103 and lower, they did). Here's what I found out: Ensembl Release 104 and newer GTF files no longer have genes sorted by position

Update: It seems that these files do not necessarily list genes in order of chromosomal position, and that the order can change from release to release. So you cannot rely on it staying the same, and have to ensure a stable sort yourself.

Long story short:

The GTF files from Ensembl Releases 105 and 106 are not sorted properly, for some reason. If you need them sorted, just avoid those versions (getting them sorted does not seem to be possible in a sane and sound way, I tried). Namely, these files are broken:

Releases 104, 107 and 108 seem to be properly sorted.

I have found no indication of anything relating to this, either with a reasonable web search or by manually looking through the Ensembl release notes and the Ensembl production code repository on GitHub (which I think contains the code that produces those files). As this latter repo does not allow for the creation of issues (at least not for me) and as I did not find any way to file a bug report on the Ensembl website either, I am documenting this problem here, for others to find. This has caused me multiple days of hunting down weird behaviour in a tool, where it eventually turned out that it relies on the input GTFs being sorted.
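For anyone wanting to check a given release themselves, here is a minimal sketch (not an official Ensembl tool) that counts gene records starting before the previous gene record on the same contig. The function name `count_unsorted_genes` is made up for illustration, and it assumes a tab-separated GTF with `gene` in column 3. It only compares consecutive gene records, so it will not notice a contig that reappears later in the file.

```shell
# Hypothetical helper: reads a GTF on stdin and prints how many gene
# records start before the previous gene record on the same contig.
# 0 means genes appear in ascending start order within each contig.
count_unsorted_genes() {
  awk -F'\t' '
    /^#/ { next }                      # skip header comment lines
    $3 == "gene" {
      if ($1 == contig && $4 + 0 < prev) bad++
      contig = $1                      # remember contig of this gene
      prev = $4 + 0                    # remember its start coordinate
    }
    END { print bad + 0 }              # print 0 if nothing was out of order
  '
}

# Usage (file name taken from this thread):
#   zcat Homo_sapiens.GRCh38.105.gtf.gz | count_unsorted_genes
```

A non-zero count only says that gene order is not ascending; it does not say anything about the ordering of features within a gene.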

Ensembl bug GTF • 408 views
11 weeks ago
ATpoint 68k

There is nothing wrong with these files. Sort them (as with any GTF):

zcat Homo_sapiens.GRCh38.105.gtf.gz \
| awk '$1 ~ /^#/ {print $0;next} {print $0 | "sort -k1,1 -k4,4n -k5,5n"}' \
| bgzip > Homo_sapiens.GRCh38.105_sorted.gtf.gz

That said, if you need the file to be strictly coordinate-sorted, then you always have to do that yourself. One example is indexing with tabix, which has never worked on Ensembl files out of the box; you always needed to sort them by coordinate first.

Afaik, the order is that for a given gene the gene record comes first, then a transcript, and then all the exons and other features of that transcript. That is repeated for every transcript of the gene. This is more of a "logical" than a strict coordinate order, and might be meaningful for some parsing purposes where a line-by-line parser would first pick up the gene as a whole and then sequentially all its transcripts and components (CDS, UTR...) without the need to look "ahead" or "back" in the file.

Hence, I do not see a problem here, maybe you assumed that the files were strictly coordinate-sorted? They're not (and never have been).

Edit: As suggested by GenoMax on Slack, a dedicated toolkit such as https://agat.readthedocs.io/en/latest/index.html might be useful for sorting GTF files, especially when the GTFs are not as well-formatted as those from Ensembl, to capture some edge cases in formatting etc.
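If the gene, transcript, feature grouping described above should be kept while the genes themselves are ordered by position, one option is to sort whole gene blocks rather than individual lines. The following is only a sketch under some assumptions: the function name `sort_gene_blocks` is hypothetical, the input is an uncompressed GTF, and every `gene` line directly precedes all lines belonging to that gene (as described for Ensembl files above). Each line is tagged with its gene's start and its original line number, sorted on those keys, and the tags are cut off again.

```shell
# Hypothetical helper: print header comments first, then gene blocks
# ordered by contig and gene start; line order inside a block is kept.
sort_gene_blocks() {
  tab=$(printf '\t')
  grep '^#' "$1" || true               # header lines, if there are any
  grep -v '^#' "$1" \
    | awk -F'\t' -v OFS='\t' '
        $3 == "gene" { start = $4 }    # a gene line opens a new block
        { print $1, start, NR, $0 }    # prefix: contig, gene start, line no.
      ' \
    | LC_ALL=C sort -t "$tab" -k1,1 -k2,2n -k3,3n \
    | cut -f4-                         # drop the three helper columns
}

# Usage (after decompressing the file from this thread):
#   sort_gene_blocks Homo_sapiens.GRCh38.105.gtf > Homo_sapiens.GRCh38.105_blocks.gtf
```

Note that the result is ordered by gene position but is not strictly coordinate-sorted line by line, so it would still not be suitable for tabix indexing.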


I often use the following for GTF/GFF sorting:

sort -k1,1V -k4,4n -k5,5rn -k3,3r some.gff > some.sorted.gff

This sorts first on the sequence name (k1) in natural (version) sort mode (V, which sorts like seq1 seq2 seq10), then on the start coordinate (k4), then in reverse on the end coordinate (k5) so that the gene etc. come above the CDS and UTR of each gene, and last in reverse on the feature type (k3) so that gene comes above CDS if their coordinates are equal.

This captures more cases than the 'default' one, but keep in mind that not everything will be sorted properly.

Also a +1 for the AGAT approach :-)


*cough cough

CGAT will also do this:

cgat gtf2gtf --stdin Homo_sapiens.GRCh38.105.gtf.gz --method=sort --sort-order=[gene|gene+transcript|transcript|position|contig+gene|position+gene|gene+position|gene+exon]

Thanks for the response. The problem actually isn't in the sub-ordering within genes, but that genes were completely out of order, and only in these two versions. With the file format so loosely specified, consistency is important as tools will often implicitly rely on it. But thanks for all the sorting suggestions, some of them I hadn't seen yet (I did try to do the sorting myself for consistency, but couldn't get it to sort correctly with standard sort).
