How much does the sorting order of a GFF file matter?
3.6 years ago
Shred ★ 1.4k

Hi guys,

While editing a GFF3 file (genome annotation) with a custom script, I need to split the file by strand, obtaining a "forward.gff" and a "reverse.gff". After editing, the two files have to be merged back into a single one for downstream analysis.

I noticed that the GFF3 file is sorted by chromosome (1st column) and by the start position (4th column) of each gene, with its associated records (mRNA, exon, etc.) following. So to merge the two files and sort them, I can't just run a sort command on the 1st and 4th fields, because start coordinates are independent between strands.

So here's the question: if I just merge the two files without caring about start coordinates, how much will this affect downstream analysis (read alignment and gene counting)? Does the sorting order of a GFF file truly matter?

Additional question: if someone knows a fast way to sort them back into the order of the original file, please share it.
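
One possible way to get back to a "gene start, then its children" layout is to treat each top-level feature and the lines that follow it as one block, and to sort whole blocks instead of single lines. The sketch below is only an illustration under several assumptions: both files are tab-separated with 9 columns, top-level features are the lines without a Parent attribute, children directly follow their top-level feature, and there is no embedded FASTA section. The function names (read_blocks, merge_by_gene_start, natural_key) are made up for this example.

```python
import re


def natural_key(seqid):
    # Split "chr10" into ["chr", 10, ""] so that chr2 sorts before chr10.
    return [int(p) if p.isdigit() else p for p in re.split(r"(\d+)", seqid)]


def read_blocks(lines):
    """Group lines into blocks: a top-level feature plus the lines after it."""
    blocks = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        cols = line.split("\t")
        # Assumption: a feature without a Parent attribute starts a new block.
        is_top_level = "Parent=" not in cols[8]
        if is_top_level or not blocks:
            blocks.append([line])
        else:
            blocks[-1].append(line)
    return blocks


def block_key(block):
    # Sort a block by the seqid, start and end of its first (top-level) line.
    cols = block[0].split("\t")
    return (natural_key(cols[0]), int(cols[3]), int(cols[4]))


def merge_by_gene_start(forward_lines, reverse_lines):
    """Merge two per-strand files, ordering whole blocks by position."""
    blocks = read_blocks(forward_lines) + read_blocks(reverse_lines)
    blocks.sort(key=block_key)  # lines inside each block keep their order
    return [line for block in blocks for line in block]
```

It could be called with two open file handles, e.g. merge_by_gene_start(open("forward.gff"), open("reverse.gff")), and the returned lines written out joined by newlines. Header directives are dropped in this sketch and would need to be copied over separately.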

gff RNA-Seq annotation • 1.8k views

The AGAT toolkit contains many GFF-related tools. Check whether you can find something usable there. @Juke (the author) participates on Biostars and will likely notice this question too.


^^ you were right, here I am

3.6 years ago
Juke34 8.5k

I briefly raised the topic here: https://github.com/NBISweden/AGAT/wiki/Topological-sorting-of-gff-features .

Does the sorting order of a GFF file truly matter?

I would say it depends on the tools. Most of the time, when it really matters, the tool mentions it in its manual. It also depends on what you mean by "order". In your case, having the reverse-strand features later in the file than the forward-strand ones should not be a problem for most tools. But mixing up the features within the same strand (e.g. having an exon defined before its mRNA) might be more problematic. If you use the sort command you might end up in this situation, as explained here about GNU sort:

Lines with the same chromosomes and start positions would be placed randomly. Therefore, parent feature lines might sometimes be placed after their children lines

Most tools will complain while parsing the file if something is wrong with the sorting.
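
To see whether a given file has the "child before parent" problem described above, a small check along these lines could be run before feeding the file to other tools. This is only a sketch, assuming 9 tab-separated columns and standard GFF3 ID= and Parent= attributes; find_orphans is a hypothetical helper name, not part of AGAT or any other tool.

```python
def find_orphans(lines):
    """Return (line_number, parent_id) pairs where a Parent is referenced
    before any line with that ID has been seen."""
    seen_ids = set()
    problems = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        # Parse the 9th column ("key=value;key=value") into a dict.
        attrs = dict(
            field.split("=", 1)
            for field in line.split("\t")[8].split(";")
            if "=" in field
        )
        # A feature may list several parents separated by commas.
        for parent in attrs.get("Parent", "").split(","):
            if parent and parent not in seen_ids:
                problems.append((number, parent))
        if "ID" in attrs:
            seen_ids.add(attrs["ID"])
    return problems
```

An empty list means every parent appears before its children; otherwise each pair gives the line number of the child and the ID of the parent that had not been seen yet.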


Thanks for the answer. I'm using STAR for alignment and HTSeq for counts.

I'll definitely try to merge the two strand files into a single one, sorting by chromosome and strand, to get a structure like:

chr1     [...]    +
chr1     [...]    -
chr2     [...]    +

This should prevent alterations of the part-of relationships; I'll test it later with the mentioned tools. The linked post suggests gff3-retainids; I'll try it on the unordered merged file to see the results. Which AGAT tool could I try? It needs to just sort, without changing any record's attributes.
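
A rough sketch of that chromosome-then-strand merge, relying on the fact that Python's sorted() is stable: lines with the same chromosome and strand keep the relative order they had in the per-strand files, so parents stay before their children. It assumes 9 tab-separated columns; merge_by_chrom_and_strand and natural_key are names invented for this example.

```python
import re

# Forward strand first, then reverse; anything else ("." or "?") goes last.
STRAND_ORDER = {"+": 0, "-": 1}


def natural_key(seqid):
    # Split "chr10" into ["chr", 10, ""] so that chr2 sorts before chr10.
    return [int(p) if p.isdigit() else p for p in re.split(r"(\d+)", seqid)]


def merge_by_chrom_and_strand(forward_lines, reverse_lines):
    """Concatenate both files and stable-sort by (chromosome, strand)."""
    records = [
        line.rstrip("\n")
        for line in list(forward_lines) + list(reverse_lines)
        if line.strip() and not line.startswith("#")  # drop blanks and directives
    ]

    def key(line):
        cols = line.split("\t")
        return (natural_key(cols[0]), STRAND_ORDER.get(cols[6], 2))

    # sorted() is stable, so the order within each (chromosome, strand)
    # group is exactly the order found in the input files.
    return sorted(records, key=key)
```

No attribute is touched here; only the order of the lines changes.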


agat_convert_sp_gxf2gxf.pl --gff in.gff -o out.sorted.gff


I've tried it (after installing a package missing from the declared dependencies; I'll open an issue on GitHub), but sadly the software does several other things as well, and the result is an altered GFF in which the UTR attributes were completely changed.


Missing UTRs have probably been created. Look at the log to see what the tool has fixed. You can avoid that by using the --no_check parameter.
