Question: Is there a tool that sorts gtf files?
JJ wrote:

Hi all,

I am looking for a tool that sorts GTF annotation files. Can anyone recommend one? I have only come across tools that sort GFF files, such as bedtools or gt.

I am grateful for any suggestions!

Thanks JJ

rna-seq genome
modified 8 weeks ago by Juke-34 • written 2.0 years ago by JJ

Hello,

By which criteria do you want to sort, and why? You can sort by column with the standard Unix command sort.

fin swimmer

modified 2.0 years ago • written 2.0 years ago by finswimmer

Actually, I want to add some annotations to a standard annotation GTF file and then use the standard sorting to put the newly added annotations at their "proper" place.

I was thinking of StringTie --merge as an alternative, but as the annotation file and the new annotations are non-redundant I figured a simple sort should also do the trick.
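
For that use case, a minimal sketch with plain sort could look like the following. It assumes tab-separated GTF files and a sort that understands -t and -k (GNU or BSD). The file names base.gtf, extra.gtf and merged.sorted.gtf are hypothetical, and the toy records are made up for illustration. The header of the base file stays on top, and the records of both files are ordered by chromosome, start and end.

```shell
# Hypothetical file names; the toy records are made up for illustration.
base=base.gtf
extra=extra.gtf
merged=merged.sorted.gtf
tab="$(printf '\t')"

# Toy "standard" annotation: one header line and two features.
printf '##description: toy example\n' > "$base"
printf 'chr1\tsrc\texon\t100\t400\t.\t-\t.\tgene_id "g1";\n' >> "$base"
printf 'chr2\tsrc\texon\t500\t900\t.\t+\t.\tgene_id "g3";\n' >> "$base"

# Toy additional annotation that belongs between the two records above.
printf 'chr1\tnew\texon\t2000\t2500\t.\t+\t.\tgene_id "g2";\n' > "$extra"

# Keep the header of the base file on top, then sort all records by
# chromosome (column 1), start (column 4) and end (column 5).
{
  grep '^#' "$base"
  cat "$base" "$extra" | grep -v '^#' |
    LC_ALL=C sort -t "$tab" -k1,1 -k4,4n -k5,5n
} > "$merged"

# Show chromosome, start and end of the merged file.
cut -f1,4,5 "$merged"
```

Note that -k1,1 under LC_ALL=C orders chromosome names bytewise, so chr10 comes before chr2; with GNU sort, -k1,1V is one way to get a natural order instead.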

written 2.0 years ago by JJ

I want to add some annotations to a standard annotation gtf file and then use the standard sorting to put the newly added annotations at their "proper" place

I had to do the same exact thing not long ago, here is the full recipe just in case it might help ;-)

written 2.0 years ago by erwan.scaon

erwan.scaon wrote:

I recommend sorting with the tool "gff3sort", given that with standard Unix sort, lines with the same chromosome and start position will be placed randomly.

gff3sort avoids this pitfall.
For example:

# Sort your GTF/GFF and bgzip it (replace file.gtf with your own GTF or GFF file)
gff3sort.pl --precise --chr_order natural file.gtf | bgzip > file.gtf.gz

# Create the associated tabix index
tabix -p gff file.gtf.gz
modified 2.0 years ago • written 2.0 years ago by erwan.scaon

I have yet to encounter the first useful example of using gff3sort over the normal Linux sort.

I also find it surprising that this gets published while others are struggling to get really interesting biological work published ...

written 2.0 years ago by lieven.sterck

Hello erwan,

I recommend sorting with the tool "gff3sort", given that with standard Unix sort, lines with the same chromosome and start position will be placed randomly.

gff3sort avoids this pitfall.

Could you please explain why this should be a pitfall? If there are more sorting criteria, I have to define them in some way anyway.

fin swimmer

written 2.0 years ago by finswimmer

I think that what they mean is that a GFF file may need to be sorted by a column where the values are not ordered lexicographically or numerically. For example: mRNA needs to precede exon, and CDS may need to come after exon.

That being said a gff3sort should be a tool that creates the extra columns, translating the values to sortable ones, then a user should use sort directly. It is unlikely that a gff3sort written in perl would be able to compete in performance and features with a standard unix sort.
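
The "extra sortable column" idea can be sketched as a decorate-sort-undecorate pipeline. This is only an illustration under stated assumptions: the input is tab-separated, the file names are hypothetical, the toy records are made up, and the rank table (gene, then transcript/mRNA, then exon, then CDS) is an assumption that would need extending for other feature types. It does not rely on a "Parent=" attribute, so it applies to GTF as well as GFF.

```shell
# Hypothetical file names; the toy records are made up for illustration.
in=features.gtf
out=features.sorted.gtf
tab="$(printf '\t')"

# Four features sharing the same chromosome and start, listed children first.
printf 'chr1\tsrc\tCDS\t100\t300\t.\t+\t0\tgene_id "g1"; transcript_id "t1";\n' > "$in"
printf 'chr1\tsrc\texon\t100\t300\t.\t+\t.\tgene_id "g1"; transcript_id "t1";\n' >> "$in"
printf 'chr1\tsrc\ttranscript\t100\t900\t.\t+\t.\tgene_id "g1"; transcript_id "t1";\n' >> "$in"
printf 'chr1\tsrc\tgene\t100\t900\t.\t+\t.\tgene_id "g1";\n' >> "$in"

# 1. Decorate: prepend a numeric rank derived from the feature type (column 3).
#    Unknown feature types get rank 9 and therefore sort last among ties.
# 2. Sort by chromosome, start, then rank (columns 2, 5, 1 of the decorated line).
# 3. Undecorate: drop the rank column again.
# Comment lines are skipped here for brevity.
grep -v '^#' "$in" |
  awk -F '\t' -v OFS='\t' '
    BEGIN { rank["gene"] = 1; rank["transcript"] = 2; rank["mRNA"] = 2
            rank["exon"] = 3; rank["CDS"] = 4 }
    { r = ($3 in rank) ? rank[$3] : 9; print r, $0 }' |
  LC_ALL=C sort -t "$tab" -k2,2 -k5,5n -k1,1n |
  cut -f2- > "$out"

# Show the resulting order of feature types.
cut -f3 "$out"
```

The rank only breaks ties between lines with the same chromosome and start position; it does not group features by transcript.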

written 2.0 years ago by Istvan Albert

With a bit of reasoning (and knowledge of the GFF format), you can easily achieve all of this with Linux sort ;-)

And +1 for the performance comment!

written 2.0 years ago by lieven.sterck

Does this tool also accept GTF? I can only see GFF stated as input.

modified 2.0 years ago • written 2.0 years ago by JJ

I assumed it would accept GTF when posting (after all, GTF is "GFF2.5", which is really close to GFF3), but since you asked I did a quick check:

Not knowing what your GTF looks like, I took a random example: I ran the gff3sort tool on both the GTF and the GFF3 of the GENCODE M16 comprehensive gene annotation. There were no errors. I then loaded the tracks into IGV and both displayed just fine, which is another good sign. So you should give it a try with your own GTF.

In case you want to re-run the verification:

git clone https://github.com/billzt/gff3sort.git;
cd gff3sort;
axel -q ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M16/gencode.vM16.chr_patch_hapl_scaff.annotation.gtf.gz;
axel -q ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M16/gencode.vM16.chr_patch_hapl_scaff.annotation.gff3.gz;
unpigz *.gz;
perl ./gff3sort.pl --precise --chr_order natural gencode.vM16.chr_patch_hapl_scaff.annotation.gff3 | bgzip > toto.gff3.gz;
tabix -p gff toto.gff3.gz;
perl ./gff3sort.pl --precise --chr_order natural gencode.vM16.chr_patch_hapl_scaff.annotation.gtf | bgzip > toto.gtf.gz;
tabix -p gff toto.gtf.gz;

IGV check: [screenshot of both tracks loaded in IGV]

modified 2.0 years ago • written 2.0 years ago by erwan.scaon

I think the main difference between GFF and GTF is the Parent tag: I have it in the GFF but not in the GTF. I downloaded the main annotation files in both formats from GENCODE. Hence, gff3sort.pl would not work properly on the GTF either...

written 24 months ago by JJ

finswimmer wrote:

gff3sort.pl seems to make sure that lines having no "Parent=" attribute come before those having it, if chromosome and start position are the same. I think with standard Unix programs it would go like this:

(grep -v "Parent=" sortme.gff; grep "Parent=" sortme.gff) | sort -k1,1 -k4,4n -s

EDIT:

Shouldn't we make sure that within these two groups the 5th column is sorted as well? If so, we have to expand the command a little bit:

(grep -v "Parent=" sortme.gff | sort -k1,1 -k4,4n -k5,5n; grep "Parent=" sortme.gff | sort -k1,1 -k4,4n -k5,5n) | sort -k1,1 -k4,4n -s

If more speed is required we can use GNU parallel. The -k (--keep-order) option keeps the output of the two jobs in the order they were given, which the final stable sort relies on:

parallel -k ::: 'grep -v "Parent=" sortme.gff | sort -k1,1 -k4,4n -k5,5n' 'grep "Parent=" sortme.gff | sort -k1,1 -k4,4n -k5,5n' | sort -k1,1 -k4,4n -s

fin swimmer

modified 2.0 years ago • written 2.0 years ago by finswimmer

Thanks for this. However, I don't have the "Parent=" tag in the gtf file - I downloaded the main annotation file from gencode. Hence, your solution does not work for me ... any other suggestions? thanks!

written 24 months ago by JJ

Juke-34 wrote:

This blog talks about it: https://zhiganglu.com/post/sort-gff-topologically/

As I explain here you can use AGAT

The script to use is agat_sp_gxf_to_gff3.pl
You will have to play with the parameter -gvo to get back a gtf as output.

modified 8 weeks ago • written 8 weeks ago by Juke-34

kvshamsudheen wrote:

One could use the method explained here. I am reproducing it below:

$ wget --no-check-certificate https://raw.github.com/ctokheim/PrimerSeq/master/gtf.py -O gtf.py  # get command line script
$ python gtf.py -c your_gtf_file.gtf  # check if GTF is sorted
your_gtf_file.gtf is not correctly sorted. please sort before use.
$ python gtf.py -i your_gtf_file.gtf -o your_gtf_file.sorted.gtf  # GTF was not sorted, so sort it
  
modified 8 weeks ago by RamRS • written 8 weeks ago by kvshamsudheen

Just a heads up: this script was last updated in April 2016, as opposed to gff3sort.pl, which was updated in Feb 2019. That is by no means a definitive measure of relevance, but if I had to pick, I'd go for the more recently updated one.

written 8 weeks ago by RamRS

ATpoint wrote:

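awk one-liner (header lines stay on top; the records are sorted by chromosome and start, which are columns 1 and 4 in a GTF):

awk '$1 ~ /^#/ {print $0;next} {print $0 | "sort -k1,1 -k4,4n"}' in.gtf > out_sorted.gtf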
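
To check whether a file is already ordered by chromosome and start, sort -c can help. The following is a minimal sketch, assuming tab-separated input and a sort that supports -c and -s (GNU sort does). The function name is_sorted and the file names are hypothetical, and the toy records are made up for illustration.

```shell
# Hypothetical file names; the toy records are made up for illustration.
printf 'chr1\tsrc\texon\t100\t400\t.\t+\t.\tgene_id "g1";\n' > sorted.gtf
printf 'chr1\tsrc\texon\t2000\t2500\t.\t+\t.\tgene_id "g2";\n' >> sorted.gtf
printf 'chr1\tsrc\texon\t2000\t2500\t.\t+\t.\tgene_id "g2";\n' > unsorted.gtf
printf 'chr1\tsrc\texon\t100\t400\t.\t+\t.\tgene_id "g1";\n' >> unsorted.gtf

# Succeeds if the non-comment lines are ordered by chromosome (column 1)
# and start (column 4). -c only checks, it does not sort; -s makes ties
# on both keys acceptable in any order.
is_sorted() {
  grep -v '^#' "$1" |
    LC_ALL=C sort -c -s -t "$(printf '\t')" -k1,1 -k4,4n 2>/dev/null
}

for f in sorted.gtf unsorted.gtf; do
  if is_sorted "$f"; then
    echo "$f: sorted"
  else
    echo "$f: not sorted"
  fi
done
```

The check only covers the two keys given; it says nothing about whether parent features precede their children.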
written 8 weeks ago by ATpoint