Question: How to order a gff3 file by coordinates
18 months ago by
Spain. Universidad de Córdoba
Antonio R. Franco4.1k wrote:

I have discovered that the gene, mRNA and CDS records in my gff3 file are not in coordinate order. An example:

    LG1     phytozomev10    gene    10835748        10846741        .       -       .       ID=gene00257-v1.0-hybrid.v1.1;Name=gene00257-v1.0-hybrid
    LG1     phytozomev10    mRNA    10835748        10846741        .       -       .       ID=mrna00257.1-v1.0-hybrid.v1.1;Name=mrna00257.1-v1.0-hybrid;pacid=27244575;longest=1;Parent=gene00257-v1.0-hybrid.v1.1
    LG1     phytozomev10    CDS     10846566        10846741        .       -       2       ID=mrna00257.1-v1.0-hybrid.v1.1.CDS.1;Parent=mrna00257.1-v1.0-hybrid.v1.1;pacid=27244575
    LG1     phytozomev10    CDS     10841035        10841272        .       -       0       ID=mrna00257.1-v1.0-hybrid.v1.1.CDS.2;Parent=mrna00257.1-v1.0-hybrid.v1.1;pacid=27244575
    LG1     phytozomev10    CDS     10840828        10840916        .       -       2       ID=mrna00257.1-v1.0-hybrid.v1.1.CDS.3;Parent=mrna00257.1-v1.0-hybrid.v1.1;pacid=27244575
    LG1     phytozomev10    CDS     10839072        10839109        .       -       0       ID=mrna00257.1-v1.0-hybrid.v1.1.CDS.4;Parent=mrna00257.1-v1.0-hybrid.v1.1;pacid=27244575
    LG1     phytozomev10    CDS     10837291        10838461        .       -       1       ID=mrna00257.1-v1.0-hybrid.v1.1.CDS.5;Parent=mrna00257.1-v1.0-hybrid.v1.1;pacid=27244575
    LG1     phytozomev10    CDS     10836538        10836623        .       -       0       ID=mrna00257.1-v1.0-hybrid.v1.1.CDS.6;Parent=mrna00257.1-v1.0-hybrid.v1.1;pacid=27244575
    LG1     phytozomev10    CDS     10835748        10835776        .       -       1       ID=mrna00257.1-v1.0-hybrid.v1.1.CDS.7;Parent=mrna00257.1-v1.0-hybrid.v1.1;pacid=27244575
    LG1     phytozomev10    gene    10862624        10876720        .       +       .       ID=gene00258-v1.0-hybrid.v1.1;Name=gene00258-v1.0-hybrid
    LG1     phytozomev10    mRNA    10862624        10876720        .       +       .       ID=mrna00258.1-v1.0-hybrid.v1.1;Name=mrna00258.1-v1.0-hybrid;pacid=27244449;longest=1;Parent=gene00258-v1.0-hybrid.v1.1
    LG1     phytozomev10    CDS     10862624        10862667        .       +       0       ID=mrna00258.1-v1.0-hybrid.v1.1.CDS.1;Parent=mrna00258.1-v1.0-hybrid.v1.1;pacid=27244449
    LG1     phytozomev10    CDS     10862746        10863050        .       +       1       ID=mrna00258.1-v1.0-hybrid.v1.1.CDS.2;Parent=mrna00258.1-v1.0-hybrid.v1.1;pacid=27244449
    LG1     phytozomev10    CDS     10863146        10863223        .       +       2       ID=mrna00258.1-v1.0-hybrid.v1.1.CDS.3;Parent=mrna00258.1-v1.0-hybrid.v1.1;pacid=27244449
    LG1     phytozomev10    CDS     10864316        10864463        .       +       2       ID=mrna00258.1-v1.0-hybrid.v1.1.CDS.4;Parent=mrna00258.1-v1.0-hybrid.v1.1;pacid=27244449
    LG1     phytozomev10    CDS     10864616        10864850        .       +       1       ID=mrna00258.1-v1.0-hybrid.v1.1.CDS.5;Parent=mrna00258.1-v1.0-hybrid.v1.1;pacid=27244449

antonio@PC-DESPACHO:/mnt/d/Dropbox/Fvesca Anotacion/annotation$ cat Fvesca_226_v1.1.gene.gff3 | grep "LG1" | tail

    LG1     phytozomev10    CDS     8224242 8224301 .       +       0       ID=mrna35195.1-v1.0-hybrid.v1.1.CDS.5;Parent=mrna35195.1-v1.0-hybrid.v1.1;pacid=27245751
    LG1     phytozomev10    gene    8551588 8551849 .       -       .       ID=gene35196-v1.0-hybrid.v1.1;Name=gene35196-v1.0-hybrid
    LG1     phytozomev10    mRNA    8551588 8551849 .       -       .       ID=mrna35196.1-v1.0-hybrid.v1.1;Name=mrna35196.1-v1.0-hybrid;pacid=27245617;longest=1;Parent=gene35196-v1.0-hybrid.v1.1
    LG1     phytozomev10    CDS     8551817 8551849 .       -       0       ID=mrna35196.1-v1.0-hybrid.v1.1.CDS.1;Parent=mrna35196.1-v1.0-hybrid.v1.1;pacid=27245617
    LG1     phytozomev10    CDS     8551588 8551713 .       -       0       ID=mrna35196.1-v1.0-hybrid.v1.1.CDS.2;Parent=mrna35196.1-v1.0-hybrid.v1.1;pacid=27245617
    LG1     phytozomev10    gene    8554333 8555118 .       +       .       ID=gene35197-v1.0-hybrid.v1.1;Name=gene35197-v1.0-hybrid
    LG1     phytozomev10    mRNA    8554333 8555118 .       +       .       ID=mrna35197.1-v1.0-hybrid.v1.1;Name=mrna35197.1-v1.0-hybrid;pacid=27245093;longest=1;Parent=gene35197-v1.0-hybrid.v1.1
    LG1     phytozomev10    CDS     8554333 8554716 .       +       0       ID=mrna35197.1-v1.0-hybrid.v1.1.CDS.1;Parent=mrna35197.1-v1.0-hybrid.v1.1;pacid=27245093
    LG1     phytozomev10    CDS     8554791 8554854 .       +       0       ID=mrna35197.1-v1.0-hybrid.v1.1.CDS.2;Parent=mrna35197.1-v1.0-hybrid.v1.1;pacid=27245093
    LG1     phytozomev10    CDS     8554946 8555118 .       +       2       ID=mrna35197.1-v1.0-hybrid.v1.1.CDS.3;Parent=mrna35197.1-v1.0-hybrid.v1.1;pacid=27245093

As you can see, the first gene/mRNA starts at coordinate 10835748 and contains a certain number of exons. But the tail of the same gff3 file shows features at lower coordinates, such as a CDS starting at 8224242. In other words, the file is not sorted by coordinate at all.

I would like to sort the gff3 by coordinates while keeping each gene and mRNA line together with its corresponding CDS (exon) lines. Which gene each CDS belongs to is recorded in the CDS lines themselves.

18 months ago by
VIB, Ghent, Belgium
lieven.sterck5.6k wrote:

Depending on the keys that are present in your gff file, this one-liner will get you a long way:

sort -k1,1V -k4,4n -k5,5rn -k3,3r some.gff > some.sorted.gff

It might still need a little tweaking, though.
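If the one-liner splits a gene from its children, a small script can sort whole gene blocks instead of individual lines. The following is only a minimal sketch, not part of the answer above. It assumes a tab-separated file, `ID`/`Parent` attributes like those in the question, and that the lines within each gene are already in the order you want to keep. The function names are made up for illustration. Because it ranks groups rather than individual lines, it does not rely on the alphabetical order of the feature type in column 3.

```python
import re


def natural_key(name):
    """Split 'LG10' into ['LG', 10, ''] so that LG2 sorts before LG10."""
    return [int(tok) if tok.isdigit() else tok for tok in re.split(r"(\d+)", name)]


def parse_attrs(col9):
    """Turn 'ID=x;Parent=y' into {'ID': 'x', 'Parent': 'y'}."""
    return dict(kv.split("=", 1) for kv in col9.strip().split(";") if "=" in kv)


def sort_gff3_by_gene(lines):
    """Return the lines with gene blocks ordered by seqid and start.

    Assumptions (hypothetical helper, not a standard tool):
    - columns are tab-separated and every feature line has 9 columns;
    - all '#' lines are moved to the top of the output;
    - for multi-parent features only the first Parent is followed;
    - the order of lines inside each gene block is left untouched.
    """
    headers, feats = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        if line.startswith("#"):
            headers.append(line)
            continue
        cols = line.split("\t")
        feats.append((cols, parse_attrs(cols[8]), line))

    # Map each feature ID to its (first) parent ID.
    parent_of = {}
    for cols, attrs, _ in feats:
        if "ID" in attrs and "Parent" in attrs:
            parent_of[attrs["ID"]] = attrs["Parent"].split(",")[0]

    def root_id(attrs, fallback):
        """Walk up the Parent chain to the top-level feature (usually the gene)."""
        node = (attrs.get("Parent") or attrs.get("ID") or fallback).split(",")[0]
        seen = set()
        while node in parent_of and node not in seen:
            seen.add(node)
            node = parent_of[node]
        return node

    # Collect lines per top-level feature, remembering the smallest start.
    groups = {}
    for idx, (cols, attrs, line) in enumerate(feats):
        key = root_id(attrs, "__line%d" % idx)
        start = int(cols[3])
        group = groups.setdefault(key, {"seqid": cols[0], "start": start, "lines": []})
        group["start"] = min(group["start"], start)
        group["lines"].append(line)

    ordered = sorted(groups.values(), key=lambda g: (natural_key(g["seqid"]), g["start"]))
    return headers + [ln for g in ordered for ln in g["lines"]]
```

To use it on a file, read the lines with `open("some.gff")`, pass them to `sort_gff3_by_gene`, and write the returned lines joined by newlines to `some.sorted.gff`.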

12 months ago by
United States
cmdcolin1.2k wrote:

There is the GFF3Sort tool from https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-017-1930-3

This tool performs a "topological sort", which places each gene above its mRNA, and each mRNA above its subfeatures.

There is also GenomeTools which can do

gt gff3 -sortlines yourfile.gff > yourfile.sorted.gff

This does not guarantee a topological sort.

Then there is the GNU sort approach mentioned by lieven.sterck, but this is not intrinsically topological either, unless the extra options, such as sorting on column 3, induce that.

All three methods should be compatible with tabix though, and tabix doesn't require topological sort
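Whether a given output happens to be topologically sorted can be checked directly. Here is a rough sketch of such a check, assuming a tab-separated file and `ID`/`Parent` attributes; the function name is hypothetical. It reports the first feature whose parent has not yet appeared, which also flags parents that are missing from the file altogether.

```python
def first_topology_violation(lines):
    """Return (line_number, parent_id) for the first feature that is listed
    before its parent, or None if every parent precedes its children.

    Assumes tab-separated GFF3 with 'ID=' and 'Parent=' in column 9.
    Line numbers are 1-based and count every input line, including comments.
    """
    seen_ids = set()
    for number, line in enumerate(lines, 1):
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        attrs = dict(kv.split("=", 1) for kv in cols[8].strip().split(";") if "=" in kv)
        # A feature may list several parents separated by commas.
        for parent in attrs.get("Parent", "").split(","):
            if parent and parent not in seen_ids:
                return (number, parent)
        if "ID" in attrs:
            seen_ids.add(attrs["ID"])
    return None
```

Running it on the output of each sorting method would show which of them keep genes above their mRNAs and CDS lines for your particular file.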


The GenomeTools command gt gff3 -sort yourfile.gff > yourfile.sorted.gff is also available and uses a slightly different algorithm than -sortlines. It could be of interest.

Reply written 12 months ago by cmdcolin