Convert an 'intron-style' GFF3 file into an 'exon-style' GFF3 file
10.2 years ago
Dan ▴ 530

I have a GFF3 file that doesn't have exons; instead it has introns, UTRs, and start and stop codons:

0001.scaffold00002      AUGUSTUS        gene    1386    2772    0.12    +       .       ID=Bv_00001z1_qhas;Name=Bv_00001z1_qhas
0001.scaffold00002      AUGUSTUS        mRNA    1386    2772    0.12    +       .       ID=Bv_00001z1_qhas.t1;Parent=Bv_00001z1_qhas;Name=Bv_00001z1_qhas.t1 0%;Note=cDNAcoverage_0%
0001.scaffold00002      AUGUSTUS        five_prime_UTR  1386    1976    .       +       .       ID=Bv_00001z1_qhas.t1.UTR;Parent=Bv_00001z1_qhas.t1
0001.scaffold00002      AUGUSTUS        start_codon     1977    1979    .       +       0       ID=Bv_00001z1_qhas.t1.start_codon;Parent=Bv_00001z1_qhas.t1
0001.scaffold00002      AUGUSTUS        CDS     1977    2325    0.96    +       0       ID=Bv_00001z1_qhas.t1.CDS;Parent=Bv_00001z1_qhas.t1
0001.scaffold00002      AUGUSTUS        intron  2326    2619    0.81    +       .       ID=Bv_00001z1_qhas.t1.intron;Parent=Bv_00001z1_qhas.t1
0001.scaffold00002      AUGUSTUS        CDS     2620    2747    0.8     +       2       ID=Bv_00001z1_qhas.t1.CDS;Parent=Bv_00001z1_qhas.t1
0001.scaffold00002      AUGUSTUS        stop_codon      2745    2747    .       +       0       ID=Bv_00001z1_qhas.t1.stop_codon;Parent=Bv_00001z1_qhas.t1
0001.scaffold00002      AUGUSTUS        three_prime_UTR 2748    2772    .       +       .       ID=Bv_00001z1_qhas.t1.UTR;Parent=Bv_00001z1_qhas.t1

I can convert this to 'exon-style' by calculating the exons from the above, but I'm wondering if there is an 'off the shelf' solution?

Cheers,
Dan.

GFF3 intron exon format conversion • 5.5k views
10.2 years ago
Dan ▴ 530

Actually, this can be done with GenomeTools. The dupfeat command duplicates features of the type given by -source and outputs the copies with the type given by -dest. The mergefeat command then merges adjacent features of the same type:

gt dupfeat -dest exon -source CDS your.gff3 \
  | gt dupfeat -dest exon -source three_prime_UTR \
  | gt dupfeat -dest exon -source five_prime_UTR \
  | gt mergefeat \
  | gt gff3 -retainids -sort -tidy -o your.new.gff3

Pretty slick!
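
For anyone curious what the duplicate-then-merge idea amounts to, here is a minimal Python sketch applied to the coordinates from the question. It is not GenomeTools code: the function name `merge_adjacent` is made up for illustration, and treating overlapping intervals the same as abutting ones is an assumption of this sketch.

```python
# Sketch only: copy the CDS/UTR intervals of one transcript as "exon"
# intervals, then merge the ones that touch. Names are illustrative and
# not taken from GenomeTools.

def merge_adjacent(intervals):
    """Merge 1-based inclusive (start, end) intervals that overlap or abut."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            # Overlaps or directly follows the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

# CDS and UTR coordinates of Bv_00001z1_qhas.t1 from the question.
parts = [
    (1386, 1976),  # five_prime_UTR
    (1977, 2325),  # CDS
    (2620, 2747),  # CDS
    (2748, 2772),  # three_prime_UTR
]

exons = merge_adjacent(parts)
print(exons)  # [(1386, 2325), (2620, 2772)]
```

The two resulting intervals sit on either side of the intron at 2326-2619, which is what you would expect for the example transcript.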

good to know

GenomeTools is the answer for many questions I have about GFF3 processing!

10.2 years ago
Dan ▴ 530

Here is my answer in full, complicated by the fact that the dumb format wasn't consistent in its stupidity:

10.2 years ago

It looks like your CDS features are the exons, except that the CDS features also include the stop codon, which is not actually part of the mRNA.

I don't think that there is a tool to do what you need in one step.

Right, the CDS are the exons except when interrupted by a start (stop) codon, in which case the exon includes the five (three) prime UTR.... I guess?

The definitions of these are actually a lot more complicated, and I suspect tool developers may be a little cavalier in labeling. I would not be surprised if there were inconsistencies along the way. It all depends on what the file is needed for.

Exon: http://www.sequenceontology.org/browser/current_svn/term/SO:0000147

CDS: http://www.sequenceontology.org/browser/current_svn/term/SO:0000316

The definitions are (now) clear (and the GFF validates OK). The pain is knowing whether your CDS abuts a five (three) prime UTR (or both!) and whether your five (three) prime UTR is a separate exon... Actually, my solution has been ignoring the intron features, and these are what let me solve it! I'll post Perl when I'm done.
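
In the meantime, a rough Python sketch of the intron-based idea (not the Perl mentioned above, and not a full GFF3 parser): take the mRNA span and cut out the introns, and whatever is left is the exons. The function name `exons_from_introns` is hypothetical, and the sketch assumes 1-based inclusive coordinates and that every intron lies inside the transcript.

```python
# Sketch only: derive exons for one transcript as "mRNA span minus introns".
# The function name is illustrative; coordinates are 1-based inclusive.

def exons_from_introns(tx_start, tx_end, introns):
    """Return the exon intervals left after removing introns from the span."""
    exons = []
    pos = tx_start  # start of the exon currently being built
    for i_start, i_end in sorted(introns):
        if i_start > pos:
            # Everything from pos up to the base before the intron is exonic.
            exons.append((pos, i_start - 1))
        # The next exon can start no earlier than the base after the intron.
        pos = max(pos, i_end + 1)
    if pos <= tx_end:
        # Final exon runs to the end of the transcript.
        exons.append((pos, tx_end))
    return exons

# mRNA 1386-2772 with one intron at 2326-2619, as in the question.
print(exons_from_introns(1386, 2772, [(2326, 2619)]))
# [(1386, 2325), (2620, 2772)]
```

This sidesteps the question of which CDS abuts which UTR, because the UTR and CDS features are never consulted.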
