Convert an 'intron-style' GFF3 file into an 'exon-style' GFF3 file
3
2
Entering edit mode
11.1 years ago
Dan ▴ 540

I have a GFF3 file that doesn't have exons, instead it has introns, UTRs, start and stop codons:

0001.scaffold00002      AUGUSTUS        gene    1386    2772    0.12    +       .       ID=Bv_00001z1_qhas;Name=Bv_00001z1_qhas
0001.scaffold00002      AUGUSTUS        mRNA    1386    2772    0.12    +       .       ID=Bv_00001z1_qhas.t1;Parent=Bv_00001z1_qhas;Name=Bv_00001z1_qhas.t1 0%;Note=cDNAcoverage_0%
0001.scaffold00002      AUGUSTUS        five_prime_UTR  1386    1976    .       +       .       ID=Bv_00001z1_qhas.t1.UTR;Parent=Bv_00001z1_qhas.t1
0001.scaffold00002      AUGUSTUS        start_codon     1977    1979    .       +       0       ID=Bv_00001z1_qhas.t1.start_codon;Parent=Bv_00001z1_qhas.t1
0001.scaffold00002      AUGUSTUS        CDS     1977    2325    0.96    +       0       ID=Bv_00001z1_qhas.t1.CDS;Parent=Bv_00001z1_qhas.t1
0001.scaffold00002      AUGUSTUS        intron  2326    2619    0.81    +       .       ID=Bv_00001z1_qhas.t1.intron;Parent=Bv_00001z1_qhas.t1
0001.scaffold00002      AUGUSTUS        CDS     2620    2747    0.8     +       2       ID=Bv_00001z1_qhas.t1.CDS;Parent=Bv_00001z1_qhas.t1
0001.scaffold00002      AUGUSTUS        stop_codon      2745    2747    .       +       0       ID=Bv_00001z1_qhas.t1.stop_codon;Parent=Bv_00001z1_qhas.t1
0001.scaffold00002      AUGUSTUS        three_prime_UTR 2748    2772    .       +       .       ID=Bv_00001z1_qhas.t1.UTR;Parent=Bv_00001z1_qhas.t1

I can convert this to 'exon-style' by calculating the exons from the above, but I'm wondering if there is an 'off the shelf' solution?

Cheers,
Dan.

GFF3 intron exon format conversion • 6.2k views
ADD COMMENT
5
Entering edit mode
11.1 years ago
Dan ▴ 540

Actually, this can be done with GenomeTools. The dupfeat command duplicates features of type -source and outputs the copies with type dest. The mergefeat command merges adjacent features of the same type:

gt dupfeat -dest exon -source CDS your.gff3 \
  | gt dupfeat -dest exon -source three_prime_UTR \
  | gt dupfeat -dest exon -source five_prime_UTR \
  | gt mergefeat \
  | gt gff3 -retainids -sort -tidy -o your.new.gff3

Pretty slick!

ADD COMMENT
0
Entering edit mode

good to know

ADD REPLY
0
Entering edit mode

GenomeTools is the answer for many questions I have about GFF3 processing!

ADD REPLY
2
Entering edit mode
11.1 years ago
Dan ▴ 540

Here is my answer in full, complicated by the fact that the dumb format wasn't consistent in it's stupidity:

#! perl
use strict;
use warnings;
die "pass an intron-style GFF for me to calcualte exons over\n"
unless @ARGV;
## Tracking variables
my $prev_type = 'dumped';
my $prev_beg = -1;
my $prev_end = -1;
## Variables used for formatting only
my $prev_seq;
my $prev_ori;
my $prev_parent;
while(<>){
if (/^#/){
print unless /^###$/;
next;
}
my ($seq, $meth, $type, $beg, $end, $score, $ori, $phase, $nine)
= split /\t/;
## Time to dump any pending exonic regions and reset our counters...
if($type eq 'gene' ||
$type eq 'mRNA' ||
$type eq 'intron'){
if ($prev_type ne 'dumped'){
print join("\t", $prev_seq, "AUGUSTUS", "exon",
$prev_beg, $prev_end, '.', $prev_ori, '.',
"Parent=$prev_parent"), "\n";
$prev_type = 'dumped';
$prev_beg = -1;
$prev_end = -1;
}
## Grab some book keeping stuff...
if ($type eq 'mRNA'){
$prev_seq = $seq;
$prev_ori = $ori;
die unless /\tID=(\S+?);/;
$prev_parent = $1;
}
}
## One exception is when a UTR follows a UTR. This handy GFF
## decides not to put an intron here... just for consistency!
elsif (($type eq 'five_prime_UTR' ||
$type eq 'three_prime_UTR') && $type eq $prev_type){
print join("\t", $prev_seq, "AUGUSTUS", "exon",
$prev_beg, $prev_end, '.', $prev_ori, '.',
"Parent=$prev_parent"), "\n";
$prev_type = $type;
$prev_beg = $beg;
$prev_end = $end;
}
## Er.. back where we were...
elsif ($type eq 'five_prime_UTR' ||
$type eq 'three_prime_UTR' ||
$type eq 'CDS'){
## We don't dump an exon here!
## We just set the start position if needed
$prev_beg = $beg if $prev_type eq 'dumped';
## and increment the end
$prev_end = $end;
## and record this to handle the exception above...
$prev_type = $type;
}
## We don't care about any other types
print;
}
## Finishing touches:
if ($prev_type ne 'dumped'){
print join("\t", $prev_seq, "AUGUSTUS", "exon",
$prev_beg, $prev_end, '.', $prev_ori, '.',
"Parent=$prev_parent"), "\n";
$prev_type = 'dumped';
}
warn "OK\n";
view raw gistfile1.pl hosted with ❤ by GitHub

ADD COMMENT
0
Entering edit mode
11.1 years ago

Looks like your CDS' are the exons, only that the CDS' also include the stop codon that is not actually part of the mRNA.

I don't think that there is a tool to do what you need in one step.

ADD COMMENT
0
Entering edit mode

Right, the CDS are the exons except when interrupted by a start (stop) codon, in which case the exon includes the five (three) prime UTR.... I guess?

ADD REPLY
0
Entering edit mode

the definition for these is actually a lot more complicated, and I suspect tool developers may be a little cavalier in labeling. I would not be surprised if there were inconsistencies along the way. It all depends what is the file needed for.

Exon: http://www.sequenceontology.org/browser/current_svn/term/SO:0000147

CDS: http://www.sequenceontology.org/browser/current_svn/term/SO:0000316

ADD REPLY
0
Entering edit mode

The definitions are (now) clear (and the GFF validates OK), the pain is knowing if your CDS abuts a five (three) prime UTR (or both!) and if your five (three) prime UTR is a separate exon... Actually, my solution has been ignoring the intron features, these let me solve it actually! I'll post Perl when I'm done.

ADD REPLY

Login before adding your answer.

Traffic: 2321 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6