Convert an 'intron-style' GFF3 file into an 'exon-style' GFF3 file

2

Entering edit mode

11.1 years ago

Dan ▴ 540

I have a GFF3 file that doesn't have exons, instead it has introns, UTRs, start and stop codons:

0001.scaffold00002      AUGUSTUS        gene    1386    2772    0.12    +       .       ID=Bv_00001z1_qhas;Name=Bv_00001z1_qhas
0001.scaffold00002      AUGUSTUS        mRNA    1386    2772    0.12    +       .       ID=Bv_00001z1_qhas.t1;Parent=Bv_00001z1_qhas;Name=Bv_00001z1_qhas.t1 0%;Note=cDNAcoverage_0%
0001.scaffold00002      AUGUSTUS        five_prime_UTR  1386    1976    .       +       .       ID=Bv_00001z1_qhas.t1.UTR;Parent=Bv_00001z1_qhas.t1
0001.scaffold00002      AUGUSTUS        start_codon     1977    1979    .       +       0       ID=Bv_00001z1_qhas.t1.start_codon;Parent=Bv_00001z1_qhas.t1
0001.scaffold00002      AUGUSTUS        CDS     1977    2325    0.96    +       0       ID=Bv_00001z1_qhas.t1.CDS;Parent=Bv_00001z1_qhas.t1
0001.scaffold00002      AUGUSTUS        intron  2326    2619    0.81    +       .       ID=Bv_00001z1_qhas.t1.intron;Parent=Bv_00001z1_qhas.t1
0001.scaffold00002      AUGUSTUS        CDS     2620    2747    0.8     +       2       ID=Bv_00001z1_qhas.t1.CDS;Parent=Bv_00001z1_qhas.t1
0001.scaffold00002      AUGUSTUS        stop_codon      2745    2747    .       +       0       ID=Bv_00001z1_qhas.t1.stop_codon;Parent=Bv_00001z1_qhas.t1
0001.scaffold00002      AUGUSTUS        three_prime_UTR 2748    2772    .       +       .       ID=Bv_00001z1_qhas.t1.UTR;Parent=Bv_00001z1_qhas.t1

I can convert this to 'exon-style' by calculating the exons from the above, but I'm wondering if there is an 'off the shelf' solution?

Cheers,
Dan.

GFF3 intron exon format conversion • 6.2k views

ADD COMMENT • link updated 3.8 years ago by Ram 45k • written 11.1 years ago by Dan ▴ 540

5

Entering edit mode

11.1 years ago

Dan ▴ 540

Actually, this can be done with GenomeTools. The dupfeat command duplicates features of type -source and outputs the copies with type dest. The mergefeat command merges adjacent features of the same type:

gt dupfeat -dest exon -source CDS your.gff3 \
  | gt dupfeat -dest exon -source three_prime_UTR \
  | gt dupfeat -dest exon -source five_prime_UTR \
  | gt mergefeat \
  | gt gff3 -retainids -sort -tidy -o your.new.gff3

Pretty slick!

ADD COMMENT • link updated 3.8 years ago by Ram 45k • written 11.1 years ago by Dan ▴ 540

0

Entering edit mode

good to know

ADD REPLY • link 11.1 years ago by Istvan Albert 102k

0

Entering edit mode

GenomeTools is the answer for many questions I have about GFF3 processing!

ADD REPLY • link 10.8 years ago by Daniel Standage 4.1k

2

Entering edit mode

11.1 years ago

Dan ▴ 540

Here is my answer in full, complicated by the fact that the dumb format wasn't consistent in it's stupidity:

	#! perl

	use strict;
	use warnings;

	die "pass an intron-style GFF for me to calcualte exons over\n"
	unless @ARGV;

	## Tracking variables
	my $prev_type = 'dumped';
	my $prev_beg = -1;
	my $prev_end = -1;

	## Variables used for formatting only
	my $prev_seq;
	my $prev_ori;
	my $prev_parent;

	while(<>){
	if (/^#/){
	print unless /^###$/;
	next;
	}

	my ($seq, $meth, $type, $beg, $end, $score, $ori, $phase, $nine)
	= split /\t/;

	## Time to dump any pending exonic regions and reset our counters...
	if($type eq 'gene' \|\|
	$type eq 'mRNA' \|\|
	$type eq 'intron'){
	if ($prev_type ne 'dumped'){
	print join("\t", $prev_seq, "AUGUSTUS", "exon",
	$prev_beg, $prev_end, '.', $prev_ori, '.',
	"Parent=$prev_parent"), "\n";
	$prev_type = 'dumped';
	$prev_beg = -1;
	$prev_end = -1;
	}
	## Grab some book keeping stuff...
	if ($type eq 'mRNA'){
	$prev_seq = $seq;
	$prev_ori = $ori;
	die unless /\tID=(\S+?);/;
	$prev_parent = $1;
	}
	}

	## One exception is when a UTR follows a UTR. This handy GFF
	## decides not to put an intron here... just for consistency!
	elsif (($type eq 'five_prime_UTR' \|\|
	$type eq 'three_prime_UTR') && $type eq $prev_type){
	print join("\t", $prev_seq, "AUGUSTUS", "exon",
	$prev_beg, $prev_end, '.', $prev_ori, '.',
	"Parent=$prev_parent"), "\n";
	$prev_type = $type;
	$prev_beg = $beg;
	$prev_end = $end;
	}

	## Er.. back where we were...
	elsif ($type eq 'five_prime_UTR' \|\|
	$type eq 'three_prime_UTR' \|\|
	$type eq 'CDS'){

	## We don't dump an exon here!

	## We just set the start position if needed
	$prev_beg = $beg if $prev_type eq 'dumped';

	## and increment the end
	$prev_end = $end;

	## and record this to handle the exception above...
	$prev_type = $type;
	}

	## We don't care about any other types
	print;
	}

	## Finishing touches:
	if ($prev_type ne 'dumped'){
	print join("\t", $prev_seq, "AUGUSTUS", "exon",
	$prev_beg, $prev_end, '.', $prev_ori, '.',
	"Parent=$prev_parent"), "\n";
	$prev_type = 'dumped';
	}

	warn "OK\n";

view raw gistfile1.pl hosted with ❤ by GitHub

ADD COMMENT • link updated 5.5 years ago by Ram 45k • written 11.1 years ago by Dan ▴ 540

0

Entering edit mode

11.1 years ago

Istvan Albert 102k

Looks like your CDS' are the exons, only that the CDS' also include the stop codon that is not actually part of the mRNA.

I don't think that there is a tool to do what you need in one step.

ADD COMMENT • link 11.1 years ago by Istvan Albert 102k

0

Entering edit mode

Right, the CDS are the exons except when interrupted by a start (stop) codon, in which case the exon includes the five (three) prime UTR.... I guess?

ADD REPLY • link 11.1 years ago by Dan ▴ 540

0

Entering edit mode

the definition for these is actually a lot more complicated, and I suspect tool developers may be a little cavalier in labeling. I would not be surprised if there were inconsistencies along the way. It all depends what is the file needed for.

Exon: http://www.sequenceontology.org/browser/current_svn/term/SO:0000147

CDS: http://www.sequenceontology.org/browser/current_svn/term/SO:0000316

ADD REPLY • link updated 5.5 years ago by Ram 45k • written 11.1 years ago by Istvan Albert 102k

0

Entering edit mode

The definitions are (now) clear (and the GFF validates OK), the pain is knowing if your CDS abuts a five (three) prime UTR (or both!) and if your five (three) prime UTR is a separate exon... Actually, my solution has been ignoring the intron features, these let me solve it actually! I'll post Perl when I'm done.

ADD REPLY • link 11.1 years ago by Dan ▴ 540

Login before adding your answer.