I have GFF3 files (PASA EST mappings and exonerate protein2genome mappings) and would like to use them as an Augustus training set. The problem is that Augustus requires GenBank format. Before I start writing my own converter: is there an existing one that anyone can recommend?
This example script uses Biopython and the in-development Python GFF parser. It is essentially the same approach as BioPerl: GFF + FASTA -> generic SeqFeatures -> GenBank. Depending on how cleanly the GFF is formatted, this should at least give you a close representation that can be post-processed into exactly what you need:
"""Convert a GFF and associated FASTA file into GenBank format. Usage: gff_to_genbank.py <GFF annotation file> <FASTA sequence file> """ import sys import os from Bio import SeqIO from Bio.Alphabet import generic_dna from BCBio import GFF def main(gff_file, fasta_file): out_file = "%s.gb" % os.path.splitext(gff_file) fasta_input = SeqIO.to_dict(SeqIO.parse(fasta_file, "fasta", generic_dna)) gff_iter = GFF.parse(gff_file, fasta_input) SeqIO.write(gff_iter, out_file, "genbank") if __name__ == "__main__": main(*sys.argv[1:])
I can't suggest a finished solution, but BioPerl has some components that may help.
There is a GFF3 parser in Bio::Tools::GFF that allows creation of Bio::SeqFeatureI objects from GFF3. You'd still need to map the SO terms in the GFF to GenBank feature types.
Bio::SeqFeature::Tools::TypeMapper provides a mapper in the other direction (GenBank to GFF3), but it can be given a custom mapping. You can probably invert its built-in mapping and feed the result back to it, giving you a mapper that goes from SO terms to GenBank feature types. Afterwards, you can use Bio::SeqIO to write GenBank output.
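The inversion idea is language-independent, so here is a rough sketch of it in Python rather than Perl. The `genbank_to_so` table is an illustrative subset I made up, not the table shipped with TypeMapper, and `invert_mapping` is a hypothetical helper. The point it shows is that several GenBank types can share one SO term, so the inverse needs a rule for collisions.

```python
def invert_mapping(forward):
    """Invert a {genbank_type: so_term} dict into {so_term: genbank_type}.

    When several GenBank types share one SO term, the first one seen wins
    and the others are reported so the choice can be reviewed by hand.
    """
    inverse = {}
    collisions = {}
    for genbank_type, so_term in forward.items():
        if so_term in inverse:
            collisions.setdefault(so_term, []).append(genbank_type)
        else:
            inverse[so_term] = genbank_type
    return inverse, collisions


# Illustrative subset only (an assumption), not TypeMapper's real table
genbank_to_so = {
    "gene": "gene",
    "mRNA": "mRNA",
    "CDS": "CDS",
    "5'UTR": "five_prime_UTR",
    "3'UTR": "three_prime_UTR",
    "misc_feature": "region",
    "source": "region",
}

so_to_genbank, collisions = invert_mapping(genbank_to_so)
print(so_to_genbank)
print(collisions)
```

Keeping the first match is only one possible policy; for a training set it is probably worth checking the reported collisions manually instead of trusting the default.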
Here's an attempt at the solution from the BioPerl mailing list