Question: Gff3 + Fasta To Genbank (Augustus Training Set)
12
gravatar for Darked89
8.4 years ago by
Darked894.2k
Barcelona, Spain
Darked894.2k wrote:

I got gff3 files (PASA ESTs mapping, exonerate protein2genome mapping) and would like to use it as an Augustus training set. Problem is, Augustus requires Genbank format. Before starting to write my own converter: are there any available already one can recommend?

gff genbank conversion • 18k views
ADD COMMENTlink written 8.4 years ago by Darked894.2k
7
gravatar for Brad Chapman
8.4 years ago by
Brad Chapman9.3k
Boston, MA
Brad Chapman9.3k wrote:

This example script uses Biopython and the in-development python GFF parser. This is essentially the same approach as BioPerl: GFF + Fasta -> generic SeqFeatures -> GenBank. Depending on how nicely the GFF is formatted, this should at least give you a close representation that can be post-processed to exactly what you need:

"""Convert a GFF and associated FASTA file into GenBank format.

Usage:
gff_to_genbank.py <GFF annotation file> <FASTA sequence file>
"""
import sys
import os

from Bio import SeqIO
from Bio.Alphabet import generic_dna
from BCBio import GFF

def main(gff_file, fasta_file):
    out_file = "%s.gb" % os.path.splitext(gff_file)[0]
    fasta_input = SeqIO.to_dict(SeqIO.parse(fasta_file, "fasta", generic_dna))
    gff_iter = GFF.parse(gff_file, fasta_input)
    SeqIO.write(gff_iter, out_file, "genbank")

if __name__ == "__main__":
    main(*sys.argv[1:])
ADD COMMENTlink written 8.4 years ago by Brad Chapman9.3k

Many thanks, you saved my day. Works like a charm [Python 2.6.4 + Biopython 1.55] for producing Genbank file. Did not work with Biopython 1.53. I will check how Augustus likes it and keep you posted.

ADD REPLYlink written 8.4 years ago by Darked894.2k

PASA maps ESTs to a whole genome (many contigs). One can limit the size of the file by selecting as fasta input only these contigs to which there is a PASA match. awk/uniq + biopython did the work for me. Going further would be to create some mini-genes containing the matches + some flanks, but that may or may not be desired for Augustus training.

ADD REPLYlink written 8.4 years ago by Darked894.2k

Awesome, glad that worked. Yes, this will require a recent Biopython as the code passes filenames directly to SeqIO; it's a good idea to update anyways to get the latest fixes. Limiting the GFF and fasta files is the right way to go. This is a memory hungry implementation. With some care and GFF files organized by record names, you could build an iterated version that works one record at a time.

ADD REPLYlink written 8.4 years ago by Brad Chapman9.3k

AUGUSTUS complains about no 'source' info in Genbank file. Fixed the numbers using biopython. Next had to replace SOURCE with source (Augustus bug), finally one has to have 'CDS' or 'mRNA' for training. Replacing 'cDNA_match' with 'mRNA' does not fix it. Apparently it requires join( (1083..1379, 1503..1595,1865..1930) entry.

ADD REPLYlink written 8.4 years ago by Darked894.2k

Darek -- yes, it will just translate names directly from the original GFF file, so you may have to adjust if they don't match what Augustus wants. It should build join entries if the GFF has nested Parent/Child features. If you post a GFF and fasta example that should help to see what is going on.

ADD REPLYlink written 8.4 years ago by Brad Chapman9.3k

Sorry for not answering sooner: viruses got me. Problem is, at this stage all sequence data is still confidential, so I will rerun PASA with some known species next week. While there is a working script in AUGUSTUS (see gvj answer), there is still a place for python solution.

ADD REPLYlink written 8.4 years ago by Darked894.2k

i think this would be a good script to put on the example gff parse page for biopython: http://biopython.org/wiki/GFF_Parsing

ADD REPLYlink written 6.9 years ago by Zach Powers340

Zach, a generalized version of this script is included with the GFF library: https://github.com/chapmanb/bcbb/blob/master/gff/Scripts/gff/gff_to_genbank.py If you'd like to add details to the wiki documentation that would be very welcome. Thanks so much for the feedback.

ADD REPLYlink written 6.9 years ago by Brad Chapman9.3k

AUGUSTUS needs a non-protein-coding flanking region around the gene in the genbank file. I don't see this being produced in the python script.

ADD REPLYlink written 6.3 years ago by katharina.hoff70
4
gravatar for Gvj
8.4 years ago by
Gvj440
Netherlands
Gvj440 wrote:

Augustus comes with a script for this gff2gbSmallDNA.pl). you can find it under ~Augustus/script/

ADD COMMENTlink written 8.4 years ago by Gvj440
3
gravatar for iw9oel_ad
8.4 years ago by
iw9oel_ad6.0k
iw9oel_ad6.0k wrote:

I can't suggest a finished solution, but Bioperl has some components that may help.

There is a GFF3 parser in Bio::Tools::GFF that will allow creation of Bio::SeqFeatureI from GFF3. You'd still need to map the SO terms in the GFF to Genbank.

Bio::SeqFeature::Tools::TypeMapper provides a mapper the other way (Genbank to GFF3), but it can be given a custom mapping. You can probably invert its own mapping and feed it back to it, to make a mapper that goes from SO terms to Genbank feature types. Afterwards, you can use Bio::SeqIO to output Genbank.

ADD COMMENTlink written 8.4 years ago by iw9oel_ad6.0k
3
gravatar for Chase Miller
8.4 years ago by
Chase Miller400
University of Utah
Chase Miller400 wrote:

Here's an attempt at the solution from the BioPerl mailing list

http://bioperl.org/pipermail/bioperl-l/2005-February/018208.html

ADD COMMENTlink written 8.4 years ago by Chase Miller400
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 963 users visited in the last hour