Question: Gff to genbank - feature is missing
0
gravatar for rororo
5 weeks ago by
rororo0
rororo0 wrote:

I want to predict genes from my organism of interest using augustus. I therefore have my assembly, as well as a gff3 file. I want to train augustus using my dataset using etraining. I therefore need a genbank file.

I found link with the following script to convert gff+fasta to genbank, since the provided gff2Smallgb.pl from augustus does not work:

#!/usr/bin/env python
"""Convert a GFF and associated FASTA file into GenBank format.

Usage:
    gff_to_genbank.py <GFF annotation="" file=""> <FASTA sequence="" file="">
"""
from __future__ import print_function

import sys
import os

from Bio import SeqIO
from Bio.Alphabet import generic_dna
from Bio import Seq

from BCBio import GFF

def main(gff_file, fasta_file):
    out_file = "%s.gb" % os.path.splitext(gff_file)[0]
    fasta_input = SeqIO.to_dict(SeqIO.parse(fasta_file, "fasta", generic_dna))
    gff_iter = GFF.parse(gff_file, fasta_input)
    SeqIO.write(_check_gff(_fix_ncbi_id(gff_iter)), out_file, "genbank")

def _fix_ncbi_id(fasta_iter):
    """GenBank identifiers can only be 16 characters; try to shorten NCBI.
    """
    for rec in fasta_iter:
        if lenrec.name) > 16 and rec.name.find("|") > 0:
            new_id = [x for x in rec.name.split("|") if x][-1]
            print("Warning: shortening NCBI name %s to %s" % rec.id, new_id))
            rec.id = new_id
            rec.name = new_id
        yield rec

def _check_gff(gff_iterator):
    """Check GFF files before feeding to SeqIO to be sure they have sequences.
    """
    for rec in gff_iterator:
        if isinstance(rec.seq, Seq.UnknownSeq):
            print("Warning: FASTA sequence not found for '%s' in GFF file" % (
                    rec.id))
            rec.seq.alphabet = generic_dna
        yield _flatten_features(rec)

def _flatten_features(rec):
    """Make sub_features in an input rec flat for output.

    GenBank does not handle nested features, so we want to make
    everything top level.
    """
    out = []
    for f in rec.features:
        cur = [f]
        while len(cur) > 0:
            nextf = []
            for curf in cur:
                out.append(curf)
                if len(curf.sub_features) > 0:
                    nextf.extend(curf.sub_features)
            cur = nextf
    rec.features = out
    return rec

if __name__ == "__main__":
    main(*sys.argv[1:])

Unfortunately, the feature source is not printed, so that augustus etraining logs the following error:

GBProcessor::getGeneList(): Sequence has 0 length. Maybe 'source' Feature missing?

A model genbank file is available here: NCBI genbank format

ADD COMMENTlink modified 5 weeks ago • written 5 weeks ago by rororo0

Isn't the source field originating in GFF file? GFF format says this:

source - name of the program that generated this feature, or the data source (database or project name)

So try something you like as source name in your GFF.

ADD REPLYlink written 5 weeks ago by genomax51k

source in the FEATURES table

FEATURES             Location/Qualifiers
     source          1..5028
                     /organism="Saccharomyces cerevisiae"
                     /db_xref="taxon:4932"
                     /chromosome="IX"
                     /map="9"
     CDS             <1..206
ADD REPLYlink written 5 weeks ago by rororo0

Then you will need to create that yourself. If you look in the model genbank page you linked above they provide some indication of what can go in the SOURCE section.

3.4.10 SOURCE Format

  The SOURCE field consists of two parts. The first part is found after
the SOURCE keyword and contains free-format information including an
abbreviated form of the organism name followed by a molecule type;
multiple lines are allowed, but the last line must end with a period.
The second part consists of information found after the ORGANISM
subkeyword. The formal scientific name for the source organism (genus
and species, where appropriate) is found on the same line as ORGANISM.
The records following the ORGANISM line list the taxonomic
classification levels, separated by semicolons and ending with a
period.
ADD REPLYlink written 5 weeks ago by genomax51k

I have 25,000 entries. So I need an automatic way!

ADD REPLYlink written 5 weeks ago by rororo0

If you wrote the script above then you should be able to do that, correct?

ADD REPLYlink written 5 weeks ago by genomax51k

sorry, I clearified the source!

ADD REPLYlink written 5 weeks ago by rororo0

seqret can be used to convert gff+fasta to Genbank?

seqret -sequence genome.fasta -feature -fformat gff -fopenfile annotation.gff -osformat genbank

Another alternative would be to open fasta file in Artemis -> Read GFF3 as an entry within it and save the whole thing as genbank file.

ADD REPLYlink modified 5 weeks ago • written 5 weeks ago by Sej Modha3.1k

thanks for your help. Unfortunately, seqret does not stop converting - the file just gets bigger and bigger. Artemis freezes when opening the gff-file

ADD REPLYlink written 5 weeks ago by rororo0

Gff3 + Fasta To Genbank (Augustus Training Set)

ADD REPLYlink written 5 weeks ago by Sej Modha3.1k

Script OP has referenced to came from a link in an answer from this thread.

ADD REPLYlink written 5 weeks ago by genomax51k

I see! Thanks genomax. @rororo have you tried the conversion file provided with Augustus as mentioned in the thread above? https://github.com/nextgenusfs/augustus/blob/master/scripts/gff2gbSmallDNA.pl

ADD REPLYlink written 5 weeks ago by Sej Modha3.1k

since the provided gff2Smallgb.pl from augustus does not work

Not sure if that is the same script but this is in the original thread.

ADD REPLYlink written 5 weeks ago by genomax51k

Not sure if that is the same script but this is in the original thread.

I think the original script is: https://github.com/chapmanb/bcbb/blob/master/gff/Scripts/gff/gff_to_genbank.py , not sure if OP has tried the script provided with Augustus.

ADD REPLYlink written 5 weeks ago by Sej Modha3.1k

this is the script I posted above

ADD REPLYlink written 5 weeks ago by rororo0

Unfortunately, seqret does not stop converting - the file just gets bigger and bigger.

If you have 25,000 entries then isn't that expected? You need to be patient at times. It may take some time to complete the conversion.

ADD REPLYlink written 5 weeks ago by genomax51k

as soon as the 25,000 entries are read and returned, seqret starts from the beginning and returns the same sequences again and again

ADD REPLYlink written 5 weeks ago by rororo0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 733 users visited in the last hour