Error during initial AUGUSTUS training for gene prediction in a non-model organism
4.5 years ago

Hi all,

I am trying to train a model for gene prediction in a non-model plant species, using the Arabidopsis thaliana data set as a trial run. I am following this tutorial step by step.

Steps followed so far:

(1) Download the Arabidopsis data, as provided by the tutorial, to use as an example set:

wget -c ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/735/GCF_000001735.3_TAIR10/GCF_000001735.3_TAIR10_genomic.gbff.gz
gunzip GCF_000001735.3_TAIR10_genomic.gbff.gz


(2) Randomly split the set of annotated sequences into a training set and a test set.

randomSplit.pl GCF_000001735.3_TAIR10_genomic.gbff 4


NOTE: I know that 4 is an extremely low number and that a training set should contain at least 200 genes. I only want to see which steps need to be executed before I run the same workflow on the actual data set.

(3) Create the files for training "my_genome" from a template.

new_species.pl --species=my_genome


(4) Make initial training set

etraining --species=my_genome GCF_000001735.3_TAIR10_genomic.gbff.train


This step fails with the following errors:

Constructing GenBank feature: Feature begins after it ends: 9388571,9389420..9390450
GBProcessor::getGeneList(): GBFeature constructor:Format error when reading genbank format.
Encountered error after reading 0 annotations.
Constructing GenBank feature: Feature begins after it ends: 1828296,1828395..1828689,1829291..1829438,1829624..1830211
GBProcessor::getGeneList(): GBFeature constructor:Format error when reading genbank format.
Encountered error after reading 0 annotations.
CDS contains character c
GBProcessor::getGeneList(): GBProcessor::getJoin( ):  failed!!!
Encountered error after reading 0 annotations.

/augustus-3.2.3/bin/etraining: ERROR
No genbank sequences found.
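
To see which kinds of errors dominate once a larger input is used, the output above can be saved to a file and tallied. This is only a rough sketch: the function name and the log file name are hypothetical, and it matches only the two message patterns seen in this run.

```shell
# Hypothetical helper: tally the kinds of GenBank parsing errors in a saved
# etraining log. The log file name is an assumption, not an AUGUSTUS convention.
summarize_etraining_errors() {
    # Keep only the two message kinds seen above, strip the trailing
    # coordinates so identical kinds group together, then count each kind.
    grep -E 'Feature begins after it ends|CDS contains character' "$1" \
        | sed -E 's/: [0-9,.]+$//' \
        | sort | uniq -c | sort -rn
}
```

For example, after running `etraining --species=my_genome GCF_000001735.3_TAIR10_genomic.gbff.train > etraining.log 2>&1`, calling `summarize_etraining_errors etraining.log` prints one line per error kind with its count.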


Question:

I am only running the demo data set, which I expected to run without any issue. The message "CDS contains character c" is especially confusing. Any clues?

EDIT 1: There are indeed sequences in the genbank file

grep "^LOCUS" GCF_000001735.3_TAIR10_genomic.gbff* -c
GCF_000001735.3_TAIR10_genomic.gbff:7
GCF_000001735.3_TAIR10_genomic.gbff.test:4
GCF_000001735.3_TAIR10_genomic.gbff.train:3
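
A similar quick check can look for the location patterns that appear in the error messages. This is a guess at the cause, not a confirmed diagnosis: the first error shows a location with a single-base segment (`9388571,` with no `..`), and the "character c" message might come from a `complement(` nested inside a `join(`. The function names below are hypothetical, and wrapped continuation lines are not inspected.

```shell
# Hypothetical helper: count feature lines whose location has a single-base
# segment, i.e. a number sitting directly between "(" or "," and "," or ")".
count_single_base_segments() {
    # Feature key lines start with exactly five spaces followed by the key
    # (CDS, mRNA, ...); header and qualifier lines are indented differently.
    grep -cE '^ {5}[^ ]+ +.*[(,][0-9]+[,)]' "$1"
}

# Hypothetical helper: count feature lines written as join(complement(...),...)
# rather than complement(join(...)). That this form causes the
# "CDS contains character c" message is an assumption.
count_join_of_complements() {
    grep -cE '^ {5}[^ ]+ +join\(complement\(' "$1"
}
```

Running both on `GCF_000001735.3_TAIR10_genomic.gbff.train` gives a rough idea of how many features use these forms; a count of zero would suggest the cause lies elsewhere.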

gene prediction augustus genbank nonmodel • 2.8k views

Hi,

I am having the same problem. Have you figured out how to solve it?

Thank you so much in advance,

Cristina Osuna


Hi Cristina

No, the problem remains the same. What is your organism? What files do you have?

~Vijay


Unfortunately, I could not.


I did the AUGUSTUS training a little differently, so it is working now. Thanks!


Hi, I am currently annotating some genomes. I have the same problem you had. I know your post is a little dated, but do you remember how you solved it? Thanks :)

3.3 years ago
bowwow ▴ 10

Have you seen this: https://github.com/tseemann/prokka/issues/32


No, I haven't checked.