Error encountered while initial training with augustus for gene prediction of non model organism
4.5 years ago

Hi all,

I am trying to train a model for gene prediction in a non-model plant species, using the Arabidopsis thaliana data set as a test case. I am following this tutorial and its steps:

Steps followed so far:

(1) Download arabidopsis data, as provided by this tutorial; this is an example set:

wget -c ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/735/GCF_000001735.3_TAIR10/GCF_000001735.3_TAIR10_genomic.gbff.gz

(2) Randomly split the set of annotated sequences into a training set and a test set.

randomSplit.pl GCF_000001735.3_TAIR10_genomic.gbff 4

NOTE: I know that 4 is an extremely low number and that at least 200 genes should be used as a training set; I only want to see which steps need to be executed before I run the same workflow on the actual data set.

(3) Create the files for training "my_genome" from a template.

new_species.pl --species=my_genome

(4) Make initial training set

etraining --species=my_genome GCF_000001735.3_TAIR10_genomic.gbff.train

This step fails with the following error:

Constructing GenBank feature: Feature begins after it ends: 9388571,9389420..9390450
GBProcessor::getGeneList(): GBFeature constructor:Format error when reading genbank format.
Encountered error after reading 0 annotations.
Constructing GenBank feature: Feature begins after it ends: 1828296,1828395..1828689,1829291..1829438,1829624..1830211
GBProcessor::getGeneList(): GBFeature constructor:Format error when reading genbank format.
Encountered error after reading 0 annotations.
CDS contains character c
GBProcessor::getGeneList(): GBProcessor::getJoin( ):  failed!!!
Encountered error after reading 0 annotations.

/augustus-3.2.3/bin/etraining: ERROR
    No genbank sequences found.
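
One way to see which features trigger these messages is to scan the training file for CDS locations shaped like the ones in the log. The sketch below is a hypothetical helper (`find_odd_cds_locations` is not part of AUGUSTUS). It rests on the assumption that the parser stumbles over single-position segments such as `9388571,9389420..9390450` and over `complement(` nested inside `join(`, which may be where the stray character `c` comes from; neither cause is confirmed in this thread.

```shell
# Hypothetical diagnostic, not an AUGUSTUS tool.
# Prints "<line number>: <location>" for every CDS feature whose location
# contains a single-position segment or a complement() nested inside join().
find_odd_cds_locations() {
    # $1: GenBank flat file (e.g. the .train file produced by randomSplit.pl)
    awk '
        # Report the buffered location if it looks suspicious, then reset it.
        function emit() {
            if (loc != "" && (loc ~ /join\(.*complement\(/ || loc ~ /[(,][0-9]+[,)]/))
                print start ": " loc
            loc = ""
        }
        # A new CDS feature line: remember where it starts and its location.
        /^     CDS  / { emit(); start = NR; loc = $2; next }
        # Continuation of a multi-line location: 21 spaces, then no "/" qualifier.
        loc != "" && substr($0, 1, 21) ~ /^ +$/ && substr($0, 22, 1) != "/" {
            sub(/^ +/, ""); loc = loc $0; next
        }
        # Anything else ends the current location.
        { emit() }
        END { emit() }
    ' "$1"
}

# Usage (file name taken from the post):
# find_odd_cds_locations GCF_000001735.3_TAIR10_genomic.gbff.train | head
```

Ordinary minus-strand genes written as `complement(join(...))` are deliberately not reported, since only the nested form matches the assumption above.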

Question:

I am only running the demo data set, which I expected to work without any issue. The message "CDS contains character c" is quite confusing. Any clues?

EDIT 1: There are indeed sequences in the genbank file

grep "^LOCUS" GCF_000001735.3_TAIR10_genomic.gbff* -c
GCF_000001735.3_TAIR10_genomic.gbff:7
GCF_000001735.3_TAIR10_genomic.gbff.test:4
GCF_000001735.3_TAIR10_genomic.gbff.train:3
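
The counts above suggest that the split operates on whole `LOCUS` records rather than on individual genes: with 7 records (presumably the five chromosomes plus the two organelle genomes), the argument `4` leaves 4 records in the `.test` file and 3 in the `.train` file. A minimal sketch for comparing record and CDS-feature counts per file follows. `summarize_genbank` is a hypothetical helper, not an AUGUSTUS script, and it assumes standard GenBank flat-file indentation (feature keys starting at column 6).

```shell
# Hypothetical helper: print the number of LOCUS records and CDS features
# in a GenBank flat file as one tab-separated line.
summarize_genbank() {
    # $1: GenBank flat file
    # grep -c exits non-zero when the count is 0, so guard with "|| true".
    loci=$(grep -c '^LOCUS' "$1" || true)
    cds=$(grep -c '^     CDS  ' "$1" || true)
    printf '%s\tLOCUS=%s\tCDS=%s\n' "$1" "$loci" "$cds"
}

# Usage (file names taken from the post):
# for f in GCF_000001735.3_TAIR10_genomic.gbff*; do summarize_genbank "$f"; done
```

A large CDS count next to a single-digit LOCUS count would indicate that each record is a whole chromosome carrying many genes, not one gene per record.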
gene prediction augustus genbank nonmodel • 2.8k views

Hi,

I am having the same problem. Did you figure out how to solve it?

Thank you so much in advance,

Cristina Osuna


Hi Cristina

No, the problem remains the same. What is your organism? What files do you have?

~Vijay


Hi, I am getting the same problem. Could you please help me out if you have solved it?


Unfortunately, I could not.


I did the AUGUSTUS training a little differently, so it is working now. Thanks!


Hi, I am currently annotating some genomes. I have the same problem you had. I know your post is a little dated, but do you remember how you solved it? Thanks :)

3.3 years ago
bowwow ▴ 10

Have you seen this: https://github.com/tseemann/prokka/issues/32


No, I haven't checked.
