Question

Error encountered during initial training with AUGUSTUS for gene prediction in a non-model organism

0

6.8 years ago

lakhujanivijay 5.8k

Hi all,

I am trying to train a gene prediction model for a non-model plant species using the Arabidopsis thaliana data set. I am following this tutorial and trying to work through its steps:

Steps followed so far:

(1) Download the Arabidopsis data provided by the tutorial; this is an example set:

wget -c ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/735/GCF_000001735.3_TAIR10/GCF_000001735.3_TAIR10_genomic.gbff.gz

(2) Randomly split the set of annotated sequences into a training set and a test set.

randomSplit.pl GCF_000001735.3_TAIR10_genomic.gbff 4

NOTE: I know that 4 is an extremely low number and that a training set should contain at least 200 genes; I only want to see which steps need to be executed before I run the same workflow on the actual data set.

(3) Create the files for training "my_genome" from a template.

new_species.pl --species=my_genome

(4) Make initial training set

etraining --species=my_genome GCF_000001735.3_TAIR10_genomic.gbff.train

This step fails with the following error:

Constructing GenBank feature: Feature begins after it ends: 9388571,9389420..9390450
GBProcessor::getGeneList(): GBFeature constructor:Format error when reading genbank format.
Encountered error after reading 0 annotations.
Constructing GenBank feature: Feature begins after it ends: 1828296,1828395..1828689,1829291..1829438,1829624..1830211
GBProcessor::getGeneList(): GBFeature constructor:Format error when reading genbank format.
Encountered error after reading 0 annotations.
CDS contains character c
GBProcessor::getGeneList(): GBProcessor::getJoin( ):  failed!!!
Encountered error after reading 0 annotations.

/augustus-3.2.3/bin/etraining: ERROR
    No genbank sequences found.

Question:

I am just running the demo data set, which is expected to run without any issues. The message "CDS contains character c" is quite confusing. Any clues?
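The first two messages quote location strings whose leading segment is a single coordinate with no `..` range (9388571 and 1828296). As a rough diagnostic, and not a confirmed explanation of the failure, the shell sketch below lists CDS locations in a GenBank flat file that contain such single-position segments. The function names `bare_segments` and `scan_cds_locations` are made up for illustration, and the scan only reads the first line of each CDS location, so locations wrapped over several lines are only partly checked.

```shell
# Print the single-position segments (no "..") found in one GenBank
# location string, one per line. Operators such as join(...) and
# complement(...) and the partial markers < and > are stripped first.
bare_segments() {
    printf '%s\n' "$1" | tr -d 'a-z()<>' | tr ',' '\n' | grep -E '^[0-9]+$'
}

# Hypothetical scan: for every CDS feature line in the file, print the
# location and any single-position segments it contains (tab-separated).
# Assumption: only the first line of a wrapped location is inspected.
scan_cds_locations() {
    awk '$1 == "CDS" { print $2 }' "$1" | while IFS= read -r loc; do
        hits=$(bare_segments "$loc" | paste -s -d ' ' -)
        if [ -n "$hits" ]; then
            printf '%s\t%s\n' "$loc" "$hits"
        fi
    done
    return 0
}
```

It could be run as `scan_cds_locations GCF_000001735.3_TAIR10_genomic.gbff.train`; if the coordinates from the error messages show up in the output, that would at least confirm which features the parser is stumbling over.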

EDIT 1: There are indeed sequences in the GenBank files:

grep "^LOCUS" GCF_000001735.3_TAIR10_genomic.gbff* -c
GCF_000001735.3_TAIR10_genomic.gbff:7
GCF_000001735.3_TAIR10_genomic.gbff.test:4
GCF_000001735.3_TAIR10_genomic.gbff.train:3
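The `grep` above counts LOCUS records only. Since `etraining` reports reading 0 annotations, it may also be worth counting CDS features per file. This is a minimal sketch, assuming the standard GenBank flat-file layout in which feature keys start after five spaces; `gb_summary` is a hypothetical helper name.

```shell
# Summarise GenBank flat files: number of LOCUS records and CDS features.
# "|| true" keeps the function going when grep finds no match
# (grep -c still prints 0 in that case but exits with status 1).
gb_summary() {
    for f in "$@"; do
        loci=$(grep -c '^LOCUS' "$f" || true)
        cds=$(grep -c '^     CDS ' "$f" || true)
        printf '%s\tLOCUS=%s\tCDS=%s\n' "$f" "$loci" "$cds"
    done
}
```

For example, `gb_summary GCF_000001735.3_TAIR10_genomic.gbff*` would print one line per file for the full set, the test split, and the training split.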

gene prediction augustus genbank nonmodel • 4.0k views

ADD COMMENT • link updated 2.4 years ago by Dorine ▴ 20 • written 6.8 years ago by lakhujanivijay 5.8k

0

Hi,

I am having the same problem. Did you already figure out how to solve it?

Thank you so much in advance,

Cristina Osuna

ADD REPLY • link 6.7 years ago by cristina.osuna.cruz ▴ 10

0

Hi Cristina

No, the problem remains the same. What is your organism? What files do you have?

~Vijay

ADD REPLY • link 6.7 years ago by lakhujanivijay 5.8k

0

Hi, I am getting the same problem. Could you please help me out if you have solved it?

ADD REPLY • link 5.5 years ago by smrutimayipanda ▴ 20

0

Unfortunately, I could not.

ADD REPLY • link 5.5 years ago by lakhujanivijay 5.8k

0

I did the AUGUSTUS training a little differently, so it is working now. Thanks!

ADD REPLY • link 5.5 years ago by smrutimayipanda ▴ 20

2

Hi, I am currently annotating some genomes. I have the same problem you had. I know your post is a little dated, but do you remember how you solved it? Thanks :)

ADD REPLY • link 2.4 years ago by Dorine ▴ 20

score 0 · Answer 1 · 2018-10-19

0

5.5 years ago

bowwow ▴ 10

Have you seen this: https://github.com/tseemann/prokka/issues/32

ADD COMMENT • link 5.5 years ago by bowwow ▴ 10

0

No, I haven't checked.

ADD REPLY • link 5.5 years ago by smrutimayipanda ▴ 20