Question: Error during initial training with augustus for gene prediction in a non-model organism
2.4 years ago by lakhujanivijay (4.5k), India, wrote:

Hi all,

I am trying to train a gene prediction model for a non-model plant species, using the data set from Arabidopsis thaliana. I am following this tutorial and its steps:

Steps followed so far:

(1) Download the Arabidopsis data provided by the tutorial as an example set:

wget -c ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/735/GCF_000001735.3_TAIR10/GCF_000001735.3_TAIR10_genomic.gbff.gz

(2) Randomly split the set of annotated sequences into a training set and a test set.

randomSplit.pl GCF_000001735.3_TAIR10_genomic.gbff 4

NOTE: I know that 4 is an extremely low number and that a training set should contain at least 200 genes; I only want to see which steps need to be executed before I run the same procedure on the actual data set.
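Step (1) downloads a gzip-compressed file (.gbff.gz), while step (2) reads the uncompressed .gbff file, so a decompression step sits between them. A minimal sketch of that step, run on a tiny stand-in file (the toy file name and content are hypothetical, not real TAIR10 data):

```shell
# Build a tiny stand-in for the downloaded archive (hypothetical content)
workdir=$(mktemp -d)
printf 'LOCUS       toy_record\n//\n' > "$workdir/toy_genomic.gbff"
gzip "$workdir/toy_genomic.gbff"    # leaves only toy_genomic.gbff.gz, like the download

# Decompress to a new file while keeping the .gz copy
gunzip -c "$workdir/toy_genomic.gbff.gz" > "$workdir/toy_genomic.gbff"

# Confirm the uncompressed file contains records
grep -c '^LOCUS' "$workdir/toy_genomic.gbff"
```

For the real download, the same `gunzip -c` call would take GCF_000001735.3_TAIR10_genomic.gbff.gz as input and write GCF_000001735.3_TAIR10_genomic.gbff.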

(3) Create the files for training "my_genome" from a template.

new_species.pl --species=my_genome

(4) Make an initial training:

etraining --species=my_genome GCF_000001735.3_TAIR10_genomic.gbff.train

This step fails with the following errors:

Constructing GenBank feature: Feature begins after it ends: 9388571,9389420..9390450
GBProcessor::getGeneList(): GBFeature constructor:Format error when reading genbank format.
Encountered error after reading 0 annotations.
Constructing GenBank feature: Feature begins after it ends: 1828296,1828395..1828689,1829291..1829438,1829624..1830211
GBProcessor::getGeneList(): GBFeature constructor:Format error when reading genbank format.
Encountered error after reading 0 annotations.
CDS contains character c
GBProcessor::getGeneList(): GBProcessor::getJoin( ):  failed!!!
Encountered error after reading 0 annotations.

/augustus-3.2.3/bin/etraining: ERROR
    No genbank sequences found.
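The first two messages quote location strings whose first segment is a bare position with no start..end range (9388571 and 1828296). Whether that is what trips the parser is not established in this thread, but such locations are easy to list for inspection. The sketch below runs on a toy feature table; the first location is copied from the error message and wrapped in join(...) as an assumption about how it appears in the file, and the other two lines are made up for contrast:

```shell
# Toy feature table (hypothetical, see the assumptions above)
gb=$(mktemp)
cat > "$gb" <<'EOF'
     CDS             join(9388571,9389420..9390450)
     CDS             join(100..200,300..400)
     CDS             complement(join(500..600,700..800))
EOF

# A bare number enclosed by "(" or "," on the left and "," or ")" on the right
# is a segment written as a single position instead of start..end
single_base=$(grep -cE '[(,][0-9]+[,)]' "$gb")
echo "lines with a single-position segment: $single_base"

# Lines whose location is on the reverse strand
comp=$(grep -c 'complement(' "$gb")
echo "lines using complement(): $comp"
```

On a real file, replacing `-c` with `-n` prints the matching lines with their line numbers. The pattern can also match numbers inside qualifier text such as notes, so hits are candidates to check by eye rather than confirmed errors.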

Question:

I am only running the demo data set, which should run without any issues. The message "CDS contains character c" is quite confusing. Any clues?

EDIT 1: There are indeed sequences in the GenBank files:

grep "^LOCUS" GCF_000001735.3_TAIR10_genomic.gbff* -c
GCF_000001735.3_TAIR10_genomic.gbff:7
GCF_000001735.3_TAIR10_genomic.gbff.test:4
GCF_000001735.3_TAIR10_genomic.gbff.train:3
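The LOCUS counts show that records exist, but not how many gene annotations each split file carries. A complementary check, sketched here on toy stand-ins for the .train and .test files (file names and content are hypothetical), counts CDS features next to LOCUS records. It assumes the usual GenBank layout in which feature keys are indented by five spaces:

```shell
# Toy stand-ins for the split files (hypothetical content)
dir=$(mktemp -d)
printf 'LOCUS       chrA\n     CDS             join(1..10,20..30)\n     CDS             40..90\n//\n' > "$dir/toy.gbff.train"
printf 'LOCUS       chrB\n     CDS             5..50\n//\n' > "$dir/toy.gbff.test"

# Report records and CDS features per file
for f in "$dir"/toy.gbff.train "$dir"/toy.gbff.test; do
  n_locus=$(grep -c '^LOCUS' "$f")
  n_cds=$(grep -c '^     CDS ' "$f")
  echo "$(basename "$f"): $n_locus LOCUS, $n_cds CDS"
done
```

A file with a handful of LOCUS records but thousands of CDS features would indicate whole-chromosome records rather than one record per gene.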
modified 12 months ago by smrutimayipanda (10) • written 2.4 years ago by lakhujanivijay (4.5k)

Hi,

I am having the same problem. Did you figure out how to solve it?

Thank you so much in advance,

Cristina Osuna

modified 2.3 years ago • written 2.3 years ago by cristina.osuna.cruz (10)

Hi Cristina,

No, the problem remains the same. What is your organism? What files do you have?

~Vijay

written 2.3 years ago by lakhujanivijay (4.5k)

Hi, I am getting the same problem. Could you please help me out if you have solved it?

written 13 months ago by smrutimayipanda (10)

Unfortunately, I could not solve it.

written 13 months ago by lakhujanivijay (4.5k)

I did the augustus training a little differently, so it is working now. Thanks!

written 12 months ago by smrutimayipanda (10)

13 months ago by bowwow (0), Australia, wrote:

Have you seen this: https://github.com/tseemann/prokka/issues/32

written 13 months ago by bowwow (0)

No, I haven't checked.

written 12 months ago by smrutimayipanda (10)