Question: Error encountered during initial training with AUGUSTUS for gene prediction in a non-model organism
Asked 17 months ago by Vijay Lakhujani (India):

Hi all,

I am trying to train a model for gene prediction in a non-model plant species using the data set from Arabidopsis thaliana. I am following this tutorial.

Steps followed so far:

(1) Download the Arabidopsis data, which the tutorial provides as an example set:

wget -c ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/735/GCF_000001735.3_TAIR10/GCF_000001735.3_TAIR10_genomic.gbff.gz

(2) Randomly split the set of annotated sequences into a training set and a test set.

randomSplit.pl GCF_000001735.3_TAIR10_genomic.gbff 4

NOTE: I know that 4 is an extremely low number and that at least 200 genes should be used as a training set; I only want to see which steps need to be executed before I run the same procedure on the actual data set.

(3) Create the files for training "my_genome" from a template.

new_species.pl --species=my_genome

(4) Make initial training set

etraining --species=my_genome GCF_000001735.3_TAIR10_genomic.gbff.train

This step fails with the following errors:

Constructing GenBank feature: Feature begins after it ends: 9388571,9389420..9390450
GBProcessor::getGeneList(): GBFeature constructor:Format error when reading genbank format.
Encountered error after reading 0 annotations.
Constructing GenBank feature: Feature begins after it ends: 1828296,1828395..1828689,1829291..1829438,1829624..1830211
GBProcessor::getGeneList(): GBFeature constructor:Format error when reading genbank format.
Encountered error after reading 0 annotations.
CDS contains character c
GBProcessor::getGeneList(): GBProcessor::getJoin( ):  failed!!!
Encountered error after reading 0 annotations.

/augustus-3.2.3/bin/etraining: ERROR
    No genbank sequences found.
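
The coordinates quoted in the messages can be used to look up the features that the parser rejects. Below is a minimal sketch using only grep; the helper name `show_feature` is hypothetical (it is not part of AUGUSTUS), and it assumes that each `start..end` pair sits on a single line of the GenBank file.

```shell
# show_feature is a hypothetical helper name, not part of AUGUSTUS.
# Usage: show_feature COORDINATES FILE
show_feature() {
    # -F: treat the coordinates as a fixed string, so ".." is not a regex
    # -n: print line numbers; -B1/-A1: one line of context on each side
    grep -n -F -B1 -A1 -- "$1" "$2"
}

train=GCF_000001735.3_TAIR10_genomic.gbff.train
if [ -f "$train" ]; then
    # Coordinate pairs copied from the two "Feature begins after it ends" messages
    show_feature '9389420..9390450' "$train"
    show_feature '1828395..1828689' "$train"
fi
```

Looking at the full location string of each hit (for example the complete `join(...)` expression) may show what differs from the features that are read without complaint.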

Question:

I am just running the demo data set, which is expected to run without any issue. The message "CDS contains character c" is quite confusing. Any clues?

EDIT 1: There are indeed sequences in the genbank file

grep "^LOCUS" GCF_000001735.3_TAIR10_genomic.gbff* -c
GCF_000001735.3_TAIR10_genomic.gbff:7
GCF_000001735.3_TAIR10_genomic.gbff.test:4
GCF_000001735.3_TAIR10_genomic.gbff.train:3
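
The LOCUS counts give the number of records, not the number of genes. A similar grep can count the CDS features in each file. This is only a sketch, assuming the standard GenBank flat-file layout in which feature keys are indented by five spaces; `count_cds` is a hypothetical helper name.

```shell
# count_cds is a hypothetical helper name, not part of AUGUSTUS.
# Usage: count_cds FILE
count_cds() {
    # In GenBank flat files, feature keys such as CDS start in column 6.
    grep -c '^     CDS ' "$1"
}

base=GCF_000001735.3_TAIR10_genomic.gbff
for f in "$base" "$base.test" "$base.train"; do
    [ -f "$f" ] || continue   # skip files that are not present
    printf '%s: %s CDS features\n' "$f" "$(count_cds "$f")"
done
```

Comparing these numbers with the LOCUS counts shows how many annotated coding sequences each record carries on average.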

Hi,

I am having the same problem. Did you figure out how to solve it?

Thank you so much in advance,

Cristina Osuna

Reply written 16 months ago by cristina.osuna.cruz

Hi Cristina

No, the problem remains the same. What is your organism? What files do you have?

~Vijay

Reply written 16 months ago by Vijay Lakhujani

Hi, I am getting the same problem. Could you please help me out if you have solved it?

Reply written 9 weeks ago by smrutimayipanda

Unfortunately, I could not.

Reply written 9 weeks ago by Vijay Lakhujani

I did the AUGUSTUS training a little differently, so it is working now. Thanks!

Reply written 6 weeks ago by smrutimayipanda

Answer written 8 weeks ago by bowwow (Australia):

Have you seen this: https://github.com/tseemann/prokka/issues/32


No, I haven't checked.

Reply written 6 weeks ago by smrutimayipanda