I am trying to train a gene prediction model for a non-model plant species using the Arabidopsis thaliana data set, and I am following the steps in this tutorial.
Steps followed so far:
(1) Download the Arabidopsis data that the tutorial provides as an example set:
wget -c ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/735/GCF_000001735.3_TAIR10/GCF_000001735.3_TAIR10_genomic.gbff.gz
(2) Randomly split the set of annotated sequences into a training set and a test set:
randomSplit.pl GCF_000001735.3_TAIR10_genomic.gbff 4
NOTE: I know that 4 is an extremely low number and that a training set should contain at least 200 genes; I only want to see which steps need to be executed before I run the same procedure on the actual data set.
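For readers unfamiliar with this step, here is a minimal Python sketch of the idea behind such a split: records in a GenBank flat file end with a line containing only `//`, and a fixed number of them are drawn at random for the test set while the rest go to the training set. This is an illustration only and not the implementation of randomSplit.pl; the function name `split_genbank_records` and the `seed` parameter are hypothetical.

```python
import random


def split_genbank_records(text, n_test, seed=None):
    """Split GenBank flat-file text into (test, train) lists of records.

    Hypothetical helper for illustration; not the randomSplit.pl code.
    """
    # Each GenBank record is terminated by a line containing only "//".
    records, current = [], []
    for line in text.splitlines(keepends=True):
        current.append(line)
        if line.strip() == "//":
            records.append("".join(current))
            current = []
    if n_test > len(records):
        raise ValueError("n_test exceeds the number of records")
    # Draw n_test record indices at random; the remainder is the training set.
    rng = random.Random(seed)
    test_idx = set(rng.sample(range(len(records)), n_test))
    test = [r for i, r in enumerate(records) if i in test_idx]
    train = [r for i, r in enumerate(records) if i not in test_idx]
    return test, train


if __name__ == "__main__":
    # Seven dummy records, mirroring the seven LOCUS entries in the demo file.
    demo = "".join(f"LOCUS       rec{i}\n//\n" for i in range(7))
    test, train = split_genbank_records(demo, 4, seed=0)
    print(len(test), len(train))  # 4 3
```

With 7 records and a test size of 4, the training set is left with 3 records, which is the same split the `grep` counts in EDIT 1 show.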
(3) Create the files for training "my_genome" from a template.
(4) Make initial training set
etraining --species=my_genome GCF_000001735.3_TAIR10_genomic.gbff.train
This step fails with the following errors:
Constructing GenBank feature: Feature begins after it ends: 9388571,9389420..9390450
GBProcessor::getGeneList(): GBFeature constructor:Format error when reading genbank format.
Encountered error after reading 0 annotations.
Constructing GenBank feature: Feature begins after it ends: 1828296,1828395..1828689,1829291..1829438,1829624..1830211
GBProcessor::getGeneList(): GBFeature constructor:Format error when reading genbank format.
Encountered error after reading 0 annotations.
CDS contains character c
GBProcessor::getGeneList(): GBProcessor::getJoin( ): failed!!!
Encountered error after reading 0 annotations.
/augustus-3.2.3/bin/etraining: ERROR
No genbank sequences found.
I am only running the demo data set, which is expected to run without any issues. The message "CDS contains character c" is quite confusing. Any clues?
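One pattern visible in the first two error messages is a location segment that is a single coordinate rather than a `start..end` range (for example `9388571` in `9388571,9389420..9390450`). Below is a rough Python sketch for listing CDS features in a GenBank file that contain such segments. Whether these segments are what actually triggers the error in etraining is an assumption, and the function name `single_base_cds_segments` is hypothetical.

```python
import re


def single_base_cds_segments(lines):
    """Yield (location, segment) for CDS segments that lack a 'start..end' range.

    Hypothetical diagnostic helper; it only inspects the feature table text.
    """
    lines = list(lines)
    i = 0
    while i < len(lines):
        # Feature keys start in column 6; the location follows the key.
        m = re.match(r"^ {5}CDS +(\S+)", lines[i])
        i += 1
        if not m:
            continue
        location = m.group(1)
        # Long locations wrap onto continuation lines (21 leading spaces)
        # until the first qualifier line, which starts with "/".
        while (i < len(lines) and lines[i].startswith(" " * 21)
               and not lines[i].lstrip().startswith("/")):
            location += lines[i].strip()
            i += 1
        # Drop operators and brackets, keeping the comma-separated segments.
        bare = re.sub(r"complement|join|order|[()<>]", "", location)
        for seg in bare.split(","):
            if seg and ".." not in seg:
                yield location, seg


if __name__ == "__main__":
    # Coordinates taken from the error message; the qualifier is made up.
    sample = [
        "     CDS             join(9388571,9389420..9390450)",
        '                     /gene="EXAMPLE"',
    ]
    for location, seg in single_base_cds_segments(sample):
        print(location, seg)
```

To scan the real file, the lines can be passed from an open file handle, for example `single_base_cds_segments(open("GCF_000001735.3_TAIR10_genomic.gbff.train"))`.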
EDIT 1: There are indeed sequences in the GenBank files:
grep "^LOCUS" GCF_000001735.3_TAIR10_genomic.gbff* -c
GCF_000001735.3_TAIR10_genomic.gbff:7
GCF_000001735.3_TAIR10_genomic.gbff.test:4
GCF_000001735.3_TAIR10_genomic.gbff.train:3