I am working on assembling and annotating the genome of a non-model organism, and I have a set of about 3k genes from this genome that I am using to train my ab initio gene predictors. For Augustus, I am following the training procedure documented on this page. I converted the data to GenBank format and split the data into a training set and a test set, each containing 1.5k annotated sequences. After making the appropriate parameter/config files for this species, I launched the optimize_augustus.pl script with the 1.5k training sequences.
The page includes the caveat that this script likely has to run overnight. However, it has been running for over 2 days now and shows no sign of stopping. I'm guessing it is taking so long because of the number of training sequences I have: the documentation recommends about 200 genes, whereas I have 1.5k, more than 7 times that. Is this intuition correct? What runtimes have you had when training Augustus?
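For illustration, the kind of train/test split described above could be sketched in Python roughly as follows. This is only a sketch, not the procedure actually used here: it assumes a multi-record GenBank file in which every record ends with a `//` line, and the function names (`split_genbank_records`, `random_split`) and the toy records are hypothetical. Augustus also ships its own `randomSplit.pl` script for this purpose. The same idea could be used to draw a smaller subset (for example about 200 genes) to check whether the number of training sequences is what drives the runtime.

```python
import random


def split_genbank_records(text):
    """Split multi-record GenBank text into a list of records.

    Assumption: each record is terminated by a line containing only '//'.
    Any trailing text without a terminator is ignored.
    """
    records, current = [], []
    for line in text.splitlines(keepends=True):
        current.append(line)
        if line.strip() == "//":
            records.append("".join(current))
            current = []
    return records


def random_split(records, n_test, seed=0):
    """Return (train, test) lists, with n_test records chosen at random."""
    if not 0 <= n_test <= len(records):
        raise ValueError("n_test must be between 0 and the number of records")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    return shuffled[n_test:], shuffled[:n_test]


# Toy input standing in for a real GenBank file (hypothetical locus names).
toy_text = "".join(
    f"LOCUS       gene{i}   100 bp  DNA\nORIGIN\n        1 acgtacgtac\n//\n"
    for i in range(6)
)

all_records = split_genbank_records(toy_text)
train, test = random_split(all_records, n_test=3, seed=42)
print(len(all_records), len(train), len(test))
```

With real data, the text would be read from the GenBank file and the two lists written back out to separate files before running the optimization on the training file.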