Hi, has anyone trained AUGUSTUS? I'm specifically interested in the training quality, which you can estimate on a test gene set. My pipeline in R for choosing the training set is as follows (I use a GFF from GenBank):
1) From the GFF, keep all genes for which a product is defined. NO hypothetical or predicted proteins.
2) Remove all alternative transcripts.
3) Remove exon-less genes.
4) Check for mRNA overlaps (after adding 1000 bp flanks) and get rid of overlapping genes.
5) Eventually I decided to keep only genes with annotated UTRs (just >30 bp), as I got better results that way. I create the UTR features in the GFF myself.
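To make step 4 concrete, here is a minimal Python sketch of the overlap check. It is not the R code actually used; the function name `drop_overlapping`, the tuple layout, and the example coordinates are all made up for illustration. It assumes that the 1000 bp flank is added on both sides of every gene and that both members of an overlapping pair are discarded.

```python
from collections import defaultdict

def drop_overlapping(genes, flank=1000):
    """Keep only genes whose flank-extended span overlaps no other gene.

    genes: list of (seqid, start, end, gene_id) tuples (hypothetical layout),
    with 1-based inclusive coordinates as in GFF.
    """
    # Group flank-extended spans by sequence, since only genes on the
    # same sequence can overlap.
    by_seq = defaultdict(list)
    for seqid, start, end, gene_id in genes:
        by_seq[seqid].append((start - flank, end + flank, gene_id))

    # Simple pairwise comparison; fine for a sketch, slow for huge inputs.
    overlapping = set()
    for spans in by_seq.values():
        for i, (s1, e1, id1) in enumerate(spans):
            for s2, e2, id2 in spans[i + 1:]:
                if s1 <= e2 and s2 <= e1:  # closed intervals intersect
                    overlapping.update((id1, id2))

    return [g for g in genes if g[3] not in overlapping]

# Made-up example data.
genes = [
    ("chr1", 5000, 7000, "geneA"),
    ("chr1", 8500, 9500, "geneB"),    # flanked span overlaps geneA's
    ("chr1", 20000, 22000, "geneC"),
    ("chr2", 5000, 7000, "geneD"),    # different sequence, no overlap
]
kept_ids = [g[3] for g in drop_overlapping(genes)]
print(kept_ids)
```

With this example data only geneC and geneD survive, because the flanked spans of geneA and geneB intersect.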
I convert the resulting GFF table (~500 genes, with CDS and UTR features) to GenBank format with the AUGUSTUS script and split it into ~350 training genes and a test set. After running etraining and checking on the test set, the best result I've got for gene prediction is about 0.5. Optimizing doesn't help much. The tutorial suggests setting "excludestopocodon..." to TRUE in bug_parameters.cfg, which in my case makes the training quality even worse.
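The train/test split itself can be illustrated with a short Python sketch. This is only an illustration under assumptions: the gene IDs are invented, the function name `split_train_test` is hypothetical, and in practice the split is normally done on the GenBank file (AUGUSTUS ships a `randomSplit.pl` script for that) rather than on a list of IDs.

```python
import random

def split_train_test(gene_ids, n_train=350, seed=1):
    """Randomly split gene IDs into a training list and a test list."""
    ids = sorted(gene_ids)      # fixed starting order, so the seed is reproducible
    rng = random.Random(seed)   # local RNG, does not touch global random state
    rng.shuffle(ids)
    return ids[:n_train], ids[n_train:]

# Invented IDs standing in for the ~500 filtered genes.
gene_ids = [f"gene{i}" for i in range(500)]
train, test = split_train_test(gene_ids)
print(len(train), len(test))
```

A random split matters here because genes in a GFF are ordered by position, so taking the first 350 would put whole regions of the genome into one set only.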
So my main questions are: what gene/exon/UTR prediction quality do you get? Should it be this low? Do you see any flaw in my pipeline, and what would you suggest?