How can you measure the completeness of an annotation process?
13 months ago

I am annotating a plant genome using MAKER-P. I used EST and transcriptome data, and reduced the redundancy in the ESTs using CD-HIT. After three rounds of MAKER (est2genome and protein2genome, followed by training SNAP twice and training Augustus twice), I now have a total set of genes. I expected more genes than I now have, although this is a novel genome with no reference.

How can I tell if my annotation is complete?

Thanks

What is your expectation based on? You could compare with related species.

Closely related species have gene counts of about 26,857, 23,197, and 22,427, but the paper that reported these had a CEGMA completeness (Complete % of CEGs) of 86.29.

And how many do you have?

I have 17,973 genes, with a BUSCO score of C:68.4%[S:64.5%,D:3.9%],F:6.0%,M:25.6%,n:1440

The BUSCO score for the genome assembly is 93.7%

I ran BUSCO with this command line:

python /mnt/bin/busco/scripts/run_BUSCO.py -i  ~.maker.transcripts.fasta -o output -l ${LINEAGE} -m transcriptome -c 15  -sp my_species  -z --augustus_parameters='--progress=true'


You lost 25% of the BUSCO genes during the annotation process. This is not good.
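To make the 25% figure explicit, here is a minimal Python sketch that parses the one-line BUSCO summary quoted above and subtracts it from the assembly score. The function name `parse_busco_summary` is made up for this example, and it assumes the 93.7% quoted for the assembly is the complete (C) percentage.

```python
import re


def parse_busco_summary(summary):
    """Parse a BUSCO one-line summary into a dict (percentages as floats, n as int)."""
    values = {}
    for key, number in re.findall(r"([CSDFMn]):(\d+(?:\.\d+)?)", summary):
        values[key] = int(number) if key == "n" else float(number)
    return values


# Summary line reported for the MAKER transcripts in this thread.
annotation = parse_busco_summary("C:68.4%[S:64.5%,D:3.9%],F:6.0%,M:25.6%,n:1440")

# Score reported for the genome assembly (assumed to be the C value).
assembly_complete = 93.7

drop = assembly_complete - annotation["C"]
print(f"Complete BUSCOs lost between assembly and annotation: {drop:.1f} points")
```

With the numbers from this thread, the difference comes out at roughly 25 percentage points.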

I am trying to use BRAKER for re-annotation and evaluation, but BRAKER has been very difficult to use. It keeps dying without any error.

Do you have any suggestions on how to recover the lost 25% of BUSCOs?

Did you activate the keep_preds parameter?

No, I did not activate keep_preds. When I set keep_preds=1, it gives proteins with an AED of 1; see this example:
mRNA-1 protein AED:1.00 eAED:1.00 QI:0|0|0|0|1|1|6|0|661

That is normal: it adds predictions that have no support from the evidence (EST or protein), so they get an AED of 1.
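If you want to see how many of the kept predictions are unsupported, a rough sketch like the following counts FASTA headers with an AED of 1. It assumes the headers look like the example above (an `AED:` tag next to an `eAED:` tag). The function names, sequence IDs, and the second header are invented for illustration.

```python
import re

# \b keeps the pattern from matching the "AED:" inside "eAED:".
AED_PATTERN = re.compile(r"\bAED:(\d+(?:\.\d+)?)")


def aed_values(fasta_lines):
    """Yield (sequence_id, AED) for every FASTA header that carries an AED tag."""
    for line in fasta_lines:
        if not line.startswith(">"):
            continue
        match = AED_PATTERN.search(line)
        if match:
            yield line[1:].split()[0], float(match.group(1))


def count_unsupported(fasta_lines):
    """Return (number of records with AED of 1, number of records with an AED tag)."""
    pairs = list(aed_values(fasta_lines))
    unsupported = sum(1 for _, aed in pairs if aed >= 1.0)
    return unsupported, len(pairs)


# Toy input: hypothetical IDs; the second header is invented for contrast.
example = [
    ">gene1-mRNA-1 protein AED:1.00 eAED:1.00 QI:0|0|0|0|1|1|6|0|661",
    "MSTK",
    ">gene2-mRNA-1 protein AED:0.12 eAED:0.12 QI:10|1|1|1|0|0|3|50|200",
    "MAAL",
]
print(count_unsupported(example))
```

On a real file you would pass an open file handle instead of the toy list, for example `count_unsupported(open("proteins.fasta"))`, where the file name is a placeholder.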

Can one proceed with these unsupported predictions?

So run with keep_preds. If you end up with between 25,000 and 30,000 genes, that is fine, and your BUSCO score will be much better. Then you can also try a run without SNAP and check the BUSCO score. Deactivating it can give better results.

13 months ago
Juke34 ★ 6.4k

Run BUSCO on your assembly. Then take the proteins you have predicted (all of them, including isoforms) and run BUSCO in protein mode. Compare the overall results (ignore the duplicated ones); they should be pretty close. If the BUSCO score on the proteins is far below that of the assembly, you have a problem in the annotation steps.
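Beyond comparing the two summary percentages, it can help to list which BUSCOs were found in the assembly but not in the predicted proteins. The sketch below is only an illustration: it assumes each run's full table is tab-separated, with the BUSCO ID in the first column, the status (Complete, Duplicated, Fragmented, Missing) in the second, and comment lines starting with `#`. The function names and the IDs in the toy data are hypothetical.

```python
def complete_ids(table_lines):
    """Collect BUSCO IDs whose status is Complete or Duplicated."""
    ids = set()
    for line in table_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2 and fields[1] in ("Complete", "Duplicated"):
            ids.add(fields[0])
    return ids


def lost_in_annotation(genome_table, protein_table):
    """Return BUSCO IDs complete in the genome run but not in the protein run."""
    return sorted(complete_ids(genome_table) - complete_ids(protein_table))


# Toy tables with made-up IDs, standing in for the two full-table files.
genome = [
    "# Busco id\tStatus\tSequence",
    "B1\tComplete\tscaf1",
    "B2\tDuplicated\tscaf2",
    "B2\tDuplicated\tscaf3",
    "B3\tMissing",
]
proteins = [
    "# Busco id\tStatus\tSequence",
    "B1\tComplete\tprot1",
    "B2\tFragmented\tprot2",
    "B3\tMissing",
]
print(lost_in_annotation(genome, proteins))
```

Duplicated entries are counted as complete here, in line with the advice to ignore duplication when comparing. The loci of the lost BUSCOs in the genome run are a good place to start looking at what the gene predictors missed.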