Question: How can you measure the completeness of an annotation process?
eennadi wrote:

I am annotating a plant genome using MAKER-P, with EST and transcriptome data as evidence. I reduced the redundancy in the ESTs using CD-HIT. After three rounds of MAKER (est2genome and protein2genome, followed by training SNAP twice and training Augustus twice), I now have a total set of genes. I am expecting more genes than I now have, although this is a novel genome with no reference.

How can I tell if my annotation is complete?

Thanks

assembly • 140 views
modified 29 days ago • written 4 weeks ago by eennadi

What is your expectation based on? You could compare with related species.

written 4 weeks ago by Jean-Karim Heriche

Closely related species have gene counts of about 26,857, 23,197, and 22,427, but the paper that reported these had a CEGMA completeness (Complete % of CEGs) of 86.29.

written 29 days ago by eennadi

And how many do you have?

written 29 days ago by Juke34

I have 17,973 genes, with a BUSCO score of C:68.4%[S:64.5%,D:3.9%],F:6.0%,M:25.6%,n:1440.

The BUSCO score for the genome assembly is 93.7%.

written 29 days ago by eennadi

I ran BUSCO with this command line:

python /mnt/bin/busco/scripts/run_BUSCO.py -i  ~.maker.transcripts.fasta -o output -l ${LINEAGE} -m transcriptome -c 15  -sp my_species  -z --augustus_parameters='--progress=true'

C:68.4%[S:64.5%,D:3.9%],F:6.0%,M:25.6%,n:1440

The BUSCO score for the genome assembly is 93.7%

modified 29 days ago by genomax • written 29 days ago by eennadi

You lost 25% of the BUSCO genes during the annotation process. This is not good.

written 29 days ago by Juke34
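
The 25% figure can be reproduced from the two scores quoted in this thread. The following is a minimal sketch, assuming the 93.7% reported for the assembly is the "Complete" percentage and that both runs used the same lineage set (n:1440). The function names are made up for illustration and are not part of BUSCO.

```python
import re

# Matches the one-line BUSCO summary notation, for example:
# C:68.4%[S:64.5%,D:3.9%],F:6.0%,M:25.6%,n:1440
SUMMARY_RE = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco_summary(line):
    """Return the percentages and the BUSCO count from a summary line."""
    match = SUMMARY_RE.search(line)
    if match is None:
        raise ValueError("not a BUSCO summary line: %r" % line)
    result = {key: float(match.group(key)) for key in "CSDFM"}
    result["n"] = int(match.group("n"))
    return result


def completeness_gap(assembly_complete_pct, annotation_summary):
    """Percentage points of complete BUSCOs lost between assembly and annotation."""
    annotation = parse_busco_summary(annotation_summary)
    return round(assembly_complete_pct - annotation["C"], 1)


# Values quoted in the thread; 93.7 is assumed to be the assembly's "Complete" %.
annotation_line = "C:68.4%[S:64.5%,D:3.9%],F:6.0%,M:25.6%,n:1440"
gap = completeness_gap(93.7, annotation_line)
lost = round(gap / 100 * parse_busco_summary(annotation_line)["n"])
print("gap: %.1f points, roughly %d BUSCOs" % (gap, lost))
```

With the numbers above the gap comes out at 25.3 percentage points, which is roughly 364 of the 1440 BUSCOs.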

I am trying to use BRAKER to re-annotate and evaluate, but BRAKER has been very difficult to use. It keeps dying without any error.

Do you have any suggestions on how to recover the lost 25% of BUSCO genes?

written 29 days ago by eennadi

Did you activate the keep_preds parameter?

written 29 days ago by Juke34

No, I did not activate keep_preds. When I set keep_preds=1, it gives proteins with an AED of 1. See this example:
mRNA-1 protein AED:1.00 eAED:1.00 QI:0|0|0|0|1|1|6|0|661

written 29 days ago by eennadi

That is normal: it adds predictions that do not have any support from the evidence (EST or protein).

written 29 days ago by Juke34
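
To see how many of the predictions added by keep_preds=1 are unsupported, one option is to read the AED value from the MAKER FASTA headers. This is a minimal sketch, assuming headers carry an `AED:` field as in the example above; the helper names and the second header are invented for illustration.

```python
import re

# Match "AED:<number>" only when it starts a whitespace-separated field,
# so the "eAED:" field in the same header is not picked up by mistake.
AED_RE = re.compile(r"(?<!\S)AED:(\d+(?:\.\d+)?)")


def aed_from_header(header):
    """Return the AED value from a MAKER FASTA header, or None if absent."""
    match = AED_RE.search(header)
    return float(match.group(1)) if match else None


def split_by_support(headers, unsupported_aed=1.0):
    """Separate headers into evidence-supported and unsupported predictions."""
    supported, unsupported = [], []
    for header in headers:
        aed = aed_from_header(header)
        if aed is None:
            continue  # header without an AED field, skip it
        if aed >= unsupported_aed:
            unsupported.append(header)
        else:
            supported.append(header)
    return supported, unsupported


headers = [
    # Header quoted in the thread (AED of 1, no evidence support).
    "mRNA-1 protein AED:1.00 eAED:1.00 QI:0|0|0|0|1|1|6|0|661",
    # Hypothetical header for a prediction that has evidence support.
    "mRNA-2 protein AED:0.12 eAED:0.12 QI:0|1|1|1|1|1|3|0|250",
]
supported, unsupported = split_by_support(headers)
print(len(supported), "supported,", len(unsupported), "unsupported")
```

Counting the two groups before and after enabling keep_preds shows how much of the gene set rests on ab initio prediction alone.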

Can one proceed with these unsupported predictions?

written 29 days ago by eennadi

So run with keep_preds. If you end up with between 25,000 and 30,000 genes, that is fine, and your BUSCO score will be much better. Then you can also try a run without SNAP and check the BUSCO score. Deactivating it can give better results.

written 29 days ago by Juke34
Juke34 wrote:

Run BUSCO on your assembly. Then take the proteins you have predicted (all of them, including isoforms) and run BUSCO in protein mode. Compare the overall results (ignore the duplicated ones). They should be pretty close. If the BUSCO score on the proteins is far below that of the assembly, you have a problem in the annotation steps.

modified 4 weeks ago • written 4 weeks ago by Juke34
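
Beyond comparing the two overall percentages, the per-BUSCO tables from both runs can show which genes were found in the assembly but lost in the annotation. This is a minimal sketch, assuming each full table is tab-separated with `#` comment lines, the BUSCO ID in the first column and the status (Complete, Duplicated, Fragmented, Missing) in the second. The IDs below are invented for illustration.

```python
def read_statuses(table_text):
    """Map each BUSCO ID to its status from the text of a full table."""
    statuses = {}
    for line in table_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split("\t")
        if len(fields) < 2:
            continue
        # Duplicated BUSCOs appear on several rows with the same status,
        # so overwriting the entry is harmless.
        statuses[fields[0]] = fields[1]
    return statuses


def lost_in_annotation(assembly_table, protein_table,
                       found=("Complete", "Duplicated")):
    """BUSCO IDs found in the assembly run but not in the protein run.

    Complete and Duplicated are both counted as found, following the
    advice to ignore duplication when comparing the two runs.
    """
    assembly = read_statuses(assembly_table)
    proteins = read_statuses(protein_table)
    return sorted(
        busco_id
        for busco_id, status in assembly.items()
        if status in found and proteins.get(busco_id, "Missing") not in found
    )


# Toy tables with hypothetical BUSCO IDs.
assembly_table = (
    "# Busco id\tStatus\n"
    "B1\tComplete\n"
    "B2\tDuplicated\n"
    "B2\tDuplicated\n"
    "B3\tMissing\n"
    "B4\tFragmented\n"
)
protein_table = "B1\tComplete\nB2\tMissing\nB3\tMissing\nB4\tMissing\n"
print(lost_in_annotation(assembly_table, protein_table))
```

The resulting IDs point at loci where the assembly contains the gene but the annotation does not, which are the places to inspect first.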