Question

Is BUSCO really better than CEGMA for genome assembly quality evaluation?

1

6.3 years ago

shelkmike ★ 1.2k

BUSCO is a successor to CEGMA and is often described as superior. However, I doubt that this is so. CEGMA uses a set of ultra-conserved genes: the ones that are present in human, mouse, fruit fly, nematode, Arabidopsis and yeasts. By contrast, BUSCO uses genes that are single-copy in at least 90% of species, so the BUSCO criterion for including a gene in a reference set is less strict.

Thus, when I assemble a genome of some species and see that 95% of the CEGMA genes are present, I can be almost sure that approximately 95% of all genes of the species are assembled, since if a gene is present in human, mouse, fruit fly, nematode, Arabidopsis and yeasts, it should be present in almost all eukaryotes, except some very exotic ones. On the other hand, when I see that 95% of the BUSCO genes are in my assembly, this doesn't really tell me how good my assembly is, since there is an ambiguity: the genome of my species may contain 95% of the BUSCO genes and thus the assembly is perfect, or, alternatively, the genome may contain 100% of the BUSCO genes and then the assembly is not perfect.

The question is: am I right that BUSCO is worse than CEGMA for estimating assembly completeness?
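
The ambiguity described above can be made concrete with a small sketch. It assumes a deliberately simplified model in which the observed fraction of reference genes equals the fraction truly present in the genome multiplied by the fraction of those that the assembly recovers. The function name and the model itself are illustrative assumptions, not part of BUSCO or CEGMA.

```python
def implied_completeness(observed_fraction, present_fraction):
    """Assembly completeness implied by an observed hit rate.

    Simplified model (an assumption for illustration):
        observed_fraction = present_fraction * completeness
    where present_fraction is the share of reference genes that the
    species' genome truly contains.
    """
    if not 0 < present_fraction <= 1:
        raise ValueError("present_fraction must be in (0, 1]")
    if not 0 <= observed_fraction <= 1:
        raise ValueError("observed_fraction must be in [0, 1]")
    # Cap at 1.0: an assembly cannot recover more genes than the genome has.
    return min(1.0, observed_fraction / present_fraction)


observed = 0.95  # 95% of the reference genes were found in the assembly

# Scan a few hypothetical values for how much of the reference set the
# genome really contains.
for present in (1.00, 0.975, 0.95):
    completeness = implied_completeness(observed, present)
    print(f"genome contains {present:.1%} of the set -> "
          f"assembly completeness {completeness:.1%}")
```

Under this model, the same 95% observation is compatible with any completeness between 95% and 100%, depending on an unknown that the reference set alone cannot resolve.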

Genome assembly BUSCO CEGMA • 5.2k views

updated 6.1 years ago by h.mon 35k • written 6.3 years ago by shelkmike ★ 1.2k

score 3 · Answer 1 · 2018-01-19

6.3 years ago

lieven.sterck 15k

tough question ;-)

I can only point you to this publication, which sheds some more light on this issue.

http://www.plantcell.org/content/28/8/1759

Long story short: neither is optimal ;-) and there is probably no optimal one (yet).

0

Thanks for this reference. Makes for a good read!

6.1 years ago by cschu181 ★ 2.8k

score 0 · Answer 2 · 2018-03-20

6.1 years ago

h.mon 35k

"since there is an ambiguity: the genome of my species may contain 95% of the BUSCO genes and thus the assembly is perfect, or, alternatively, the genome may contain 100% of the BUSCO genes and then the assembly is not perfect."

Now this is a mind-bender; I really can't understand this conclusion.

I think BUSCO's main improvements over CEGMA are 1) the use of clade-specific genes, which allows for a greater number of genes and thus greater precision in quality estimation; and 2) the use of an up-to-date database. Indeed, BUSCO implements ideas that the authors of CEGMA intended to implement but didn't, because of a lack of funding:

"One planned aspect of 'CEGMA v3' was to replace the reliance on the aging KOGs database. Another aspect of the new version of CEGMA would be to develop clade-specific sets of core genes."

And:

"BUSCO seems to do everything that we wanted to include in CEGMA v3 and it is based on OrthoDB, a resource that has generated a new set of orthologs (developed by the same authors)."

0

Thank you for your response. I'll try to reformulate in simpler words:

1) CEGMA's protein set has the shortcoming of containing too few proteins (248, to be precise).

2) The shortcoming of BUSCO's sets is that they contain proteins that are single-copy in 90% of species, not 100%.

Why is it commonly assumed that the second shortcoming matters less than the first?

6.1 years ago by shelkmike ★ 1.2k
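
The first shortcoming (a small gene set) can be quantified roughly as sampling error. The sketch below treats each reference gene as an independent found/not-found trial and uses the binomial standard error of the observed fraction. The independence assumption, the normal-approximation interval, and the larger set size of 1000 genes are all illustrative assumptions; only the figure of 248 comes from the thread.

```python
import math


def completeness_standard_error(observed_fraction, n_genes):
    """Binomial standard error of a completeness estimate.

    Assumes (for illustration) that each of the n_genes reference genes
    is an independent trial that is either found or not found.
    """
    if n_genes <= 0:
        raise ValueError("n_genes must be positive")
    if not 0 <= observed_fraction <= 1:
        raise ValueError("observed_fraction must be in [0, 1]")
    return math.sqrt(observed_fraction * (1 - observed_fraction) / n_genes)


observed = 0.95
gene_sets = (
    ("CEGMA-sized set", 248),         # size quoted in the reply above
    ("larger hypothetical set", 1000),  # assumed size, for comparison only
)

for label, n_genes in gene_sets:
    se = completeness_standard_error(observed, n_genes)
    # 1.96 * SE is the usual normal-approximation half-width of a 95%
    # interval; it is only a rough guide for fractions close to 1.
    print(f"{label}: n={n_genes}, SE={se:.4f}, "
          f"approx. 95% interval = +/- {1.96 * se:.3f}")
```

A larger set shrinks this sampling error, whereas the second shortcoming acts as a possible systematic offset that does not shrink with set size, so the two are not directly interchangeable.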

1

Have a read of the paper I posted above ;)

1) We have since learned that CEGMA's approach is far too restrictive.

2) Requiring genes to be single-copy in 100% of species does not make much (biological) sense, because being single-copy is just a snapshot in time (I'm mainly talking from a plant perspective here), so such a requirement would drop lots of informative 'genes'. Nonetheless, the BUSCO set already covers a much broader range of protein sequences, which is likely why people prefer BUSCO over CEGMA.

6.1 years ago by lieven.sterck 15k

1

Thank you. I have already read the article but haven't found a clear answer there. I thought that maybe some BioStars members would have a less ambiguous answer.

6.1 years ago by shelkmike ★ 1.2k