Question: When is a duplicated BUSCOs value too high? Reasons and how to improve it
resug10 wrote:

Hi, I have a de novo whole-genome assembly of a plant genome with a BUSCO score of:



1568  Complete BUSCOs (C)
1398  Complete and single-copy BUSCOs (S)
170   Complete and duplicated BUSCOs (D)
18    Fragmented BUSCOs (F)
28    Missing BUSCOs (M)
1614  Total BUSCO groups searched

Is 10.5% of complete and duplicated BUSCOs (D) too high or something that I should be worried about? If so, what are the reasons for getting high D values and how can I reduce/fix it? Thanks.
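For reference, the 10.5% figure can be recomputed from the counts in the summary. This is just a minimal sketch of the arithmetic; the dictionary keys and the function name are my own labels, not anything BUSCO produces.

```python
# Counts copied from the BUSCO summary in this post.
# The key names are illustrative labels, not BUSCO output fields.
counts = {
    "single_copy": 1398,
    "duplicated": 170,
    "fragmented": 18,
    "missing": 28,
}
total = 1614  # total BUSCO groups searched

# Sanity check: the four categories should add up to the total.
assert sum(counts.values()) == total


def busco_percentages(counts, total):
    """Return each category as a percentage of all groups searched."""
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


percentages = busco_percentages(counts, total)
for name, pct in percentages.items():
    print(f"{name}: {pct}%")
```

Note that D is expressed relative to all 1614 groups searched, not relative to the 1568 complete ones.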

VIB, Ghent, Belgium
lieven.sterck (10.0k) wrote:

Without knowing the details I would say, no, that's not too high, nor a worrying result.

It's well known that plants often have a substantial number of duplicated genes, and 10% is not high in that respect. Redundancy in your assembly can also increase the duplicated fraction, but again, I don't think 10% is very high.

Bottom line: to me this is a perfectly acceptable result for a BUSCO analysis.
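If you do want to check whether assembly redundancy contributes, one rough approach is to see where the duplicated copies sit. The sketch below assumes a tab-separated table in the style of BUSCO's full_table.tsv, with the BUSCO id, status and sequence name in the first three columns and `#` comment lines; please check this against your own output, as the layout can differ between versions. The function names, BUSCO ids and contig names are made up for illustration.

```python
from collections import Counter, defaultdict


def duplicated_busco_locations(lines):
    """Map each duplicated BUSCO id to the sequences carrying a copy.

    Assumes tab-separated rows: id, status, sequence, ... (assumption).
    """
    locations = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # e.g. rows for missing BUSCOs have no sequence
        busco_id, status, sequence = fields[:3]
        if status == "Duplicated":
            locations[busco_id].append(sequence)
    return dict(locations)


def shared_sequence_pairs(locations):
    """Count how many duplicated BUSCOs each pair of sequences shares."""
    pairs = Counter()
    for sequences in locations.values():
        unique = sorted(set(sequences))
        for i, first in enumerate(unique):
            for second in unique[i + 1:]:
                pairs[(first, second)] += 1
    return pairs


# Toy rows with hypothetical ids and contig names.
example = [
    "# Busco id\tStatus\tSequence\tGene Start\tGene End",
    "B1\tComplete\tctgA\t10\t500",
    "B2\tDuplicated\tctgA\t600\t900",
    "B2\tDuplicated\tctgB\t50\t350",
    "B3\tDuplicated\tctgA\t1000\t1400",
    "B3\tDuplicated\tctgB\t400\t800",
    "B4\tMissing",
]

locations = duplicated_busco_locations(example)
pairs = shared_sequence_pairs(locations)
for pair, n in pairs.most_common():
    print(pair, n)
```

If the same pair of sequences keeps sharing many duplicated BUSCOs, that may point to redundant sequence in the assembly; duplicates spread across many unrelated sequences would fit better with real gene duplication. This is only a heuristic.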

Thanks! It's good to hear this. Are not BUSCOs expected to be single copy?

Well, yes and no :)

Have a look at the details of how those BUSCO groups are built. If I remember correctly, they need to be single copy in at least x% of the species under investigation. If you were very stringent (i.e. single copy in all species), you would end up with only a few dozen to a few hundred groups (we once did this exercise). This is mainly because many of those plant species are (ancient) polyploids. Take poplar, for instance: nearly all genes are still present in duplicate there, so you would not keep many truly single-copy genes for BUSCO.
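To make the threshold idea concrete, here is a toy sketch. It is not BUSCO's actual selection procedure, just an illustration of "single copy in at least some fraction of species". The function name, the `min_fraction` parameter, the group and species names and the copy numbers are all invented for the example.

```python
def single_copy_groups(copy_numbers, min_fraction):
    """Keep groups that are single copy in at least min_fraction of species.

    copy_numbers: {group: {species: number_of_copies}} (toy structure).
    min_fraction: value between 0 and 1, chosen by the caller.
    """
    kept = []
    for group, per_species in copy_numbers.items():
        if not per_species:
            continue  # no species data for this group
        n_single = sum(1 for copies in per_species.values() if copies == 1)
        if n_single / len(per_species) >= min_fraction:
            kept.append(group)
    return kept


# Invented copy numbers for three groups in four hypothetical species.
toy = {
    "groupA": {"sp1": 1, "sp2": 1, "sp3": 1, "sp4": 1},
    "groupB": {"sp1": 1, "sp2": 1, "sp3": 1, "sp4": 2},
    "groupC": {"sp1": 2, "sp2": 2, "sp3": 1, "sp4": 2},
}

strict = single_copy_groups(toy, 1.0)    # single copy in every species
relaxed = single_copy_groups(toy, 0.75)  # tolerate one duplicated species
print("strict:", strict)
print("relaxed:", relaxed)
```

With the strict setting only groupA survives, while the relaxed setting also keeps groupB, which is duplicated in one species. A genome like that species would then legitimately report groupB as "duplicated".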

Very interesting information. Thanks for sharing your knowledge. Now the answer to this question is clear.

