Question

Low BUSCO score for de novo genome assembly

2.8 years ago

siu ▴ 160

Hi all, I am doing a de novo genome assembly of a Chlamydomonas species. I have 100 bp paired-end Illumina reads, and I estimated the coverage using

C = LN/G
G = genome size of Chlamydomonas reinhardtii
L = 100bp
N = Number of reads

This gives about 50x coverage. I used fastp for quality trimming of the reads and then ran SOAPdenovo, Velvet, ABySS and SPAdes, each with k-mer lengths of 57, 67, 77 and 87, which gives 16 assemblies in total. But when I run BUSCO for quality assessment of the assemblies using
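The coverage estimate C = LN/G above can be sketched in a few lines of Python. The read count and genome size below are placeholder values chosen for illustration (the post does not give N or G), and the function name is hypothetical. Two things are worth checking when plugging in real numbers: N should count individual reads, so both mates of every pair, and G is taken from C. reinhardtii, so the true coverage will differ if the new species has a different genome size.

```python
def estimated_coverage(read_length_bp: int, num_reads: int, genome_size_bp: int) -> float:
    """Return C = L * N / G, the expected average depth of coverage."""
    if genome_size_bp <= 0:
        raise ValueError("genome size must be positive")
    return read_length_bp * num_reads / genome_size_bp


# Placeholder inputs for illustration only (not the poster's real numbers).
READ_LENGTH_BP = 100           # L, as stated in the post
NUM_READS = 55_500_000         # N, hypothetical; counts both mates of every pair
GENOME_SIZE_BP = 111_000_000   # G, assumed approximate C. reinhardtii genome size

coverage = estimated_coverage(READ_LENGTH_BP, NUM_READS, GENOME_SIZE_BP)
print(f"Estimated coverage: {coverage:.1f}x")
```

Reads removed or shortened by trimming lower the effective coverage, so it may be more accurate to use the post-trimming read count and mean read length.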

--lineage_dataset chlorophyta_odb10 -m genome --cpu 16 --augustus_species chlamydomonas

I get a very low BUSCO score for every assembly, for example

C:18.2%[S:2.6%,D:15.6%],F:0.1%,M:81.7%,n:1519
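To compare the 16 assemblies side by side, the one-line summary can be parsed into its fields (C = complete, S = single-copy, D = duplicated, F = fragmented, M = missing, n = number of BUSCO groups searched). The following is a minimal sketch; the function name and the regular expression are my own and not part of BUSCO.

```python
import re

# Matches the one-line BUSCO summary notation shown above.
BUSCO_PATTERN = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco_summary(line: str) -> dict:
    """Return the percentages (floats) and n (int) from a BUSCO summary line."""
    match = BUSCO_PATTERN.search(line)
    if match is None:
        raise ValueError(f"not a BUSCO summary line: {line!r}")
    fields = {key: float(value) for key, value in match.groupdict().items()}
    fields["n"] = int(fields["n"])
    return fields


summary = parse_busco_summary("C:18.2%[S:2.6%,D:15.6%],F:0.1%,M:81.7%,n:1519")
duplicated_share = summary["D"] / summary["C"]
print(f"Missing: {summary['M']}% of {summary['n']} BUSCOs")
print(f"Share of complete BUSCOs that are duplicated: {duplicated_share:.0%}")
```

In this particular result most of the complete BUSCOs are duplicated rather than single-copy, which may be worth looking into alongside the high missing fraction.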

I cannot work out where the problem is. Please help.

Thanks

illumina genome assembly reads rna-seq • 1.6k views

updated 2.8 years ago by samuel.a.odonnell ▴ 520 • written 2.8 years ago by siu ▴ 160

Even with the complexity of these species' genomes, that is a surprisingly low BUSCO score. As noted below, an important statistic would be your assembly size, to see whether a lot of material is missing, as the BUSCO analysis suggests. Also, depending on how distant your species is from other Chlamydomonas species, you could try aligning your genome to one of theirs and seeing how much is covered.
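One way to read "how much is covered" is the fraction of reference bases hit by at least one aligned contig. The sketch below assumes the alignments are already available in PAF format (for example from a whole-genome aligner such as minimap2), where columns 6, 8 and 9 hold the target name, start and end. The function name and the three demo records are made up for illustration.

```python
from collections import defaultdict


def covered_reference_bases(paf_lines):
    """Count reference bases covered by at least one alignment in PAF records."""
    intervals = defaultdict(list)  # target name -> list of (start, end)
    for line in paf_lines:
        if not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        intervals[cols[5]].append((int(cols[7]), int(cols[8])))

    covered = 0
    for spans in intervals.values():
        spans.sort()
        cur_start, cur_end = spans[0]
        for start, end in spans[1:]:
            if start <= cur_end:  # overlapping or touching: extend the current run
                cur_end = max(cur_end, end)
            else:                 # gap: close the current run and start a new one
                covered += cur_end - cur_start
                cur_start, cur_end = start, end
        covered += cur_end - cur_start
    return covered


# Made-up PAF records: two overlapping hits on chr1, one hit on chr2.
demo_paf = [
    "contig1\t5000\t0\t5000\t+\tchr1\t100000\t1000\t6000\t4800\t5000\t60",
    "contig2\t3000\t0\t3000\t-\tchr1\t100000\t5000\t8000\t2900\t3000\t60",
    "contig3\t2000\t0\t2000\t+\tchr2\t50000\t0\t2000\t1950\t2000\t60",
]
REFERENCE_SIZE_BP = 150_000  # hypothetical total reference length

covered = covered_reference_bases(demo_paf)
print(f"Covered: {covered} bp ({covered / REFERENCE_SIZE_BP:.1%} of the reference)")
```

The reference size is passed in separately because reference sequences with no alignments at all never appear in a PAF file, so summing target lengths from the file would overstate the covered fraction.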

ADD REPLY • link 2.8 years ago by samuel.a.odonnell ▴ 520

score 1 · Answer 1 · 2021-06-16

For a complex genome such as the one you work with, it is most likely impossible to get a complete assembly with short reads and 50x coverage. I think that is what BUSCO is telling you as well. You did not provide assembly statistics, but I am guessing that you have thousands of contigs and an N50 that is not very large. What is the assembly size when you apply a 1 kb cutoff? If it is considerably smaller than a reference Chlamy genome, it is no surprise that the BUSCO stats are underwhelming.
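The statistics asked for here (contig count, total size and N50 after a 1 kb cutoff) can be computed directly from the contig lengths. This is a minimal sketch with a plain FASTA reader and no external libraries; the function names, the file name in the comment and the example lengths are hypothetical.

```python
def read_contig_lengths(fasta_path):
    """Return the length of every sequence in a FASTA file."""
    lengths, current = [], 0
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                if current:
                    lengths.append(current)
                current = 0
            else:
                current += len(line.strip())
    if current:
        lengths.append(current)
    return lengths


def assembly_stats(lengths, min_length=1000):
    """Contig count, total size and N50 for contigs of at least min_length bp."""
    kept = sorted((n for n in lengths if n >= min_length), reverse=True)
    total = sum(kept)
    running, n50 = 0, 0
    for n in kept:
        running += n
        if running * 2 >= total:  # first contig at which half the total is reached
            n50 = n
            break
    return {"contigs": len(kept), "total_bp": total, "n50": n50}


# Made-up contig lengths; with a real assembly one would instead call
# read_contig_lengths("assembly.fasta") (hypothetical file name).
example_lengths = [5000, 3000, 2000, 800, 500]
stats = assembly_stats(example_lengths, min_length=1000)
print(stats)
```

Comparing `total_bp` across the 16 assemblies and against the reference genome size shows quickly whether most of the genome is simply absent from the contigs.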

You may be able to get a slightly better result by predicting genes yourself rather than letting BUSCO do it. If you use MAKER or a similar pipeline, and supplement de novo predictions with those based on comparative analysis, it may be possible to push the BUSCO scores upwards. Ultimately, I don't think you will get a much better result without some long reads; once you have them, you can use your existing short reads for polishing.