I'm working with human genomes sequenced at 4-8X coverage. I de novo assembled them using SGA and ran QC on the assemblies using QUAST and MultiQC, and unfortunately the results are far from ideal. I've attached a screenshot from the MultiQC report so you can see what I mean (please note that the two assemblies with noticeably better metrics in the report are high-coverage genomes at 30X). In particular, the "Genome Fraction" metric, which gives the percentage of the reference genome that the assembly aligns to, is extremely low, as is the total assembly length.
My question is whether this is expected for low-coverage genomes (4-8X), or whether something has gone wrong somewhere. I used exactly the same process on the high-coverage genomes in the report, and those genomes' metrics are fine. I've never worked with human genomes before, so I'd like some guidance on whether this data is acceptable to use for later analysis, or whether these metrics justify excluding the genomes. For those who want more information, I have explained how I assembled the genomes at the bottom of the post.
Thanks in advance!
I downloaded the sequence files from the EGA in .bam format. Each file had an associated MD5 checksum, so I know the files were downloaded correctly. Each genome had been split into 6 .bam files, so I used samtools merge and samtools index to combine them into a single .bam per genome (with its associated .bai). I then converted the merged .bam files to FASTQ using samtools fastq. Finally, I de novo assembled the genomes using SGA and ran QUAST and MultiQC on the assemblies.
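For anyone who wants to see the BAM-to-FASTQ steps above spelled out, here is a rough Python sketch of the command sequence. It is an illustration only, not the exact commands I ran: the file names (`sample1_part1.bam` and so on) and the helper function names are hypothetical, it assumes `samtools` is on the PATH, and it writes whatever `samtools fastq` emits on standard output to a single file, without any paired-end handling. The SGA, QUAST and MultiQC steps are not shown.

```python
import subprocess


def merge_cmd(merged_bam, part_bams):
    """Command to merge several BAM files into one: samtools merge OUT IN1 IN2 ..."""
    return ["samtools", "merge", str(merged_bam), *[str(p) for p in part_bams]]


def index_cmd(bam):
    """Command to index a BAM file (writes BAM.bai next to it)."""
    return ["samtools", "index", str(bam)]


def fastq_cmd(bam):
    """Command to convert a BAM file to FASTQ; reads are written to stdout."""
    return ["samtools", "fastq", str(bam)]


def bam_parts_to_fastq(part_bams, merged_bam, fastq_path):
    """Run merge, index and fastq in order for one genome.

    Hypothetical helper: requires samtools to be installed and the
    input files to exist. check=True stops at the first failing step.
    """
    subprocess.run(merge_cmd(merged_bam, part_bams), check=True)
    subprocess.run(index_cmd(merged_bam), check=True)
    with open(fastq_path, "wb") as out:
        subprocess.run(fastq_cmd(merged_bam), stdout=out, check=True)


if __name__ == "__main__":
    # Placeholder names for the 6 per-genome BAM files (assumed, not real paths).
    parts = [f"sample1_part{i}.bam" for i in range(1, 7)]
    # Only print the commands here; call bam_parts_to_fastq() to actually run them.
    for cmd in (merge_cmd("sample1.bam", parts),
                index_cmd("sample1.bam"),
                fastq_cmd("sample1.bam")):
        print(" ".join(cmd))
```

Calling `bam_parts_to_fastq(parts, "sample1.bam", "sample1.fastq")` would run the three steps for one genome; the resulting FASTQ is what would then go into the assembler.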