Question

Human de novo genome assembly resulting in bad quality assemblies

2.8 years ago

Jess • 0

Hi everyone!

I'm working with human genomes at between 4X and 8X coverage. I de novo assembled them using SGA and ran QC on the assemblies with QUAST and MultiQC; unfortunately, the results are far from ideal. I've attached a screenshot from the MultiQC report so you can see what I mean (please note that the two assemblies with noticeably better metrics in the report are high-coverage genomes at 30X). In particular, the "Genome Fraction" metric, which gives the percentage of the reference genome that the assembly aligns to, is extremely low, as is the total assembly length.

MultiQC report for some of the genomes

My question is whether this is expected for low-coverage genomes (4-8X), or whether something has gone wrong somewhere. I used exactly the same process on the high-coverage genomes in the report, and their metrics are fine. I've never worked with human genomes before, so I'd like some guidance on whether this data is acceptable for later analysis, or whether these metrics justify excluding the genomes. At the bottom of the post, I explain how I assembled the genomes, for those who want more detail on my process.

Thanks in advance!

Jess

I downloaded the sequence files from the EGA in .bam format; each file had an associated md5 checksum, so I know the files were downloaded correctly. Each genome had been split into 6 .bam files, so I used samtools merge and samtools index to combine them into a single .bam per genome (with an associated .bai). I then converted the merged .bam files to FASTQ using samtools fastq. Finally, I de novo assembled the genomes using SGA and ran QUAST and MultiQC on the assemblies.
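
For readers who want to reproduce the BAM-to-FASTQ part of this, here is a minimal sketch of the samtools steps written as a Python helper that only builds the command lines. The function name, the output file names, and the assumption that the reads are paired-end are all hypothetical and not taken from the post. The `samtools collate` step is also an addition: the samtools documentation asks for name-grouped input to `samtools fastq`, so it is included here on the assumption that the merged BAM is coordinate-sorted. The SGA and QUAST steps are left out.

```python
def bam_to_fastq_commands(parts, sample):
    """Build samtools command lines (argv lists) for one paired-end genome.

    Nothing is executed here; each list can be handed to
    subprocess.run(cmd, check=True). File names are illustrative only.
    """
    merged = f"{sample}.merged.bam"
    collated = f"{sample}.collated.bam"
    return [
        # Merge the per-genome BAM pieces into a single file, then index it.
        ["samtools", "merge", merged, *parts],
        ["samtools", "index", merged],
        # Group mates together by read name before FASTQ conversion
        # (assumption: the merged BAM is coordinate-sorted).
        ["samtools", "collate", "-o", collated, merged],
        # Write read 1 and read 2 to separate files; discard singletons
        # and reads without a clear READ1/READ2 flag.
        ["samtools", "fastq",
         "-1", f"{sample}_R1.fastq.gz",
         "-2", f"{sample}_R2.fastq.gz",
         "-0", "/dev/null", "-s", "/dev/null", "-n",
         collated],
    ]


if __name__ == "__main__":
    for cmd in bam_to_fastq_commands(["part1.bam", "part2.bam"], "sampleA"):
        print(" ".join(cmd))
```

If mates end up unpaired in the FASTQ files, a paired-end assembler may treat them as single reads, so this step may be worth checking before concluding that coverage alone explains the metrics.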

sga de human genome novo low coverage assembly • 856 views

Is there a reason you are trying to assemble rather than align? 4-8X coverage is not enough to get good genome assemblies.
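
As a rough illustration of why depth matters here, the sketch below uses the idealised Poisson model of sequencing depth (reads landing uniformly at random, which real data only approximates) to estimate what fraction of bases reach a given depth at 4X, 8X and 30X. The function name and the threshold of 5 reads are arbitrary choices for illustration, not values used by SGA.

```python
import math


def fraction_with_depth_at_least(mean_depth, k):
    """P(depth >= k) at a base when depth ~ Poisson(mean_depth)."""
    below = sum(
        math.exp(-mean_depth) * mean_depth ** i / math.factorial(i)
        for i in range(k)
    )
    return 1.0 - below


if __name__ == "__main__":
    # Compare low-coverage and high-coverage genomes under the same model.
    for depth in (4, 8, 30):
        covered = fraction_with_depth_at_least(depth, 1)
        well_covered = fraction_with_depth_at_least(depth, 5)
        print(f"{depth}X: >=1 read {covered:.4f}, >=5 reads {well_covered:.4f}")
```

Overlap-based assembly needs several reads spanning each position, so the fraction of bases at higher depth is arguably the more relevant number than the fraction covered at least once.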

2.8 years ago by GenoMax 141k

Yes, my project has to do with finding non-reference sequences in specific populations. I have genomes from many different populations and they are mostly 30X or higher, but certain populations that I need to include just don't have deeply sequenced genomes available for research. So I've had to resort to low coverage ones, otherwise certain populations just won't be represented at all.

2.8 years ago by Jess • 0

Assembly is not my field at all, so I am just guessing, but wouldn't it make more sense to stringently align these data to an excellent reference (hg38, or the new T2T), then take the unmapped reads and try to make some sense of them? The recent T2T work showed how much effort it takes to get a truly comprehensive human assembly, and that included state-of-the-art long-read technology and proper depth. I'm not sure whether collecting published low-depth samples and assembling them independently makes sense. Can't you at least pool all samples of a population to get higher depth? As I said, this is not my field, so I am just guessing.

2.8 years ago by ATpoint 82k

So that is exactly what I am doing, but I'm comparing two different methods. The first method de novo assembles the whole genomes, aligns the assemblies to hg38, and then focuses on the unaligned sequences. The second method takes the sequencing reads, keeps only the unaligned ones, and assembles only those. One aim of my project is to compare the results. But I need to decide whether it is even worth including these genomes when this is the quality I'm getting.
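
For the second method, the "unaligned reads" step comes down to testing bits in the SAM FLAG field. The sketch below shows that logic in Python, plus the equivalent `samtools view -f` filter as a command line that is built but not run. The function names and file names are hypothetical, and whether to require both mates to be unmapped is a design choice the thread does not specify.

```python
READ_UNMAPPED = 0x4  # SAM FLAG bit: this read is unmapped
MATE_UNMAPPED = 0x8  # SAM FLAG bit: this read's mate is unmapped


def is_unmapped(flag):
    """True if the read itself is flagged as unmapped."""
    return bool(flag & READ_UNMAPPED)


def pair_fully_unmapped(flag):
    """True if both the read and its mate are flagged as unmapped."""
    wanted = READ_UNMAPPED | MATE_UNMAPPED
    return (flag & wanted) == wanted


def unmapped_reads_command(bam, out_bam, both_mates=False):
    """Build (not run) a samtools command keeping only unmapped reads."""
    required = (READ_UNMAPPED | MATE_UNMAPPED) if both_mates else READ_UNMAPPED
    # -f keeps only records that have all of the given FLAG bits set.
    return ["samtools", "view", "-b", "-f", str(required), "-o", out_bam, bam]


if __name__ == "__main__":
    print(is_unmapped(77), pair_fully_unmapped(73))
    print(" ".join(unmapped_reads_command("sample.bam", "unmapped.bam", True)))
```

Requiring both mates to be unmapped is stricter and drops reads whose mate anchors them to the reference; keeping those pairs instead retains sequence adjacent to known loci, so the choice affects what the two methods end up comparing.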

2.8 years ago by Jess • 0