Question: When should you de novo assemble a whole genome, and when should you simply align it to a reference?
3.3 years ago by
Tórshavn, Faroe Islands
olavur100 wrote:

Say I have sequenced the whole genome of an individual or a set of individuals. I have some task at hand and need to decide whether to just align the reads to a reference such as GRCh38, or to de novo assemble each genome. I imagine both methods have pros and cons, so which one I should choose depends on the task. Is this the case? What are the differences?

modified 3.3 years ago by Istvan Albert ♦♦ 85k • written 3.3 years ago by olavur100

What task do you want to accomplish? I think it really depends on the task. For example, de novo assembly will make you lose information about variants and depth, whereas read alignment may lead to mistakes if you have a lot of repeat regions...

Many studies use the two approaches in parallel.

written 3.3 years ago by vmicrobio250

OK, so de novo assembly is good for finding, for example, structural variants and large CNVs, whereas alignment to a reference is better for SNPs, small indels and CNVs.

Can you elaborate a little bit on how information about depth is lost?

Using both approaches in parallel makes a lot of sense, if one wants to find as many types of variation as possible.

written 3.3 years ago by olavur100

De novo assembly will give you multi-FASTA files containing your contigs. You'll have larger fragments, but without information about variation and depth at each position. However, you can retrieve this information from the alignment you run in parallel.

written 3.2 years ago by vmicrobio250
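To make the point about depth and variation concrete, here is a minimal Python sketch. It is a toy model, not how any particular assembler or variant caller works: the reads, their positions, and the function names `pileup` and `consensus` are all made up for illustration, and the alignments are assumed to be ungapped. It shows that a pileup of aligned reads keeps per-position depth and allele counts, while a single consensus sequence (roughly what a contig in a FASTA file gives you) keeps only one base per position.

```python
from collections import Counter


def pileup(reads, ref_length):
    """Count the bases observed at each reference position.

    reads: iterable of (start, sequence) pairs with a 0-based start.
    Assumption for this sketch: every read aligns without gaps.
    """
    columns = [Counter() for _ in range(ref_length)]
    for start, seq in reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < ref_length:
                columns[pos][base] += 1
    return columns


def consensus(columns):
    """Collapse each column to its most frequent base, 'N' where nothing aligned."""
    return "".join(col.most_common(1)[0][0] if col else "N" for col in columns)


# Hypothetical reads from a heterozygous sample: two reads carry G at position 4.
reads = [
    (0, "ACGTAC"),
    (1, "CGTACG"),
    (2, "GTACGG"),
    (3, "TGCGGA"),
    (4, "GCGGAT"),
]

columns = pileup(reads, ref_length=10)
depth = [sum(col.values()) for col in columns]
contig = consensus(columns)

print("depth per position:", depth)
print("alleles at position 4:", dict(columns[4]))
print("consensus sequence:", contig)
```

The consensus string carries no trace of the minority allele at position 4 or of how many reads supported each base, which is the information you get back by aligning the reads.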
3.3 years ago by
Istvan Albert ♦♦ 85k
University Park, USA
Istvan Albert ♦♦ 85k wrote:

De novo assembly would be used primarily when we expect to see large-scale variations that are not present in the reference genome.

Alternatively, it is used when the variations are such that aligning the reads to the reference genome would produce confusing or ambiguous alignments from which we could not correctly reconstruct the original sequence.

written 3.3 years ago by Istvan Albert ♦♦ 85k

What would cause a read to produce confusing or ambiguous alignments? You're not just talking about structural variants and large CNVs here?

written 3.3 years ago by olavur100


With shorter reads (e.g. Illumina), reads are sometimes placed arbitrarily because of high similarity between genes. For example, if two homologues share 99% identity, the mapper cannot tell which copy a read came from and assigns it at random. As a result, it looks like the two homologues might have heterozygous SNPs, but Sanger sequencing will confirm that this is not the case. With longer reads (e.g. PacBio) this particular issue does not arise, because the reads span the whole ORF.

written 3.2 years ago by yeastngs10
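The homologue problem described above can be illustrated with a small Python simulation. This is only a sketch under strong assumptions: the two "genes" are random 300-base sequences that differ at three fixed positions (99% identity), reads are error-free, and "mapping" is plain exact substring matching. Real mappers tolerate mismatches and report a mapping quality instead. The names `make_homologues`, `candidate_genes` and `count_ambiguous` are hypothetical helpers written for this example.

```python
import random


def make_homologues(length=300, diff_positions=(50, 150, 250), seed=0):
    """Build two toy genes that differ only at diff_positions (99% identity here)."""
    rng = random.Random(seed)
    gene_a = "".join(rng.choice("ACGT") for _ in range(length))
    swap = {"A": "C", "C": "G", "G": "T", "T": "A"}  # guarantees a different base
    gene_b = list(gene_a)
    for pos in diff_positions:
        gene_b[pos] = swap[gene_b[pos]]
    return gene_a, "".join(gene_b)


def candidate_genes(read, genes):
    """Names of all genes that contain the read exactly (a stand-in for a mapper)."""
    return [name for name, seq in genes.items() if read in seq]


def count_ambiguous(source, genes, read_length):
    """Tile every possible read of read_length over source and classify it."""
    ambiguous = unique = 0
    for start in range(len(source) - read_length + 1):
        hits = candidate_genes(source[start:start + read_length], genes)
        if len(hits) > 1:
            ambiguous += 1
        else:
            unique += 1
    return ambiguous, unique


gene_a, gene_b = make_homologues()
genes = {"A": gene_a, "B": gene_b}

# Short reads (50 bases) versus one long read spanning the whole gene.
short_ambiguous, short_unique = count_ambiguous(gene_a, genes, read_length=50)
long_ambiguous, long_unique = count_ambiguous(gene_a, genes, read_length=len(gene_a))

print("50-base reads matching both genes:", short_ambiguous)
print("50-base reads matching one gene:", short_unique)
print("full-length reads matching both genes:", long_ambiguous)
```

With differences at positions 50, 150 and 250, 101 of the 251 possible 50-base reads from gene A contain no differing site, so nothing in the read itself says which copy it came from. A read spanning the whole gene covers all three sites and matches only its own copy.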