de novo assembly: should I discard contigs below insert size?
3.6 years ago
smo ▴ 20

I've just used ABySS to build a de novo assembly from some 2x150 PE sequence data (targets are probe-captured fragments from across the genome). My inserts are, on average, approximately 270 bp long. An outcome I didn't expect (but which makes sense in retrospect) is that the ABySS output contains a very large number of contigs all the way down to my chosen k-mer size, 46 bp. I'm having trouble understanding the best way to handle these, given that they should provide less information than the reads themselves. Should I keep them in for downstream analysis (SNP calling, mainly)? Should I discard any contigs below my read size? Below my insert size?
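For concreteness, here's a minimal Python sketch of the kind of length filter I have in mind (the file names and the cutoff are hypothetical placeholders, not anything ABySS produces):

    # Keep only contigs at or above a chosen length cutoff,
    # e.g. the read length (150 bp) or the mean insert size (270 bp).
    MIN_LEN = 150

    def read_fasta(path):
        """Yield (header, sequence) pairs from a FASTA file."""
        header, seq = None, []
        with open(path) as fh:
            for line in fh:
                line = line.rstrip()
                if line.startswith(">"):
                    if header is not None:
                        yield header, "".join(seq)
                    header, seq = line, []
                else:
                    seq.append(line)
            if header is not None:
                yield header, "".join(seq)

    with open("contigs.filtered.fa", "w") as out:
        for header, seq in read_fasta("abyss-contigs.fa"):
            if len(seq) >= MIN_LEN:
                out.write(header + "\n" + seq + "\n")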

In contrast, I also performed an assembly using MaSuRCA, which produced similar results (i.e. roughly the same N50 and L50, ignoring contigs below 500 bp in ABySS's case) but an order of magnitude fewer contigs. I think this is because MaSuRCA's coverage filter defaults to a more aggressive setting, so those low-coverage k-mers are discarded during assembly (though I'd have to double-check where the two differ).

[EDIT] This is reduced-representation sequencing of a eukaryote (an insect with a genome size of ~500 Mbp), i.e. it's not intended to be a high-quality assembly. The intended use is population genomics, so the idea is just to recover a large number of homologous loci. My assembly is based on a handful of samples and provides an "internal" reference to which all samples can then be aligned - which is part of why my instinct is to discard anything below the read length.

Any insights super appreciated! Thanks.

assembly abyss next-gen
3.6 years ago
Mensur Dlakic ★ 30k

For prokaryotic (meta)genome assemblies, I would under no circumstances keep contigs smaller than 1 kb. In most cases I throw away contigs smaller than 2 kb, as they are usually fragments that are contained in larger contigs, or were kept from extending into larger contigs by sequencing errors. At the very least it seems prudent to ignore contigs smaller than 500 bp, even though most people have a hard time with the concept of throwing away data. For contigs that mostly contain noise, discarding them is actually a good thing for downstream applications.

I don't know what you are after other than SNP calling, so it is difficult to give advice. Generally speaking, there is very little useful information in those short contigs. Even the "mutations" you see in them may be nothing but sequencing errors due to deep coverage. Speaking of which, what is the average coverage of your assembly? If it is greater than 200-300x, and certainly if it is greater than 500x, you may be seeing sequencing-error artefacts in your assembly.
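If you don't know that number offhand, a rough back-of-envelope estimate is total sequenced bases divided by total assembly length. A sketch with made-up placeholder numbers:

    # Rough coverage estimate: total read bases / assembly size.
    read_pairs = 10_000_000        # hypothetical; count the reads in your FASTQ
    bases_per_pair = 2 * 150       # 2x150 PE data
    assembly_size = 20_000_000     # hypothetical total bp across all contigs

    coverage = read_pairs * bases_per_pair / assembly_size
    print(f"~{coverage:.0f}x average coverage")  # -> ~150x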


Sorry, I should have provided more information. The data are from a eukaryote (an insect, genome size probably around 500 Mbp) and were obtained from old/degraded samples, so smaller contigs and spottier coverage are to be expected. But I'm not overly concerned about how well I've covered the genome, as this is a population genomics project. N50 is only about 587 bp. Not sure about coverage, but certainly not that high.
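For context, that figure comes from the standard N50 definition: sort contigs longest-first and take the length at which the cumulative sum reaches half the total assembly length. A quick sketch with toy numbers:

    def n50(lengths):
        """Sort lengths longest-first; return the length at which the
        running total first reaches half of the assembly size."""
        lengths = sorted(lengths, reverse=True)
        half_total = sum(lengths) / 2
        running = 0
        for length in lengths:
            running += length
            if running >= half_total:
                return length

    print(n50([900, 800, 700, 500, 300, 200]))  # -> 800 (1700 of 3400 bp)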


Yes, you kind of left out a very important piece of information. The N50 value indicates a bad assembly, but maybe that's the best one can get with your sample. I don't have first-hand experience with this kind of assembly, but with such a low N50 you may want to keep everything that is your read length (150 bp) or larger.
