I've just used ABySS to build a de novo assembly from some 2x150 bp paired-end sequence data (targets are probe-captured fragments from across the genome). My inserts are, on average, approximately 270 bp long. An outcome that I didn't expect (but which makes sense in retrospect) is that the output from ABySS contains a very large number of contigs all the way down to my chosen k-mer size, which is 46 bp. I'm having trouble understanding the best way to handle these, given that they should provide less information than the reads themselves. Should I keep them in for downstream analysis (SNP calling, mainly)? Should I discard any contigs below my read length? Below my insert size?
In contrast, I also performed an assembly using MaSuRCA, which produced similar results (i.e. roughly the same N50 and L50, ignoring contigs below 500 bp in ABySS's case), but an order of magnitude fewer contigs - I think because its coverage filter is set to a more aggressive default, so the k-mer-sized contigs are discarded during assembly (though I'd have to double-check where the two differ).
[EDIT] This is reduced-representation sequencing of a eukaryote (insect with genome size ~500Mbp), i.e. it's not intended to be a high-quality assembly. The intended use is for population genomics, so the idea is just to have a large number of homologous loci. My assembly is based on a handful of samples to provide an "internal" reference to which all samples can then be aligned - which is part of why my instinct is to discard anything below the read length.
Any insights super appreciated! Thanks.
Sorry, I should have provided more information. The data are from a eukaryote (an insect, genome size probably around 500 Mbp), and were obtained from old/degraded samples, so smaller contigs and spottier coverage are to be expected. But I'm not super concerned about how well I've covered the genome, as this is a population genomics project. N50 is only about 587 bp. Not sure about coverage, but certainly not that high.
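In case it helps anyone compare numbers like the N50 above across assemblers, here is a minimal sketch of how N50 and L50 can be computed from a list of contig lengths, with an optional minimum-length cutoff such as the 500 bp one mentioned in the question. The function name `n50_l50` and the `min_len` parameter are made up for illustration and are not part of ABySS or MaSuRCA.

```python
def n50_l50(lengths, min_len=0):
    """Return (N50, L50) for contig lengths, ignoring contigs shorter than min_len.

    N50 is the length of the contig at which the running total, taken from
    the longest contig downwards, first reaches half of the total assembly
    length. L50 is the number of contigs needed to get there.
    """
    # Hypothetical helper: apply the length cutoff, then sort longest first.
    kept = sorted((n for n in lengths if n >= min_len), reverse=True)
    if not kept:
        return 0, 0

    half_total = sum(kept) / 2
    running = 0
    for count, length in enumerate(kept, start=1):
        running += length
        if running >= half_total:
            return length, count


# Toy example with made-up contig lengths.
n50, l50 = n50_l50([100, 200, 300, 400])
print(n50, l50)
```

Because the cutoff changes the total length, the same assembly gives a different N50 with and without the short contigs, so it is worth applying the same `min_len` to both assemblies before comparing them.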
Yes, you kind of left out a very important piece of information. The N50 value indicates a bad assembly, but maybe that's the best one can get with your sample. I don't have first-hand experience with this kind of assembly, but with such a low N50 you may want to keep everything that is your read length (150 bp) or larger.
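If you do go with a read-length cutoff, a minimal sketch of the filtering step in plain Python could look like the following. It assumes a standard FASTA file of contigs; the helper names `read_fasta` and `filter_contigs` and the file name in the usage comment are hypothetical, and the 150 bp default simply mirrors the read length suggested above.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def filter_contigs(lines, min_len=150):
    """Keep only contigs whose sequence is at least min_len bases long."""
    return [(name, seq) for name, seq in read_fasta(lines) if len(seq) >= min_len]


# Usage sketch (file name is an assumption, adjust to your own output):
# with open("abyss-contigs.fa") as handle:
#     kept = filter_contigs(handle, min_len=150)
# with open("contigs.min150.fa", "w") as out:
#     for name, seq in kept:
#         out.write(f">{name}\n{seq}\n")
```

Changing `min_len` to the insert size (about 270 bp in the question) would give the stricter alternative, so both cutoffs can be tried and the number of retained loci compared.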