Forum:Behavior of short read assemblers regarding filtering out short contigs
Entering edit mode
7.9 years ago
Rayan Chikhi ★ 1.5k

Hi Biostar,

This is not a question, just a small blog post. While developing a new version of Minia, I was wondering how assemblers filter out short contigs from their results. I thought maybe someone else could benefit from these observations.

SPAdes 3.5:

  • Return all contigs of length at least k+1, except:
    • short (<= max(read length, 150 bp)) isolated contigs (those not connected to any other contig)
    • isolated ones with coverage at most 2 actually that filter is disabled by default

(source: and graph_simplification.hpp)


  • Return all contigs of length at least 2*k

Minia v1 and v2: same as Velvet.

BTW not sure if this should be a Blog or Forum post. Or just not posted here :)

velvet spades assembly minia • 2.7k views
Entering edit mode
7.9 years ago

On a semi-related note. I've been doing some abyss assemblies recently and I've noticed something strange about the contigs. One of the analysis I do with the contig assembly is to look at k-mer coverage of each contig. This is sometimes informative to see how repetitive the contigs are. I do this by:

  1. Run jellyfish on the contig assembly (I used k=30) and produce counts for each k-mer.
  2. For each contig, I just do a sliding window (k=30) from 1 to length - 30 and calculate a mean k-mer coverage for the contig.
  3. I bound the mean k-mer coverage to 100 (anything > 100 becomes a 100) just to make visualization nicer.
  4. For each contig, I plot length of the contig vs mean k-mer coverage (bounded). This is one of the plots I end up with.

    image: plots

This was an assembly done at k=120. There is a strange accumulation of contigs at double the k-length (240). The same thing occurs with k=70 assembly at 140.

I noticed in the abyss assembly that during certain steps the threshold for length filtering contigs is set for at least double the k-length. Is that because of this artifact?

Entering edit mode

Interesting. Was your sequencing coverage slightly above 80x? It's curious that your coverages spread out that much between 20-100 with only a small peak at around 80.

2k is a special value: the length of each path in a one-nucleotide mutation bubble (SNP or sequencing error) is k nodes (if start and end nodes aren't included). The sequence length of each path is 2k-1. So for each bubble not removed by the assembler (for any reason), there is going to be a contig of length 2k-1 (or 2k, or 2k+1, depending on how the assembler includes branching nodes in contigs).



Login before adding your answer.

Traffic: 2738 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6