Question

Forum:Behavior of short read assemblers regarding filtering out short contigs

4

Entering edit mode

9.0 years ago

Rayan Chikhi ★ 1.5k

Hi Biostar,

This is not a question, just a small blog post. While developing a new version of Minia, I was wondering how assemblers filter out short contigs from their results. I thought maybe someone else could benefit from these observations.

SPAdes 3.5:

Return all contigs of length at least k+1, except:
- short (<= max(read length, 150 bp)) isolated contigs (those not connected to any other contig)
- ~~isolated ones with coverage at most 2~~ actually that filter is disabled by default

(source: simplification.info and graph_simplification.hpp)

Velvet:

Return all contigs of length at least 2*k

Minia v1 and v2: same as Velvet.

BTW not sure if this should be a Blog or Forum post. Or just not posted here :)

velvet spades assembly minia • 3.0k views

ADD COMMENT • link updated 14 months ago by Ram 43k • written 9.0 years ago by Rayan Chikhi ★ 1.5k

Ram · Answer 1 · 2015-05-11

On a semi-related note. I've been doing some abyss assemblies recently and I've noticed something strange about the contigs. One of the analysis I do with the contig assembly is to look at k-mer coverage of each contig. This is sometimes informative to see how repetitive the contigs are. I do this by:

Run jellyfish on the contig assembly (I used k=30) and produce counts for each k-mer.
For each contig, I just do a sliding window (k=30) from 1 to length - 30 and calculate a mean k-mer coverage for the contig.
I bound the mean k-mer coverage to 100 (anything > 100 becomes a 100) just to make visualization nicer.
For each contig, I plot length of the contig vs mean k-mer coverage (bounded). This is one of the plots I end up with.

This was an assembly done at k=120. There is a strange accumulation of contigs at double the k-length (240). The same thing occurs with k=70 assembly at 140.

I noticed in the abyss assembly that during certain steps the threshold for length filtering contigs is set for at least double the k-length. Is that because of this artifact?