Forum:Behavior of short read assemblers regarding filtering out short contigs
1
4
Entering edit mode
6.4 years ago
Rayan Chikhi ★ 1.5k

Hi Biostar,

This is not a question, just a small blog post. While developing a new version of Minia, I was wondering how assemblers filter out short contigs from their results. I thought maybe someone else could benefit from these observations.

SPAdes 3.5:

  • Return all contigs of length at least k+1, except:
    • short (<= max(read length, 150 bp)) isolated contigs (those not connected to any other contig)
    • isolated ones with coverage at most 2 actually that filter is disabled by default

(source: simplification.info and graph_simplification.hpp)

Velvet:

  • Return all contigs of length at least 2*k

Minia v1 and v2: same as Velvet.

 

BTW not sure if this should be a Blog or Forum post. Or just not posted here :)

assembly spades velvet Forum minia • 2.3k views
ADD COMMENT
1
Entering edit mode
6.4 years ago

On a semi-related note. I've been doing some abyss assemblies recently and I've noticed something strange about the contigs. One of the analysis I do with the contig assembly is to look at k-mer coverage of each contig. This is sometimes informative to see how repetitive the contigs are. I do this by:

1. Run jellyfish on the contig assembly (I used k=30) and produce counts for each k-mer.

2. For each contig, I just do a sliding window (k=30) from 1 to length - 30 and calculate a mean k-mer coverage for the contig.

3. I bound the mean k-mer coverage to 100 (anything > 100 becomes a 100) just to make visualization nicer.

4. For each contig, I plot length of the contig vs mean k-mer coverage (bounded). This is one of the plots I end up with.

This was an assembly done at k=120. There is a strange accumulation of contigs at double the k-length (240). The same thing occurs with k=70 assembly at 140. 

I noticed in the abyss assembly that during certain steps the threshold for length filtering contigs is set for at least double the k-length. Is that because of this artifact? 

ADD COMMENT
0
Entering edit mode

Interesting. Was your sequencing coverage slightly above 80x? It's curious that your coverages spread out that much between 20-100 with only a small peak at around 80.

2k is a special value: the length of each path in a one-nucleotide mutation bubble (SNP or sequencing error) is k nodes (if start and end nodes aren't included). The sequence length of each path is 2k-1. So for each bubble not removed by the assembler (for any reason), there is going to be a contig of length 2k-1 (or 2k, or 2k+1, depending on how the assembler includes branching nodes in contigs). 

(source: https://raw.githubusercontent.com/redayounsi/wiki/master/bub2.png)

ADD REPLY

Login before adding your answer.

Traffic: 2073 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6