Forum: Behavior of short read assemblers regarding filtering out short contigs
3
gravatar for Rayan Chikhi
2.5 years ago by
Rayan Chikhi1.2k
France, Lille, CNRS
Rayan Chikhi1.2k wrote:

Hi Biostar,

This is not a question, just a small blog post. While developing a new version of Minia, I was wondering how assemblers filter out short contigs from their results. I thought maybe someone else could benefit from these observations.

SPAdes 3.5:

  • Return all contigs of length at least k+1, except:
    • short (<= max(read length, 150 bp)) isolated contigs (those not connected to any other contig)
    • isolated ones with coverage at most 2 actually that filter is disabled by default

(source: simplification.info and graph_simplification.hpp)

Velvet:

  • Return all contigs of length at least 2*k

Minia v1 and v2: same as Velvet.

 

BTW not sure if this should be a Blog or Forum post. Or just not posted here :)

minia velvet spades forum assembly • 1.1k views
ADD COMMENTlink modified 2.4 years ago by Damian Kao14k • written 2.5 years ago by Rayan Chikhi1.2k
0
gravatar for Damian Kao
2.4 years ago by
Damian Kao14k
UK
Damian Kao14k wrote:

On a semi-related note. I've been doing some abyss assemblies recently and I've noticed something strange about the contigs. One of the analysis I do with the contig assembly is to look at k-mer coverage of each contig. This is sometimes informative to see how repetitive the contigs are. I do this by:

1. Run jellyfish on the contig assembly (I used k=30) and produce counts for each k-mer.

2. For each contig, I just do a sliding window (k=30) from 1 to length - 30 and calculate a mean k-mer coverage for the contig.

3. I bound the mean k-mer coverage to 100 (anything > 100 becomes a 100) just to make visualization nicer.

4. For each contig, I plot length of the contig vs mean k-mer coverage (bounded). This is one of the plots I end up with.

This was an assembly done at k=120. There is a strange accumulation of contigs at double the k-length (240). The same thing occurs with k=70 assembly at 140. 

I noticed in the abyss assembly that during certain steps the threshold for length filtering contigs is set for at least double the k-length. Is that because of this artifact? 

ADD COMMENTlink written 2.4 years ago by Damian Kao14k

Interesting. Was your sequencing coverage slightly above 80x? It's curious that your coverages spread out that much between 20-100 with only a small peak at around 80.

2k is a special value: the length of each path in a one-nucleotide mutation bubble (SNP or sequencing error) is k nodes (if start and end nodes aren't included). The sequence length of each path is 2k-1. So for each bubble not removed by the assembler (for any reason), there is going to be a contig of length 2k-1 (or 2k, or 2k+1, depending on how the assembler includes branching nodes in contigs). 

(source: https://raw.githubusercontent.com/redayounsi/wiki/master/bub2.png)

ADD REPLYlink written 2.4 years ago by Rayan Chikhi1.2k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1414 users visited in the last hour