Question

Kmer selection for bacterial WGS denovo assembly using SPAdes or SOAP-denovo

0

4.9 years ago

bioinforesearchquestions ▴ 370

Hi friends,

We have WGS data for a bacterial sample. The reads are 75 bp paired-end with more than 200X coverage. After trimming, read lengths range from 20 to 75 bp.

I am going to try de novo assembly with SPAdes 3.13.1 and SOAPdenovo.

What criteria should be used to select the k-mer sizes for assembly?

Assembly SPAdes Soap-denovo WGS bacterial • 2.4k views

ADD COMMENT • link updated 4.9 years ago by h.mon 35k • written 4.9 years ago by bioinforesearchquestions ▴ 370

score 2 · Answer 1 · 2019-06-02

2

4.9 years ago

h.mon 35k

For SPAdes, it is best to let the assembler pick the kmer sizes automatically.

For SOAPdenovo, a good starting point is 2/3 of the maximal read length. There are some BioStars posts on the subject, by the way:

How To Choose The K Value Of Kmer In Soapdenovo?

Kmergenie k-mer estimate and multiple k-mers

using soap de novo assembly

Guidelines to choose K-mer size for De bruijn graph based assembly (2nd generation sequencing reads)?
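
As a rough illustration of the two-thirds starting point for SOAPdenovo mentioned above, the arithmetic for 75 bp reads can be sketched in shell. The variable names are made up for this example, stepping down to an odd value follows the usual de Bruijn graph convention rather than anything stated in this thread, and the candidate values around the starting point are only a suggestion for a comparison sweep.

```shell
# Illustrative only: derive a starting k from the read length.
max_read_len=75                      # longest read in this data set

# Roughly two thirds of the read length (integer arithmetic): 75 * 2 / 3 = 50
k=$(( max_read_len * 2 / 3 ))

# An odd k is the usual convention, so step down by one if the value is even.
if [ $(( k % 2 )) -eq 0 ]; then
  k=$(( k - 1 ))
fi
echo "starting k: $k"

# A few odd values around the starting point, to compare assemblies.
for candidate in $(( k - 8 )) $(( k - 4 )) "$k" $(( k + 4 )) $(( k + 8 )); do
  echo "candidate k: $candidate"
done
```

Keep in mind that reads shorter than k cannot contribute any kmers, which matters here because the trimmed reads go down to 20 bp.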

ADD COMMENT • link 4.9 years ago by h.mon 35k

0

Hi h.mon, I am currently using SPAdes with default settings, where the kmer sizes are chosen automatically. I have more than 430 million reads for a bacterial sample. Do you know of any tool to subset/downsample them to a lower coverage?

ADD REPLY • link 4.9 years ago by bioinforesearchquestions ▴ 370

2

Don't subset; use digital normalization, which is a better technique for reducing coverage without losing information. Several packages perform digital normalization; I use BBNorm (from the BBTools package) when I need to.

If you really want to down-sample, you can use reformat.sh (from the same BBTools package). For example, to down-sample to 10% of the original reads:

reformat.sh samplerate=0.1 in=original.fastq out=downsampled.fastq
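
If the goal is a specific fold reduction, the samplerate value is just the ratio of the desired coverage to the current coverage. Below is a minimal sketch, assuming hypothetical coverages of 10000X (current) and 1000X (desired) and placeholder file names. The reformat.sh command is only printed, not run. The in2=, out2= and sampleseed= options are, as far as I know, how reformat.sh handles paired files and reproducible sampling, but check the usage text of your BBTools version.

```shell
# Hypothetical numbers: estimated current coverage and the coverage to keep.
current_cov=10000
target_cov=1000

# samplerate is the fraction of reads to keep.
rate=$(awk -v c="$current_cov" -v t="$target_cov" 'BEGIN { printf "%.3f", t / c }')
echo "samplerate=$rate"

# Print (do not run) a paired-end command; the file names are placeholders.
echo "reformat.sh in=reads_R1.fastq in2=reads_R2.fastq out=sub_R1.fastq out2=sub_R2.fastq samplerate=$rate sampleseed=1"
```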

ADD REPLY • link 4.9 years ago by h.mon 35k

0

Hi h.mon, I am aiming to reduce the coverage from 10000X to 1000X, so in my case I need to do digital normalization with BBNorm rather than downsampling with reformat.sh:

bbnorm.sh in=reads.fq out=normalized.fq target=1000 min=30

What value of "min" is reasonable to get 1000X coverage?

ADD REPLY • link 4.9 years ago by bioinforesearchquestions ▴ 370

0

I have asked a related set of questions in another post (Should I consider contigs.fa or scaffolds.fa from SPAdes output for downstream analyses?).

ADD REPLY • link 4.8 years ago by bioinforesearchquestions ▴ 370

0

Hi h.mon,

Using BBNorm, can we downsample to a specific read coverage? I saw that the target option in BBNorm refers to kmer coverage. What value should I set for target in order to get 100X read coverage?

ADD REPLY • link 4.8 years ago by bioinforesearchquestions ▴ 370

1

As far as I know, one can't down-sample straight to a target read coverage without an assembled genome, so you have to content yourself with kmer coverage. Use target=100 and, after assembly, map the reads and check whether you got the expected coverage, then adjust target as needed, but I expect it will be close enough. As reads may contain errors, I expect target=100 will end up with slightly higher read coverage.

However, why do you want to do this? De Bruijn graph assemblers measure coverage in kmers, not reads.
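
For a rough idea of how kmer coverage and read coverage relate, the standard idealized approximation is read coverage ≈ kmer coverage × L / (L − k + 1), where L is the read length and k the kmer length. The sketch below assumes full-length 75 bp reads and k=31 (which I believe is the BBNorm default, but treat it as an assumption), and it ignores sequencing errors and the shorter trimmed reads, so the numbers are only a ballpark to check against the mapping-based estimate described above.

```shell
read_len=75       # nominal read length from the question
k=31              # assumed kmer length used for normalization
target_kmer=100   # value passed as target=

# read coverage ~ kmer coverage * L / (L - k + 1)
read_cov=$(awk -v L="$read_len" -v k="$k" -v ck="$target_kmer" \
  'BEGIN { printf "%.0f", ck * L / (L - k + 1) }')
echo "target=$target_kmer gives roughly ${read_cov}X read coverage"

# Inverse: kmer target that corresponds to a desired read coverage.
want_read=100
kmer_target=$(awk -v L="$read_len" -v k="$k" -v c="$want_read" \
  'BEGIN { printf "%.0f", c * (L - k + 1) / L }')
echo "about target=$kmer_target for ${want_read}X read coverage"
```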

ADD REPLY • link 4.8 years ago by h.mon 35k