Question

How do I assign a kmer value?

7.9 years ago

Seigfried ▴ 80

Hi there, I'd like to ask a basic question that I couldn't find an answer to. Please guide me.

I have deep sequencing data with 250 bp reads. I want to run error correction tools on it, such as SOAP Error Corrector or Quake. All of these programs ask for a kmer value. Even Jellyfish asks for a kmer value to estimate genome size.

I used kmergenie to find the optimum kmer value, but I suspect I did something wrong because the value it gave exceeds 180. Please correct me if I am wrong, but this value of 180 would be ideal for a de novo assembly, right? Error correction would need a much shorter kmer value.

In many research papers I have read, the authors assigned a kmer value like 17, 21, 23 and so on. This is for 150 bp reads. I just want to know where they got these values from. What should I use if I have 250 bp reads?

Niranjan

kmergenie SOAP_EC Quake Jellyfish • 3.7k views

score 5 · Answer 1 · 2016-05-24

Typically, the longer the kmer, the better, up to the point where the kmer depth becomes too low. The kmer depth is always lower than the read depth. Say reads have length R and you have a read coverage depth of DR; for kmers of length K, the kmer depth DK is:

DK = DR*(R-K+1)/R

So for example, if you have 250bp reads and 100x coverage, with K=31, it yields:

100*(250-31+1)/250 = 88
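
To make it easy to plug in other values, here is a minimal Python sketch of the same formula. The function name `kmer_depth` and its parameter names are made up for illustration; like the formula, it assumes error-free reads.

```python
def kmer_depth(read_depth, read_len, k):
    """Expected kmer depth DK = DR * (R - K + 1) / R, assuming error-free reads."""
    if not 1 <= k <= read_len:
        raise ValueError("k must be between 1 and the read length")
    return read_depth * (read_len - k + 1) / read_len


# The example above: 250bp reads, 100x read coverage, K=31
print(kmer_depth(100, 250, 31))  # 88.0
```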

The formula assumes error-free reads, so in practice, the kmer depth (of correct kmers) will be somewhat lower. For assembly, a kmer depth of around 40x is usually sufficient, depending on factors like coverage variability. So, try using the longest kmer you can while maintaining sufficient coverage. In practice, though, the best way to find the optimal kmer length for assembly is typically to assemble with multiple kmer lengths and pick the assembly with the highest continuity.
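
As a rough starting point for that search, the depth formula can be turned around to find the longest K that still keeps a target kmer depth. The sketch below is only illustrative: `longest_k` is a hypothetical helper, the 40x target is the rule of thumb from above, and restricting to odd K is an assumption based on many de Bruijn graph assemblers expecting odd kmer lengths. It still assumes error-free reads, so treat the result as an upper bound and assemble with a few values below it.

```python
def longest_k(read_depth, read_len, min_kmer_depth, odd_only=True):
    """Longest K whose expected kmer depth DR*(R-K+1)/R is >= min_kmer_depth.

    Returns None if no K reaches the target (read depth itself is too low).
    Assumes error-free reads, so real kmer depth will be somewhat lower.
    """
    for k in range(read_len, 0, -1):
        if odd_only and k % 2 == 0:
            continue  # skip even K (assumption: the assembler wants odd K)
        # Integer form of DR*(R-K+1)/R >= min depth, avoids float rounding
        if read_depth * (read_len - k + 1) >= min_kmer_depth * read_len:
            return k
    return None


# 250bp reads at 100x read coverage, aiming for about 40x kmer depth
print(longest_k(100, 250, 40))  # 151
```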

Note that for error-correction, you must use a kmer of at most half the read length to correct the entire read. If you use K=180 with 250bp reads, you won't be able to correct the middle of the read; you should use something under K=125 in that situation.
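
One way to see where the half-read-length limit comes from is to count the bases that sit inside every kmer of a read. The sketch below rests on an assumption that is not stated above: that a corrector needs at least one kmer from the read that does not contain the erroneous base. The helper name `bases_in_every_kmer` is made up for illustration.

```python
def bases_in_every_kmer(read_len, k):
    """0-based positions of read bases that are contained in every kmer of the read.

    Kmers start at positions 0 .. read_len - k. A base is in all of them
    only if it lies in both the first and the last kmer, i.e. in the
    range [read_len - k, k - 1]. An error there spoils every kmer.
    """
    if not 1 <= k <= read_len:
        raise ValueError("k must be between 1 and the read length")
    return range(read_len - k, k) if 2 * k > read_len else range(0)


middle = bases_in_every_kmer(250, 180)
print(len(middle), middle[0], middle[-1])  # 110 70 179
print(len(bases_in_every_kmer(250, 125)))  # 0
```

With K=180 the middle 110 bases of a 250bp read are in every kmer, while at K=125 no base is.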

score 0 · Answer 2 · 2016-05-24

Keep in mind that many older kmer-based error correction tools like Quake and SOAPec assume a simple kmer distribution with relatively low heterozygosity (so one significant peak, corresponding to the highest kmer coverage). If you have a library prepped from a particularly heterozygous or polyploid sample, the kmer distribution will be more complex and you may have issues with error correction; in that case you may want to look at alternative error correction tools such as Lighter. Of course, if you have such data you will also have downstream difficulties with assembly and analysis, but that's another issue altogether ...

score 0 · Answer 3 · 2016-05-25

I am dealing with a diploid plant, so I expect a huge amount of repeats in my data.

For example, I see 2 distinct peaks in the data at a kmer value of 21. The second peak should be the repeat sequences, and its kmer depth was twice that of the first peak.

These 2 peaks stayed in the data until a kmer value of around 120, where they merged into 1 peak. However, the kmer depth there was only around 30. At a kmer value of 100 I get a kmer depth of 40, which is what I ideally want.

I will try running different kmer values and see which ones give the best assembly result. Thanks @Brian Bushnell and @Chris Fields