In simple words: what is a k-mer?
3.6 years ago

Hi everyone! I'm quite new to the NGS field. At the moment I'm working with 16S rRNA sequencing on Ion Torrent and trying to find a way to analyze my data. Everything is going more or less OK, but during alignment and taxonomic classification in Mothur I receive many messages about my sequences that look like this:

> 1read-161813 is bad. It has no kmers of length 8. [WARNING]:
> 1read-161813 could not be classified. You can use the remove.lineage
> command with taxon=unknown; to remove such sequences.

For one particular sample, this error left 68,000 of 160,000 reads "unclassified", which seems like a lot to me.

I've searched the internet to understand what a k-mer is, but I'm not sure I understand it completely. Is there anyone here who could try to explain what is going on? >.< Can I just remove these sequences? Or should I change the k-mer length from 8 to, say, 6 and try again?

Thank you!!

sequencing alignment next-gen genome • 18k views

See the k-mer entry on Wikipedia:

all the possible substrings of length k that are contained in a string


You already know many specific k-mers: a hexamer is a k-mer of length 6, a dimer is a k-mer of length 2, etc.


Dimer, trimer, pentamer, hexamer, heptamer, octamer for sure. Nonamer? Decamer?


Also read "Oligonucleotide Vs K-Mer", one of my favorite Biostars questions.

3.6 years ago

A kmer is just a nucleotide sequence of a certain length. For instance a dinucleotide is a kmer where k=2.

When we talk about all kmers, we mean all the possible sequences of that length. So for example, when K=2 all the possible kmers are: AA AT AC AG TA TT TC TG CA CT CC CG GA GT GC GG

K is usually bigger than 2, so we can talk about all 4mers (256 of them), all 6mers (4096 of them), all 7mers (16,384 of them) etc.
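To make the counting concrete, here is a minimal Python sketch. The function names `all_kmers` and `kmers_in` are made up for illustration. The first enumerates every possible kmer for a given k (there are 4**k of them for DNA), the second lists the kmers actually present in one sequence by sliding a window of width k along it.

```python
from itertools import product

def all_kmers(k, alphabet="ACGT"):
    """Return every possible k-mer over the alphabet (4**k of them for DNA)."""
    return ["".join(p) for p in product(alphabet, repeat=k)]

def kmers_in(seq, k):
    """Return the k-mers found in seq by sliding a window of width k."""
    # If seq is shorter than k, the range is empty and the result is [].
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

print(len(all_kmers(2)))     # 16
print(len(all_kmers(4)))     # 256
print(kmers_in("ATGCA", 3))  # ['ATG', 'TGC', 'GCA']
```

Note that a sequence shorter than k contains no kmers of length k at all, which is relevant to the warning in the question.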

3.6 years ago

From Wikipedia: the term k-mer typically refers to all the possible substrings of length k that are contained in a string.

https://en.wikipedia.org/wiki/K-mer

Check these 68,000 reads. What are their lengths? What do their sequences look like?
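One rough way to do that check, sketched in Python. The function name `summarize_reads` and the example reads are invented for illustration, and the reads are assumed to be already loaded as plain strings (parsing your FASTA/FASTQ file is left out).

```python
def summarize_reads(reads, k=8):
    """Summarize read lengths and ambiguous-base content for a list of sequences."""
    lengths = [len(r) for r in reads]
    return {
        "n_reads": len(reads),
        "min_len": min(lengths) if lengths else 0,
        "max_len": max(lengths) if lengths else 0,
        # Reads shorter than k cannot contain any k-mer of length k.
        "shorter_than_k": sum(1 for n in lengths if n < k),
        # Reads containing at least one ambiguous base call (N).
        "with_N": sum(1 for r in reads if "N" in r.upper()),
    }

# Hypothetical reads, invented for illustration only
reads = ["ACGTACGTAC", "ACG", "ACGNNACGTN"]
print(summarize_reads(reads))
```

If most of the flagged reads turn out to be very short or full of Ns, that already suggests what is wrong with them.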

3.6 years ago
H.Hasani ▴ 980

Hi,

Over-represented k-mers can indicate low quality or contamination in your sequences. Usually you compute this by checking whether any string of length k occurs in the reads more often than expected by chance. Tools like FastQC can also give you a visual representation of that and of what kind of sequence you are seeing. The benefit of such a QC measurement is that sometimes the adapter is not fully removed, so it escapes the direct adapter test. Regarding the length, maybe this post can answer a thing or two.
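As a very simplified illustration of the counting step only (this is not FastQC's actual method, and it does no statistical test against chance), k-mer occurrences across reads can be tallied like this. The function name `count_kmers` and the example reads are made up.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count how often each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

# Hypothetical reads, invented for illustration only
reads = ["ACGTACGT", "TTACGTAA"]
counts = count_kmers(reads, 4)
print(counts.most_common(1))  # [('ACGT', 3)]
```

A k-mer that shows up far more often than the others, for example a leftover piece of adapter, would stand out at the top of such a tally.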

hth

3.6 years ago
Martombo ★ 2.7k

Since nobody commented on what to do with these sequences: if a read doesn't have a k-mer of length 8, it has to be shorter than 8 nucleotides (or maybe it contains Ns that are removed). That means it's not going to be very informative for your analysis, even if you change the k-mer length, and it can be discarded.
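A small sketch of that filter in Python, in case you want to inspect or drop such reads yourself before classification. It assumes that a usable k-mer is a run of k unambiguous bases (A, C, G or T). That is my assumption for illustration, not confirmed Mothur behaviour, and the names `has_valid_kmer` and `split_reads` are hypothetical.

```python
import re

def has_valid_kmer(seq, k=8):
    """True if seq contains at least one run of k unambiguous bases (A, C, G, T)."""
    # Assumption: any k-mer containing N or another ambiguity code is unusable.
    return re.search(r"[ACGT]{%d}" % k, seq.upper()) is not None

def split_reads(reads, k=8):
    """Split reads into those with at least one valid k-mer and those without."""
    keep = [r for r in reads if has_valid_kmer(r, k)]
    drop = [r for r in reads if not has_valid_kmer(r, k)]
    return keep, drop

# Hypothetical reads, invented for illustration only
keep, drop = split_reads(["ACGTACGTAC", "ACGTACG", "ACGTNACGT"])
print(keep)  # ['ACGTACGTAC']
print(drop)  # ['ACGTACG', 'ACGTNACGT']
```

Looking at what ends up in `drop` also answers the earlier question about the lengths and sequences of the flagged reads.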
