Question: Representing a sequence as k-mer composition
2.7 years ago by
mdurrant10 wrote:


Could someone please explain this paragraph from McHardy et al., 2007:

Compositional sequence patterns.
For compositional feature analysis, we map a given piece of DNA sequence to a higher-dimensional space of nucleotide patterns o = {o1, o2, ..., oq}, where o is defined by the pattern length w and the number of literals l. In this space, s is represented by the compositional input vector v = (a1, a2, ..., aq); where ai is the frequency of pattern oi in s. Input vectors are normalized by the total number of patterns for each sequence.

I specifically want to understand how they would generate o from a given pattern length w and number of literals l. How would this be applied to an example DNA sequence?

dna kmer sequence • 862 views
2.6 years ago by
marsvetlana10 wrote:

Hi! w is the "word" length and l is the number of "letters" in the "alphabet". DNA usually has a four-letter alphabet (a, c, t, g), so l = 4, while w is chosen by the researcher. For example, with w = 2 and the alphabet {a,c,t,g}, we have o = {aa,ac,at,ag,ca,cc,ct,cg,ta,tc,tt,tg,ga,gc,gt,gg}. In this case, any DNA sequence can be characterized by 16 numbers, each representing the frequency (or number of occurrences) of one of these patterns (k-mers/motifs/words). The number of possible patterns is q = l to the power of w (here 4^2 = 16).
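To apply this to an example sequence, here is a minimal Python sketch. The function name kmer_composition and its parameters are made up for illustration; they are not from the paper. It also rests on two assumptions the quoted paragraph does not spell out: words are counted in overlapping windows, and "normalized by the total number of patterns" means dividing each count by the total number of words counted in the sequence.

```python
from itertools import product


def kmer_composition(seq, w=2, alphabet="actg"):
    """Return (patterns, vector) for a sequence.

    patterns: all len(alphabet)**w words of length w, in a fixed order.
    vector:   frequency of each word in seq (overlapping windows).
    Assumption: each count is divided by the total number of counted words.
    """
    seq = seq.lower()
    # Enumerate o = {o1, ..., oq}; q = l**w (16 for l = 4, w = 2).
    patterns = ["".join(p) for p in product(alphabet, repeat=w)]
    counts = dict.fromkeys(patterns, 0)
    total = 0
    # Slide a window of length w over the sequence, one base at a time.
    for i in range(len(seq) - w + 1):
        word = seq[i:i + w]
        if word in counts:  # skip windows with characters outside the alphabet, e.g. 'n'
            counts[word] += 1
            total += 1
    # Normalize counts to frequencies; an empty or too-short sequence gives all zeros.
    vector = [counts[p] / total if total else 0.0 for p in patterns]
    return patterns, vector


patterns, vector = kmer_composition("acgtac", w=2)
for pattern, freq in zip(patterns, vector):
    if freq > 0:
        print(pattern, freq)
```

For "acgtac" with w = 2 the windows are ac, cg, gt, ta, ac, so "ac" gets 2/5 and "cg", "gt" and "ta" each get 1/5; the other 12 entries of the 16-dimensional vector are zero.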
Hope it helps
