Question: Representing a sequence as k-mer composition
23 months ago by
mdurrant10 wrote:


Could someone please explain this paragraph from McHardy et al., 2007:

Compositional sequence patterns.
For compositional feature analysis, we map a given piece of DNA sequence to a higher-dimensional space of nucleotide patterns o = {o1, o2, ..., oq}, where o is defined by the pattern length w and the number of literals l. In this space, s is represented by the compositional input vector v = (a1, a2, ..., aq); where ai is the frequency of pattern oi in s. Input vectors are normalized by the total number of patterns for each sequence.

I specifically want to understand how they would generate o from a given pattern length w and number of literals l. How would this be applied to an example DNA sequence?

dna kmer sequence • 687 views
22 months ago by
marsvetlana10 wrote:

Hi! w is the "word" length and l is the number of "letters" in the "alphabet". DNA usually has a four-letter alphabet (a, c, t, g), so l = 4, and w is chosen by the researcher. For example, with w = 2 and the alphabet {a,c,t,g} we have o = {aa,ac,at,ag,ca,cc,ct,cg,ta,tc,tt,tg,ga,gc,gt,gg}. In this case, any DNA sequence can be characterized by 16 numbers, each of which represents the frequency (or number of occurrences) of one of these patterns (k-mers/motifs/words). The number of possible patterns is q = l to the power of w (here 4^2 = 16).
Hope it helps
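
To apply this to an example sequence, here is a minimal Python sketch. It is not the code from the paper. The function names (kmer_patterns, composition_vector) are made up for illustration, and it assumes overlapping windows on a single strand, with counts divided by the total number of counted windows. The paper may handle details such as reverse complements or ambiguous bases differently.

```python
from itertools import product


def kmer_patterns(alphabet="acgt", w=2):
    """Return all l**w patterns of length w over the alphabet (hypothetical helper)."""
    return ["".join(p) for p in product(alphabet, repeat=w)]


def composition_vector(seq, alphabet="acgt", w=2):
    """Return the normalized pattern frequencies of seq, ordered like kmer_patterns()."""
    seq = seq.lower()
    patterns = kmer_patterns(alphabet, w)
    counts = dict.fromkeys(patterns, 0)
    # Slide a window of length w over the sequence, one position at a time.
    for i in range(len(seq) - w + 1):
        kmer = seq[i:i + w]
        # Assumption: windows containing other characters (e.g. 'n') are skipped.
        if kmer in counts:
            counts[kmer] += 1
    total = sum(counts.values())
    # Normalize by the total number of counted windows, so the entries sum to 1.
    return [counts[p] / total if total else 0.0 for p in patterns]


patterns = kmer_patterns()             # 16 patterns for w = 2, l = 4
vector = composition_vector("acgtac")  # windows: ac, cg, gt, ta, ac
for pattern, freq in zip(patterns, vector):
    if freq > 0:
        print(pattern, freq)
```

For the sequence acgtac there are five windows of length 2, so "ac" gets frequency 2/5 and "cg", "gt" and "ta" get 1/5 each; the other 12 entries of the vector are zero.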
