Question: Representing a sequence as k-mer composition
2.7 years ago, mdurrant10 wrote:

Hello,

Could someone please explain this paragraph from McHardy et al., 2007:

Compositional sequence patterns.
For compositional feature analysis, we map a given piece of DNA sequence to a higher-dimensional space of nucleotide patterns o = {o1, o2, ..., oq}, where o is defined by the pattern length w and the number of literals l. In this space, s is represented by the compositional input vector v = (a1, a2, ..., aq); where ai is the frequency of pattern oi in s. Input vectors are normalized by the total number of patterns for each sequence.

I specifically want to understand how they would generate o from a given pattern length w and number of literals l. How would this be applied to an example DNA sequence?

2.6 years ago, marsvetlana10 wrote:

Hi! w is the "word" length and l is the number of "letters" in the "alphabet". DNA usually has a four-letter alphabet (a, c, t, g), so l = 4; w is chosen by the researcher. For example, with w = 2 and the alphabet {a,c,t,g} we have o = {aa,ac,at,ag,ca,cc,ct,cg,ta,tc,tt,tg,ga,gc,gt,gg}. In this case, any DNA sequence can be characterized by 16 numbers, each of which represents the frequency (or number of occurrences) of one of these patterns (k-mers/motifs/words). The number of possible patterns is l to the power of w, i.e. q = l^w (here 4^2 = 16).
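
To make this concrete, here is a minimal Python sketch of how such a vector could be built. It is not the code from McHardy et al.; the function name kmer_vector, the use of overlapping windows, and the choice to skip windows containing letters outside the alphabet (such as N) are my own assumptions.

```python
from collections import Counter
from itertools import product


def kmer_vector(seq, w=2, alphabet="actg"):
    """Return (patterns, frequencies) for sequence seq.

    Hypothetical helper for illustration; assumes overlapping windows
    and ignores windows with letters outside the alphabet.
    """
    seq = seq.lower()
    # All l**w possible words of length w, e.g. aa, ac, at, ag, ca, ...
    patterns = ["".join(p) for p in product(alphabet, repeat=w)]
    # Count every overlapping window of length w in the sequence.
    counts = Counter(seq[i:i + w] for i in range(len(seq) - w + 1))
    # Normalize by the total number of counted patterns.
    total = sum(counts[p] for p in patterns)
    freqs = [counts[p] / total if total else 0.0 for p in patterns]
    return patterns, freqs


patterns, v = kmer_vector("acgtacgt", w=2)
for pattern, freq in zip(patterns, v):
    if freq > 0:
        print(pattern, round(freq, 3))
```

For "acgtacgt" there are 7 overlapping 2-mers (ac, cg, gt, ta, ac, cg, gt), so ac, cg and gt each get 2/7, ta gets 1/7, and the other 12 entries of v are 0.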
Hope it helps