Question: How to count and compare k-mer count vectors (and print the top ten highest contributions)?
gravatar for nihaldm169
4 weeks ago by
nihaldm1690 wrote:


For a project I got two FASTA samples of amino acid sequences. I should write a function that compares the two samples and then prints out into a file the k-mers with the top ten highest contributions (length of k = 12). How can I do this, maybe using k-mer count vectors? Or what method is useful for this?

Happy about any help, also links to webpages or tutorials!


ADD COMMENTlink modified 4 weeks ago by Joe15k • written 4 weeks ago by nihaldm1690

How big are the sequences you need to assess? (I.e. does the speed of the solution matter?)

ADD REPLYlink written 4 weeks ago by Joe15k

The speed doesn't matter, no. And I don't know how big the sequences are... We got sent two files with the sequences yet in them. But I need to compare sequences of 12 letters length.

ADD REPLYlink written 4 weeks ago by nihaldm1690
gravatar for Joe
4 weeks ago by
United Kingdom
Joe15k wrote:

Here's a solution:

import re, sys
from collections import Counter
from Bio import SeqIO

k = 12
n = 10
kmers = re.compile("(?=(\w{%s}))" % k)

for rec in SeqIO.parse(sys.argv[1], 'fasta'):
    kseqs = kmers.findall(str(rec.seq))
    counts = Counter(kseqs)
    top_n = counts.most_common(n)

This prints the top n kmers of length k for each each sequence in the given fasta.

On a 5Mb bacterial genome this takes:

real    0m6.697s
user    0m6.060s
sys 0m0.624s

NB I haven't sanity checked the ouput, nor have I 'prettified' it, so use the above code advisedly and please do your own sanity checking. This is also not a super memory efficient script, so just beware.

I'm not sure what you can do to compare them personally. You will likely need to normalise the results for the length of the input sequence though, if they arent identical. You'll need to look for what tests you can do with categorical data though.

ADD COMMENTlink modified 4 weeks ago • written 4 weeks ago by Joe15k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1398 users visited in the last hour