Question: k-mer finding and counting
1
gravatar for sami
5 weeks ago by
sami20
sami20 wrote:

I have a fastq file. I want to have k-mers and their abundance numbers. can you give me some tools to do this! thank you

k-mer • 242 views
ADD COMMENTlink modified 5 weeks ago by wljlinksly50 • written 5 weeks ago by sami20

What were some tools you came across when you googled how to do this?

ADD REPLYlink written 5 weeks ago by RamRS30k
1

i have not google it , i ll try

ADD REPLYlink modified 5 weeks ago • written 5 weeks ago by sami20
3
gravatar for Alex Reynolds
5 weeks ago by
Alex Reynolds30k
Seattle, WA USA
Alex Reynolds30k wrote:

If you're working with small values of k, I wrote a command-line k-mer counter called kmer-counter that will output results in a form that a Python script can consume:

https://github.com/alexpreynolds/kmer-counter

You can grab, build and install it like so:

$ git clone https://github.com/alexpreynolds/kmer-counter.git
$ cd kmer-counter
$ make
$ cp kmer-counter /usr/local/bin

Once the binary is in your path, you might use it in Python like so:

k = 6
fastaFile = '/path/to/some/seqs.fa'
kmerCmd = 'kmer-counter --fasta --k=%d %s' % (k, fastaFile)
try:
    output = subprocess.check_output(kmerCmd, shell=True)
    result = {}
    for line in output.splitlines():
        (header, counts) = line.strip().split('\t')
        header = header[1:]
        kmers = dict((k,int(v)) for (k,v) in [d.split(':') for d in counts.split(' ')])
        result[header] = kmers
    sys.stdout.write("%s" % (str(result)))
except subprocess.CalledProcessError as error:
    sys.stderr.write("%s" % (str(error)))

Given example FASTA like this:

>foo
TTAACG
>bar
GTGGAAGTTCTTAGGGCATGGCAAAGAGTCAGAATTTGAC

For k=6, you would get an iterable Python dictionary like this:

{'foo': {'TTAACG': 1, 'CGTTAA': 1}, 'bar': {'GTTCTT': 1, 'AGAACT': 1, 'GAGTCA': 1, 'ATGGCA': 1, 'GAACTT': 1, 'ATTCTG': 1, 'CTAAGA': 1, 'CTTCCA': 1, 'ATTTGA': 1, 'GGAAGT': 1, 'AGGGCA': 1, 'CCTAAG': 1, 'CTCTTT': 1, 'AATTTG': 1, 'TCTGAC': 1, 'TTTGCC': 1, 'CTTAGG': 1, 'TTTGAC': 1, 'GAAGTT': 1, 'CCCTAA': 1, 'AGAATT': 1, 'AGTCAG': 1, 'CTGACT': 1, 'TCTTAG': 1, 'CGTTAA': 1, 'GTGGAA': 1, 'TGCCAT': 1, 'ACTCTT': 1, 'GGGCAT': 1, 'TTAGGG': 1, 'CTTTGC': 1, 'TGGAAG': 1, 'GACTCT': 1, 'CATGCC': 1, 'GCAAAG': 1, 'AAATTC': 1, 'GTCAAA': 1, 'TGACTC': 1, 'TAGGGC': 1, 'AAGTTC': 1, 'ATGCCC': 1, 'TCAAAT': 1, 'CAAAGA': 1, 'AACTTC': 1, 'GTCAGA': 1, 'CAAATT': 1, 'TAAGAA': 1, 'CATGGC': 1, 'AAGAAC': 1, 'AAGAGT': 1, 'TCTTTG': 1, 'TTCCAC': 1, 'TGGCAA': 1, 'GGCAAA': 1, 'AGTTCT': 1, 'AGAGTC': 1, 'TCAGAA': 1, 'GAATTT': 1, 'AAAGAG': 1, 'TGCCCT': 1, 'CCATGC': 1, 'GGCATG': 1, 'TTGCCA': 1, 'CAGAAT': 1, 'AATTCT': 1, 'GCATGG': 1, 'ACTTCC': 1, 'TTCTTA': 1, 'GCCATG': 1, 'GCCCTA': 1, 'TTCTGA': 1}}

You can use standard Python calls to manipulate this dictionary object and get sums of counts per record, for sequence, etc. which seems to answer your question. Please feel free to clarify what you're looking for if this object representation is not clear.

For a fully-fleshed out demonstration, see: https://github.com/alexpreynolds/kmer-counter/blob/master/test/kmer-test.py

If you are working with large values of k, then you might look at other tools, like Jellyfish. They use data structures to give an approximate count for large values of k that are not tenable for strict word counts.

ADD COMMENTlink modified 5 weeks ago • written 5 weeks ago by Alex Reynolds30k
3
gravatar for kristina.mahan
5 weeks ago by
kristina.mahan100 wrote:

You can use Jellyfish, KMC3, kmerfreq, BFCounter, ntCard . . .

ADD COMMENTlink written 5 weeks ago by kristina.mahan100
3
gravatar for wljlinksly
5 weeks ago by
wljlinksly50
wljlinksly50 wrote:

i would prefer this one: https://github.com/lh3/yak by Heng Li

ADD COMMENTlink written 5 weeks ago by wljlinksly50

why they use always 31 as maximum size of kmer ?

ADD REPLYlink written 5 weeks ago by sami20
1

A canonical 31-mer can be represented with a 64-bit unsigned integer. Larger word lengths could be used at the expense of more memory and time, so that's a common cutoff when using that data structure to store a kmer key for presence/absence or counts.

If you use a larger k, tools like Jellyfish use a data structure called a Bloom filter (https://en.wikipedia.org/wiki/Bloom_filter) to be able to manage performance and storage, with the trade-off of giving an inexact count.

ADD REPLYlink modified 4 weeks ago • written 5 weeks ago by Alex Reynolds30k

Where is that specified?

ADD REPLYlink written 5 weeks ago by RamRS30k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 726 users visited in the last hour