k-mer finding and counting
3
2
Entering edit mode
3.7 years ago
sami ▴ 40

I have a fastq file. I want to have k-mers and their abundance numbers. can you give me some tools to do this! thank you

k-mer • 5.9k views
ADD COMMENT
0
Entering edit mode

What were some tools you came across when you googled how to do this?

ADD REPLY
1
Entering edit mode

i have not google it , i ll try

ADD REPLY
3
Entering edit mode
3.7 years ago

If you're working with small values of k, I wrote a command-line k-mer counter called kmer-counter that will output results in a form that a Python script can consume:

https://github.com/alexpreynolds/kmer-counter

You can grab, build and install it like so:

$ git clone https://github.com/alexpreynolds/kmer-counter.git
$ cd kmer-counter
$ make
$ cp kmer-counter /usr/local/bin

Once the binary is in your path, you might use it in Python like so:

k = 6
fastaFile = '/path/to/some/seqs.fa'
kmerCmd = 'kmer-counter --fasta --k=%d %s' % (k, fastaFile)
try:
    output = subprocess.check_output(kmerCmd, shell=True)
    result = {}
    for line in output.splitlines():
        (header, counts) = line.strip().split('\t')
        header = header[1:]
        kmers = dict((k,int(v)) for (k,v) in [d.split(':') for d in counts.split(' ')])
        result[header] = kmers
    sys.stdout.write("%s" % (str(result)))
except subprocess.CalledProcessError as error:
    sys.stderr.write("%s" % (str(error)))

Given example FASTA like this:

>foo
TTAACG
>bar
GTGGAAGTTCTTAGGGCATGGCAAAGAGTCAGAATTTGAC

For k=6, you would get an iterable Python dictionary like this:

{'foo': {'TTAACG': 1, 'CGTTAA': 1}, 'bar': {'GTTCTT': 1, 'AGAACT': 1, 'GAGTCA': 1, 'ATGGCA': 1, 'GAACTT': 1, 'ATTCTG': 1, 'CTAAGA': 1, 'CTTCCA': 1, 'ATTTGA': 1, 'GGAAGT': 1, 'AGGGCA': 1, 'CCTAAG': 1, 'CTCTTT': 1, 'AATTTG': 1, 'TCTGAC': 1, 'TTTGCC': 1, 'CTTAGG': 1, 'TTTGAC': 1, 'GAAGTT': 1, 'CCCTAA': 1, 'AGAATT': 1, 'AGTCAG': 1, 'CTGACT': 1, 'TCTTAG': 1, 'CGTTAA': 1, 'GTGGAA': 1, 'TGCCAT': 1, 'ACTCTT': 1, 'GGGCAT': 1, 'TTAGGG': 1, 'CTTTGC': 1, 'TGGAAG': 1, 'GACTCT': 1, 'CATGCC': 1, 'GCAAAG': 1, 'AAATTC': 1, 'GTCAAA': 1, 'TGACTC': 1, 'TAGGGC': 1, 'AAGTTC': 1, 'ATGCCC': 1, 'TCAAAT': 1, 'CAAAGA': 1, 'AACTTC': 1, 'GTCAGA': 1, 'CAAATT': 1, 'TAAGAA': 1, 'CATGGC': 1, 'AAGAAC': 1, 'AAGAGT': 1, 'TCTTTG': 1, 'TTCCAC': 1, 'TGGCAA': 1, 'GGCAAA': 1, 'AGTTCT': 1, 'AGAGTC': 1, 'TCAGAA': 1, 'GAATTT': 1, 'AAAGAG': 1, 'TGCCCT': 1, 'CCATGC': 1, 'GGCATG': 1, 'TTGCCA': 1, 'CAGAAT': 1, 'AATTCT': 1, 'GCATGG': 1, 'ACTTCC': 1, 'TTCTTA': 1, 'GCCATG': 1, 'GCCCTA': 1, 'TTCTGA': 1}}

You can use standard Python calls to manipulate this dictionary object and get sums of counts per record, for sequence, etc. which seems to answer your question. Please feel free to clarify what you're looking for if this object representation is not clear.

For a fully-fleshed out demonstration, see: https://github.com/alexpreynolds/kmer-counter/blob/master/test/kmer-test.py

If you are working with large values of k, then you might look at other tools, like Jellyfish. They use data structures to give an approximate count for large values of k that are not tenable for strict word counts.

ADD COMMENT
3
Entering edit mode
3.7 years ago

You can use Jellyfish, KMC3, kmerfreq, BFCounter, ntCard . . .

ADD COMMENT
3
Entering edit mode
3.7 years ago
wulj2 ▴ 50

i would prefer this one: https://github.com/lh3/yak by Heng Li

ADD COMMENT
0
Entering edit mode

why they use always 31 as maximum size of kmer ?

ADD REPLY
1
Entering edit mode

A canonical 31-mer can be represented with a 64-bit unsigned integer. Larger word lengths could be used at the expense of more memory and time, so that's a common cutoff when using that data structure to store a kmer key for presence/absence or counts.

If you use a larger k, tools like Jellyfish use a data structure called a Bloom filter (https://en.wikipedia.org/wiki/Bloom_filter) to be able to manage performance and storage, with the trade-off of giving an inexact count.

ADD REPLY
0
Entering edit mode

Hey Alex do you have a example in how to use your kmer count in a banch of files in different directories? Thanks

ADD REPLY
1
Entering edit mode

Just point the binary at the file. See https://github.com/alexpreynolds/kmer-counter/blob/master/test/makefile#L19 for a basic example. If you have a lot of files, iterate through a bash array or similar scripted approach.

ADD REPLY
0
Entering edit mode

Where is that specified?

ADD REPLY

Login before adding your answer.

Traffic: 2129 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6