Forum: Check Kmer Count Results
3.6 years ago, hatirlatici (0) wrote:


Recently, I was interviewed by a bioinformatics company. They gave me a quick test on kmer counting. I wrote a program that takes three parameters, e.g. ./progname fastqfile kmer-length top-count, and is supposed to report the top-count most frequent kmers of length kmer-length in fastqfile. So if you type ./progname sample.fastq 30 25, it should list the 25 most frequent 30-mers in sample.fastq. I thought I had managed it, but I was rejected three times. They told me by email that the results were wrong. I used KMC2 to check my results: I produced a text file, sorted it by 30-mer frequency, and took the first 25 lines. The results were the same as mine. I am very confused. Is there anywhere online where I can check, for a given fastq sample file, the counts of all kmers of a specific length? I can share my results here too.
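
For reference, here is a minimal Python sketch of the kind of program described above. It is not the program from the interview. It assumes plain 4-line FASTQ records, counts kmers exactly as they appear in the reads (no reverse-complement merging), and skips any kmer containing N. The function names `read_fastq_sequences` and `top_kmers` are made up for this illustration.

```python
from collections import Counter
import io


def read_fastq_sequences(handle):
    """Yield the sequence line of each record, assuming 4 lines per record."""
    for i, line in enumerate(handle):
        if i % 4 == 1:
            yield line.strip().upper()


def top_kmers(handle, k, top_count):
    """Return the top_count most frequent k-mers as (kmer, count) pairs."""
    counts = Counter()
    for seq in read_fastq_sequences(handle):
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if "N" not in kmer:  # assumption: k-mers spanning N are ignored
                counts[kmer] += 1
    # Sort by descending count, then by k-mer, so ties break deterministically.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_count]


if __name__ == "__main__":
    demo = io.StringIO(
        "@r1\nACGTACGT\n+\nIIIIIIII\n"
        "@r2\nACGTNACG\n+\nIIIIIIII\n"
    )
    for kmer, count in top_kmers(demo, 4, 3):
        print(kmer, count)
```

If several kmers share the same count at the cut-off, the "top 25" list is not unique, so two correct programs can print different kmers in the last positions unless ties are broken the same way.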

Thanks.

sequence forum • 1.8k views

Perhaps also look at results from khmer or Jellyfish.

written 3.6 years ago by Alex Reynolds (30k)

I also checked them, but the results did not match. So how can someone be sure that, for a given fastq file, the kmer counts are accurate?

written 3.6 years ago by hatirlatici (0)

You can try from BBMap. @Brian Bushnell participates in this forum regularly. If you have any specific questions he can address them.

written 3.6 years ago by genomax (87k)

Thanks for the suggestion. Actually, my question was a bit different. There are a lot of kmer counting programs out there, many ... . How can I be confident that this program (BBMap) gives the true output? If I put this on my console: ./ in=ERR059924.filt.fastq out=counts.txt fastadump=f mincount=10 k=30 rcomp=f

(assume the ERR file is in the same directory), it produces a "counts.txt" file. If I use different programs (like kmc, jellyfish, BFCounter, etc.), they also produce a histogram file like BBMap's, with two columns: the kmer and its number of occurrences. The sticking point for me is how to determine which program gives the accurate results. Even when I use smaller fastq files, the output files are different. I am really confused. If you say that BBMap is the most accurate and accepted one that I can trust, that would make my job easy. How can so many kmer counting programs give different results for the same fastq input file?
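
One way to check whether two such two-column outputs really disagree, or only differ in line order, is to load each into a dictionary and compare the dictionaries. The sketch below assumes whitespace- or tab-separated "kmer count" lines. `load_counts` and `diff_counts` are hypothetical helper names, not part of any of the tools mentioned.

```python
def load_counts(lines):
    """Parse 'kmer<whitespace>count' lines into a {kmer: count} dict."""
    counts = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        kmer, count = fields
        counts[kmer.upper()] = int(count)
    return counts


def diff_counts(a, b):
    """Return {kmer: (count_in_a, count_in_b)} for every disagreement.

    A count of None means the k-mer is absent from that output.
    """
    return {
        kmer: (a.get(kmer), b.get(kmer))
        for kmer in sorted(a.keys() | b.keys())
        if a.get(kmer) != b.get(kmer)
    }


if __name__ == "__main__":
    out_a = ["ACGT\t3", "CGTA\t1"]
    out_b = ["CGTA 1", "ACGT 2", "GTAC 1"]
    print(diff_counts(load_counts(out_a), load_counts(out_b)))
```

An empty result means the two files hold the same counts. With real files you would pass open file handles instead of the lists used here. A minimum-count cut-off applied by only one of the programs will show up as kmers missing from that program's output.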

modified 3.6 years ago • written 3.6 years ago by hatirlatici (0)

I am going to tag Brian Bushnell who is the author of BBMap package.

He can comment on your specific concern, but the BBMap suite has always produced results that I trust so far.

written 3.6 years ago by genomax (87k)

I have not used kmc or BFCounter, but BBMap's KmerCountExact and Jellyfish produced identical counts last time I tested them. Basically, all programs should produce identical output, but the definition of identical is a little tricky here.

1) Most counting programs store either a kmer or its reverse-complement. For Jellyfish the default is to store both and it's non-default to just store one canonical representation. For KmerCountExact the default is to store only one, and it's non-default (set rcomp=f) to store both. So, their default outputs and counts are different, but if you change the defaults, they should match.

2) When it comes to canonical kmers, there are two ways to do it - you can either store the greater of the (kmer, reverse kmer) pair, or the lesser. KmerCountExact stores the greater (typically starting with T) while Jellyfish stores the lesser (typically starting with A). So when storing canonical kmers, you would need to reverse-complement the output of one to get the output of the other (for output in fasta format, you can do that with's "rcomp" flag). Alternately, if the programs are storing kmers and their reverse-complements, the output will be equivalent, as long as it's in the same order.

3) I'm not sure about Jellyfish, but KmerCountExact prints kmers in a random order, which is nondeterministic. For 2-column output, you can use the Linux sort utility to compare the 2-column tab-delimited output of 2 programs. For fasta, you'd need to sort by sequence. You could do this with by setting "k=30 hashes=0".

4) Lastly, some programs may do something different, but KmerCountExact (and, IIRC, Jellyfish) ignore all kmers spanning Ns.
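
The points above can be illustrated with a small Python sketch. It is not the code of KmerCountExact or Jellyfish. It assumes that "lesser" and "greater" mean plain lexicographic order on the kmer strings, and the names `count_kmers` and `flip` are invented for this example.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer):
    """Return the reverse complement of an uppercase ACGT string."""
    return kmer.translate(_COMPLEMENT)[::-1]


def count_kmers(seq, k, mode="as_read"):
    """Count k-mers in seq, skipping any k-mer that contains N.

    mode="as_read": count each k-mer exactly as it appears
    mode="lesser":  merge each k-mer with its reverse complement, keep the smaller
    mode="greater": same merge, but keep the larger of the pair
    """
    if mode not in ("as_read", "lesser", "greater"):
        raise ValueError("unknown mode: " + mode)
    counts = Counter()
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if "N" in kmer:
            continue
        if mode == "lesser":
            kmer = min(kmer, reverse_complement(kmer))
        elif mode == "greater":
            kmer = max(kmer, reverse_complement(kmer))
        counts[kmer] += 1
    return counts


def flip(counts):
    """Reverse-complement every key, converting one canonical form to the other."""
    return Counter({reverse_complement(k): c for k, c in counts.items()})


if __name__ == "__main__":
    seq = "AAAGTTTNAAA"
    as_read = count_kmers(seq, 3, "as_read")
    lesser = count_kmers(seq, 3, "lesser")
    greater = count_kmers(seq, 3, "greater")
    print(sorted(as_read.items()))
    print(sorted(lesser.items()))
    print(flip(lesser) == greater)
```

In this toy input, AAA is counted twice when kmers are taken as read, but three times in canonical mode, because TTT is merged into it. That is the kind of difference you see when comparing two programs with different reverse-complement settings.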

written 3.6 years ago by Brian Bushnell (17k)

Thank you Brian! I thought I was wrong, but let me check...

written 3.6 years ago by hatirlatici (0)