I have some k-mers that neither Jellyfish nor DSK was able to count. Is that a bug in the programs?
This is my script for JellyFish
grep TTACATAACACCCATTGTGGCGGCTGCAAGT ABCD.fq | wc
41 41 5166
The 31 length k-mer is in the sample file and there are many of them.
jellyfish count -m 31 -s 100M -C ABCD.fq -o 1.jf
- jellyfish dump 1.jf -c > 1.txt
grep TTACATAACACCCATTGTGGCGGCTGCAAGT 1.txt
returns nothing
Then I tried DSK,
- dsk -kmer-size 31 -abundance-min 0 -file ABCD.fq -out-dir ABCD -out ABCD.h5
- dsk2ascii -file ABCD.h5 -out ABCD.txt
grep TTACATAACACCCATTGTGGCGGCTGCAAGT *.txt
still returns nothing
As you can see, the 31 k-mer is in the input file (ABCD.fq) and there are many of them not being filtered away. I don't see a problem in the input file - it came out from a simulator. It is a valid FASTQ file. Why both programs weren't able to count my k-mer?
Have you tried grepping for the reverse-complement?
Also, maybe you should try the BBMap package's KmerCountExact tool. It will, coincidentally, report that specific kmer by default, because it stores the higher, rather than lower, of each kmer/reverse-kmer pair. The other tools store the lower one.
DSK indeed will use a canonical form of the kmer that corresponds to the minimum value of the kmer and its reverse-complement. So in the example one should grep ACTTGCAGCCGCCACAATGGGTGTTATGTAA.
Brian. Thanks for the tip. I will give a try. I will try the BBMap tool and report back.
Why would both JellyFish and DSK count on reverse-complement?
It's faster and uses less memory to store either a kmer OR its reverse-complement instead of both, and for most purposes, it makes no difference because if a kmer exists in a genome, its reverse-complement must also exist because DNA is double-stranded. So, counting only one or the other cuts memory use in half.
Counting 'canonical' k-mers makes sense if your input genome assembly is incomplete..