Question

Bug in JellyFish and DSK k-mer counting tool?

0

Entering edit mode

8.0 years ago

scchess ▴ 640

I have some k-mers that neither Jellyfish nor DSK was able to count. Is that a bug in the programs?

This is my script for JellyFish

wget https://s3.amazonaws.com/sequins/others/ABCD.fq
grep TTACATAACACCCATTGTGGCGGCTGCAAGT ABCD.fq | wc

41 41 5166

The 31 length k-mer is in the sample file and there are many of them.
jellyfish count -m 31 -s 100M -C ABCD.fq -o 1.jf
jellyfish dump 1.jf -c > 1.txt
grep TTACATAACACCCATTGTGGCGGCTGCAAGT 1.txt

returns nothing

Then I tried DSK,

dsk -kmer-size 31 -abundance-min 0 -file ABCD.fq -out-dir ABCD -out ABCD.h5
dsk2ascii -file ABCD.h5 -out ABCD.txt
grep TTACATAACACCCATTGTGGCGGCTGCAAGT *.txt

still returns nothing

As you can see, the 31 k-mer is in the input file (ABCD.fq) and there are many of them not being filtered away. I don't see a problem in the input file - it came out from a simulator. It is a valid FASTQ file. Why both programs weren't able to count my k-mer?

dsk jellyfish k-mer • 3.3k views

ADD COMMENT • link 8.0 years ago by scchess ▴ 640

2

Entering edit mode

Have you tried grepping for the reverse-complement?

Also, maybe you should try the BBMap package's KmerCountExact tool. It will, coincidentally, report that specific kmer by default, because it stores the higher, rather than lower, of each kmer/reverse-kmer pair. The other tools store the lower one.

ADD REPLY • link 8.0 years ago by Brian Bushnell 20k

2

Entering edit mode

DSK indeed will use a canonical form of the kmer that corresponds to the minimum value of the kmer and its reverse-complement. So in the example one should grep ACTTGCAGCCGCCACAATGGGTGTTATGTAA.

ADD REPLY • link 7.9 years ago by edrezen ▴ 730

0

Entering edit mode

Brian. Thanks for the tip. I will give a try. I will try the BBMap tool and report back.

ADD REPLY • link 8.0 years ago by scchess ▴ 640

0

Entering edit mode

Why would both JellyFish and DSK count on reverse-complement?

ADD REPLY • link 8.0 years ago by scchess ▴ 640

3

Entering edit mode

It's faster and uses less memory to store either a kmer OR its reverse-complement instead of both, and for most purposes, it makes no difference because if a kmer exists in a genome, its reverse-complement must also exist because DNA is double-stranded. So, counting only one or the other cuts memory use in half.

ADD REPLY • link 7.9 years ago by Brian Bushnell 20k

2

Entering edit mode

Counting 'canonical' k-mers makes sense if your input genome assembly is incomplete..

jellyfish count --help
..
 -C, --canonical                          Count both strand, canonical representation (false)
..

ADD REPLY • link 7.9 years ago by 5heikki 11k