Question: Bug in JellyFish and DSK k-mer counting tool?
0
gravatar for student-t
10 months ago by
student-t380
Australia
student-t380 wrote:

I have some k-mers that neither Jellyfish nor DSK was able to count. Is that a bug in the programs?

This is my script for JellyFish

  1. wget https://s3.amazonaws.com/sequins/others/ABCD.fq

  2. grep TTACATAACACCCATTGTGGCGGCTGCAAGT ABCD.fq | wc

    41 41 5166

    The 31 length k-mer is in the sample file and there are many of them.

  3. jellyfish count -m 31 -s 100M -C ABCD.fq -o 1.jf

  4. jellyfish dump 1.jf -c > 1.txt
  5. grep TTACATAACACCCATTGTGGCGGCTGCAAGT 1.txt

    returns nothing

Then I tried DSK,

  1. dsk -kmer-size 31 -abundance-min 0 -file ABCD.fq -out-dir ABCD -out ABCD.h5
  2. dsk2ascii -file ABCD.h5 -out ABCD.txt
  3. grep TTACATAACACCCATTGTGGCGGCTGCAAGT *.txt

    still returns nothing

As you can see, the 31 k-mer is in the input file (ABCD.fq) and there are many of them not being filtered away. I don't see a problem in the input file - it came out from a simulator. It is a valid FASTQ file. Why both programs weren't able to count my k-mer?

jellyfish dsk k-mer • 425 views
ADD COMMENTlink modified 10 months ago • written 10 months ago by student-t380
1

Have you tried grepping for the reverse-complement?

Also, maybe you should try the BBMap package's KmerCountExact tool. It will, coincidentally, report that specific kmer by default, because it stores the higher, rather than lower, of each kmer/reverse-kmer pair. The other tools store the lower one.

ADD REPLYlink modified 10 months ago • written 10 months ago by Brian Bushnell14k
1

DSK indeed will use a canonical form of the kmer that corresponds to the minimum value of the kmer and its reverse-complement. So in the example one should grep ACTTGCAGCCGCCACAATGGGTGTTATGTAA.

ADD REPLYlink written 9 months ago by edrezen680

Brian. Thanks for the tip. I will give a try. I will try the BBMap tool and report back.

ADD REPLYlink modified 10 months ago • written 10 months ago by student-t380

Why would both JellyFish and DSK count on reverse-complement?

ADD REPLYlink modified 10 months ago • written 10 months ago by student-t380
2

It's faster and uses less memory to store either a kmer OR its reverse-complement instead of both, and for most purposes, it makes no difference because if a kmer exists in a genome, its reverse-complement must also exist because DNA is double-stranded. So, counting only one or the other cuts memory use in half.

ADD REPLYlink written 10 months ago by Brian Bushnell14k
1

Counting 'canonical' k-mers makes sense if your input genome assembly is incomplete..

jellyfish count --help
..
 -C, --canonical                          Count both strand, canonical representation (false)
..
ADD REPLYlink written 9 months ago by 5heikki6.6k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 751 users visited in the last hour