Question: Bug in JellyFish and DSK k-mer counting tool?
0
gravatar for SmallChess
2.2 years ago by
SmallChess470
Australia
SmallChess470 wrote:

I have some k-mers that neither Jellyfish nor DSK was able to count. Is that a bug in the programs?

This is my script for JellyFish

  1. wget https://s3.amazonaws.com/sequins/others/ABCD.fq

  2. grep TTACATAACACCCATTGTGGCGGCTGCAAGT ABCD.fq | wc

    41 41 5166

    The 31 length k-mer is in the sample file and there are many of them.

  3. jellyfish count -m 31 -s 100M -C ABCD.fq -o 1.jf

  4. jellyfish dump 1.jf -c > 1.txt
  5. grep TTACATAACACCCATTGTGGCGGCTGCAAGT 1.txt

    returns nothing

Then I tried DSK,

  1. dsk -kmer-size 31 -abundance-min 0 -file ABCD.fq -out-dir ABCD -out ABCD.h5
  2. dsk2ascii -file ABCD.h5 -out ABCD.txt
  3. grep TTACATAACACCCATTGTGGCGGCTGCAAGT *.txt

    still returns nothing

As you can see, the 31 k-mer is in the input file (ABCD.fq) and there are many of them not being filtered away. I don't see a problem in the input file - it came out from a simulator. It is a valid FASTQ file. Why both programs weren't able to count my k-mer?

jellyfish dsk k-mer • 1.1k views
ADD COMMENTlink modified 2.2 years ago • written 2.2 years ago by SmallChess470
2

Have you tried grepping for the reverse-complement?

Also, maybe you should try the BBMap package's KmerCountExact tool. It will, coincidentally, report that specific kmer by default, because it stores the higher, rather than lower, of each kmer/reverse-kmer pair. The other tools store the lower one.

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by Brian Bushnell16k
2

DSK indeed will use a canonical form of the kmer that corresponds to the minimum value of the kmer and its reverse-complement. So in the example one should grep ACTTGCAGCCGCCACAATGGGTGTTATGTAA.

ADD REPLYlink written 2.2 years ago by edrezen720

Brian. Thanks for the tip. I will give a try. I will try the BBMap tool and report back.

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by SmallChess470

Why would both JellyFish and DSK count on reverse-complement?

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by SmallChess470
3

It's faster and uses less memory to store either a kmer OR its reverse-complement instead of both, and for most purposes, it makes no difference because if a kmer exists in a genome, its reverse-complement must also exist because DNA is double-stranded. So, counting only one or the other cuts memory use in half.

ADD REPLYlink written 2.2 years ago by Brian Bushnell16k
2

Counting 'canonical' k-mers makes sense if your input genome assembly is incomplete..

jellyfish count --help
..
 -C, --canonical                          Count both strand, canonical representation (false)
..
ADD REPLYlink written 2.2 years ago by 5heikki8.1k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 983 users visited in the last hour