Hi all, I am using jellyfish to count kmers of length 31 that appear in a viral genome sequence. These were the commands I ran:
jellyfish count -m 31 -s 154675 -t 10 NC_001798.fa # the -s parameter is based on the size of the NC_001798 genome
jellyfish dump mer_counts.jf > mer_counts_dumps.fa
I then took a random 31-mer from mer_counts_dumps.fa (e.g. GGGCGGGGGTCGGGCGGGCGGGGGTCGGGCG, which is reported with a count of 13) and grepped for it in the original input file, NC_001798.fa. This is the command I ran (the tr step removes whitespace so that 31-mers spanning a line break are still matched):
cat NC_001798.fa | tr -d " \t\n\r" | grep -o GGGCGGGGGTCGGGCGGGCGGGGGTCGGGCG | wc -l
However, this returns only 5, which suggests that the 31-mer appears 5 times in the FASTA file rather than 13. Does anyone know what may be causing the discrepancy? I also tried kmercountexact.sh from the BBMap suite, and it also reports a count of 13 for this specific 31-mer, so I'm wondering whether my method of grepping for the 31-mer in the FASTA file is flawed. I see the same problem for multiple 31-mers with a count greater than 1 in mer_counts_dumps.fa.
Thanks!
Best, Elaine
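One likely explanation is that grep -o reports only non-overlapping matches, while a k-mer counter counts every start position. The example 31-mer is a tandem repeat: its first 15 bases (GGGCGGGGGTCGGGC) repeat at position 16, so copies of it can start 15 bases apart and overlap. Below is a minimal Python sketch of an overlap-aware count. The function names and the synthetic test sequence are made up for illustration and are not taken from the real NC_001798 genome, and the FASTA reader assumes a single-record file. Unlike the tr command above, it also skips the ">" header line instead of gluing it onto the sequence.

```python
def read_fasta_sequence(path: str) -> str:
    """Return the sequence of a single-record FASTA file as one uppercase string."""
    parts = []
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                continue  # skip the header so it is not joined to the sequence
            parts.append(line.strip().upper())
    return "".join(parts)


def count_overlapping(seq: str, kmer: str) -> int:
    """Count every start position of kmer in seq, allowing overlapping matches."""
    count = 0
    start = seq.find(kmer)
    while start != -1:
        count += 1
        start = seq.find(kmer, start + 1)  # advance one base, not one full match
    return count


KMER = "GGGCGGGGGTCGGGCGGGCGGGGGTCGGGCG"
UNIT = KMER[:15]  # the 15-base repeat unit inside the 31-mer

# Synthetic stand-in sequence (an assumption, NOT the real genome):
# 14 copies of the unit plus one trailing G.
synthetic = UNIT * 14 + "G"

print(count_overlapping(synthetic, KMER))  # 13: one match every 15 bases
print(synthetic.count(KMER))               # 5: str.count, like grep -o, skips overlaps
```

To check the real file, something like count_overlapping(read_fasta_sequence("NC_001798.fa"), KMER) should do. If counts still disagree for other k-mers, it may also be worth checking whether the counter was run in canonical mode (for Jellyfish, the -C option), which merges each k-mer with its reverse complement.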
Oh, this is great! Exactly the tool I need. Thank you very much.