I have recently been looking at different k-mer tools (e.g., jellyfish). They all perform well, with different computational complexities. However, most of them are counting tools. I'm interested in a tool that finds k-mers that occur more often than expected (more of a probability-based approach). Has anyone worked with or seen a tool that generates k-mer counts plus a background distribution?
You can use DSK from the GATB project, a k-mer counter that also provides a histogram of k-mer abundance (see the README file for more information). For instance:
dsk -file myreads.fa -kmer-size 31
It will produce an HDF5 file from which you can extract the k-mer histogram with the following command (the h5dump tool is provided with DSK):
h5dump -y -d dsk/histogram myreads.h5 | grep '[0-9]' | grep -v '[A-Z]' | paste - -
You can plot it directly with gnuplot:
h5dump -y -d dsk/histogram myreads.h5 | grep '[0-9]' | grep -v '[A-Z]' | paste - - | gnuplot -p -e 'plot [0:100] "-" with lines'
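If you want a number rather than a plot, the histogram can serve as an empirical background: pick an abundance cutoff such that only a small fraction of distinct k-mers lie above it. Here is a minimal Python sketch of that idea. It assumes the piped text has two whitespace-separated integer columns per line (abundance, number of distinct k-mers), which is what I would expect from the paste step but have not checked against every DSK version. The function names and the 0.99 default are my own choices, not part of DSK.

```python
def parse_histogram(lines):
    """Parse 'abundance n_kmers' pairs; blank or malformed lines are skipped."""
    hist = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue
        try:
            abundance, n_kmers = int(fields[0]), int(fields[1])
        except ValueError:
            continue
        hist[abundance] = hist.get(abundance, 0) + n_kmers
    return hist


def abundance_cutoff(hist, quantile=0.99):
    """Smallest abundance a such that at least `quantile` of the distinct
    k-mers have abundance <= a (an empirical quantile of the histogram)."""
    total = sum(hist.values())
    if total == 0:
        raise ValueError("empty histogram")
    running = 0
    for abundance in sorted(hist):
        running += hist[abundance]
        if running >= quantile * total:
            return abundance
    return max(hist)
```

You would call it as abundance_cutoff(parse_histogram(sys.stdin)) after piping in the two-column output. K-mers above the returned abundance are the candidates for "more than expected", in a purely empirical sense.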
There is also a tool, 'dsk2ascii', that lists the (k-mer, count) pairs in a human-readable format, so you can do further processing on them.
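As one example of such processing, here is a rough Python sketch that compares each observed count with what a very simple background model would predict. It assumes each line of the text dump is a k-mer followed by its count, separated by whitespace (please check this against your own output). The background is an independent-bases model estimated from the k-mers themselves, which is my choice for illustration and not something DSK provides. It also ignores the fact that a counter may merge a k-mer with its reverse complement, so treat the ratios as a first pass only.

```python
import math
from collections import Counter


def parse_kmer_counts(lines):
    """Parse 'KMER count' lines; malformed lines and zero counts are skipped."""
    counts = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 2 or not fields[1].isdigit():
            continue
        n = int(fields[1])
        if n > 0:
            counts[fields[0].upper()] = n
    return counts


def enrichment(counts):
    """Return {kmer: (observed, expected, log2(observed / expected))}.

    Expected counts come from an independent-bases background whose base
    frequencies are estimated from the counted k-mers (an assumption).
    """
    total = sum(counts.values())
    base_totals = Counter()
    for kmer, n in counts.items():
        for base in kmer:
            base_totals[base] += n
    n_bases = sum(base_totals.values())
    freq = {base: c / n_bases for base, c in base_totals.items()}

    result = {}
    for kmer, observed in counts.items():
        # Probability of the k-mer if every base were drawn independently.
        p = math.prod(freq[base] for base in kmer)
        expected = p * total
        result[kmer] = (observed, expected, math.log2(observed / expected))
    return result
```

Sorting the result by the log2 ratio puts the most over-represented k-mers first. A higher-order Markov background or a proper significance test would be the natural next step, but that goes beyond this sketch.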