A few weeks ago I was looking for a tool that would help me get "DNA composition" statistics for my sequencing. Something that would give me a dataset which which I could ask questions about GC bias, or over-represented sequences, motifs, etc. There are tools to answer each question specifically, but I was looking for something more general from which many analyses could be built on. This led me to k-mer counting, and all the down-stream tools which leverage k-mer result files.
k-mer counting tools are pretty cool, but all the ones I tried had some draw backs; like high RAM requirements, very long run-times (although the latests bloom-filter based tools seem to mitigate this somewhat), but most importantly requiring a specific k-mer size to be chosen. I really wanted all mers in the dataset so I could look at ''GC' and 'GCG' and 'GCCGACGGACGAC' without having to re-run any analyses. I couldn't find a tool like this after a brief search, so gave up and wrote my own in NumPy based off suffix-arrays.
Two weeks later I have a functional program in the sense that I get results, but before I invest any time making it usable for others, I thought I should investigate further if there are tools which already do this. Making a suffix array was a nice learning experience for me so I haven't lost anything if such tools already exist - and if they do i'd love to compare performance characteristics - but if not I might consider tidying up the code and making proper documentation. Does anyone know of such tools?
Thank you so much, and happy Diwali :)
For fixed k-mer, use kmc2 and DSK, especially when you have a fast disk. They are faster and use less memory than jellyfish.
If you don't want to use fixed k-mer, the right data structure is suffix array, FM-index or something equivalent. You can query any sequences up to the read length. Nonetheless, for a particular k-mer size, it is usually not as efficient as k-mer based methods.
This is an interesting question because even though there are dozens of k-mer tools available they do all seem to have the same limitations: very short k-mer length requirements (usually <32bp) and fixed length search. It seems like there are some new tools reporting performance improvements but they still seem to have those features. Tallymer (part of GenomeTools) is the only tool, to my knowledge, that can compute statistics for an arbitrary length (or range) of k-mers. It sounds like you want the 'occratio' subcommand, which runs for a range of k-mers (as opposed to the 'mkindex' command which runs for a fixed length). Here is an example to get a range of unique/nonunique k-mers and get the relative amounts.
gt tallymer occratio -output unique nonunique -minmersize 10 -maxmersize 180 -esa db gt tallymer occratio -output unique relative -minmersize 10 -maxmersize 180 -esa db
Note, you have to create 'db' first, but this was just an example. I can't go into the details, but I know that Tallymer uses a data structure called an enhanced suffix array. This is by the same author of Vmatch, and MUMmer (and many other tools). I should mention that Vmatch is also amazingly flexible for doing k-mer arithmetic and we could probably come up with an equivalent command using that tool. I agree with you about learning by doing, but this is a problem I would personally not try to solve since there are great tools available that allow you to answer virtually any question you may have about genomes and k-mers.
BBMap has a very fast program for kmer-counting, KmerCountExact. Usage:
kmercountexact.sh in=reads.fq out=counts.fasta k=20
An optional flag "prefilter=2", for example, will use Bloom filters to ignore kmers with counts of 2 or less. Unfortunately, it does require a fixed kmer length per run, but at least for short kmers, it uses only a trivial amount of memory. It does not use disk at all for temp files.
However, it's hard to give any more specific advice as I'm not sure what you're trying to do. If I want to look at GC characteristics of data, for example, I'd probably want to plot a gc content histogram of reads, or a gc-vs-coverage plot, or something like that... and text files full of kmers and their counts don't really seem overly useful for those applications, unless you are dealing with very short kmers for your motif-overrepresentation question..