A few weeks ago I was looking for a tool that would help me get "DNA composition" statistics for my sequencing data. Something that would give me a dataset with which I could ask questions about GC bias, over-represented sequences, motifs, etc. There are tools to answer each question specifically, but I was looking for something more general on which many analyses could be built. This led me to k-mer counting, and all the downstream tools which leverage k-mer result files.
k-mer counting tools are pretty cool, but all the ones I tried had some drawbacks: high RAM requirements, very long run-times (although the latest Bloom-filter-based tools seem to mitigate this somewhat), and most importantly the need to choose a specific k-mer size up front. I really wanted all mers in the dataset, so I could look at 'GC' and 'GCG' and 'GCCGACGGACGAC' without having to re-run any analyses. I couldn't find a tool like this after a brief search, so I gave up looking and wrote my own in NumPy, based on suffix arrays.
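To make the idea concrete, here is a minimal sketch of why a suffix array answers "how many times does this mer occur?" for any length without re-indexing. It is not the actual program: it uses plain Python lists rather than NumPy, builds the array naively (fine for a toy string, far too slow for real reads), looks at one strand only, and the names `build_suffix_array` and `count_occurrences` are made up for illustration.

```python
def build_suffix_array(text):
    # Naive construction: sort every suffix start position by the suffix
    # that begins there. Roughly O(n^2 log n), so only for small inputs.
    return sorted(range(len(text)), key=lambda i: text[i:])


def count_occurrences(text, sa, pattern):
    # All suffixes starting with `pattern` form one contiguous run in the
    # sorted suffix array, so two binary searches give the count.
    k = len(pattern)

    # First position whose k-length prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + k] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # First position whose k-length prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + k] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return lo - start


if __name__ == "__main__":
    seq = "GCGCGACG"
    sa = build_suffix_array(seq)
    # The same index serves queries of any mer length.
    for mer in ("GC", "GCG", "CG", "ACG", "T"):
        print(mer, count_occurrences(seq, sa, mer))
```

The index is built once, and each query costs two binary searches regardless of the mer length, which is what removes the need to pick k in advance.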
Two weeks later I have a functional program, in the sense that I get results, but before I invest any time making it usable for others, I thought I should investigate further whether tools which already do this exist. Making a suffix array was a nice learning experience for me, so I haven't lost anything if such tools already exist (and if they do, I'd love to compare performance characteristics), but if not I might consider tidying up the code and writing proper documentation. Does anyone know of such tools?
Thank you so much, and happy Diwali :)