Can anyone recommend a software solution to do this:
- Input: about 100,000 short peptide sequences -- unaligned -- of varying lengths, but mostly under 20 residues.
- Output: amino-acid profiles (e.g. sequence logo maps) describing groups of similar, over-represented kmers (say, 3-, 4-, or 5-mers).
I can think of ways to tackle this myself*, but why re-invent the wheel? I'm hoping that my question and any discussion that follows will also help others.
* PS. My approach would be something like this:
- count all unique kmers
- calculate pairwise distances between the unique kmers (e.g. Hamming distance)
- select clusters (clades) of similar kmers
- use these kmers (and their counts) to build sequence logo maps
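To make the four steps above concrete, here is a rough Python sketch using only the standard library. Everything in it is an assumption chosen for illustration, not an established tool: the function names are made up, the distance is plain Hamming distance (a substitution-matrix score might suit peptides better), and the clustering is simple single-linkage via union-find with a hypothetical `max_dist` cutoff. The all-against-all comparison is quadratic in the number of unique kmers, so on ~100,000 peptides you would probably want to keep only kmers above some count threshold (the hypothetical `min_count`) before clustering.

```python
from collections import Counter


def count_kmers(peptides, k):
    """Step 1: count every kmer of length k across all peptides."""
    counts = Counter()
    for pep in peptides:
        for i in range(len(pep) - k + 1):
            counts[pep[i:i + k]] += 1
    return counts


def hamming(a, b):
    """Step 2: number of mismatching positions between two equal-length kmers."""
    return sum(x != y for x, y in zip(a, b))


def cluster_kmers(kmers, max_dist=1):
    """Step 3: single-linkage clusters; kmers within max_dist are linked.

    Illustrative choice only. This is O(n^2) in the number of kmers.
    """
    kmers = list(kmers)
    parent = list(range(len(kmers)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(kmers)):
        for j in range(i + 1, len(kmers)):
            if hamming(kmers[i], kmers[j]) <= max_dist:
                parent[find(i)] = find(j)

    clusters = {}
    for i, kmer in enumerate(kmers):
        clusters.setdefault(find(i), []).append(kmer)
    return list(clusters.values())


def position_frequencies(cluster, counts):
    """Step 4: count-weighted residue frequencies per position.

    The result (one dict per position) is the kind of matrix a
    sequence-logo tool takes as input.
    """
    k = len(cluster[0])
    columns = [Counter() for _ in range(k)]
    for kmer in cluster:
        for pos, aa in enumerate(kmer):
            columns[pos][aa] += counts[kmer]
    profile = []
    for col in columns:
        total = sum(col.values())
        profile.append({aa: n / total for aa, n in col.items()})
    return profile


if __name__ == "__main__":
    # Toy input; real data would be read from a file.
    peptides = ["ACDKL", "ACDRL", "GGACD", "WYACE"]
    min_count = 2  # hypothetical over-representation cutoff
    counts = count_kmers(peptides, k=3)
    frequent = [kmer for kmer, n in counts.items() if n >= min_count]
    for cluster in cluster_kmers(frequent, max_dist=1):
        print(cluster, position_frequencies(cluster, counts))
```

A raw count cutoff is only a stand-in for "over-represented"; comparing observed counts against a background expectation (e.g. from residue frequencies or shuffled peptides) would be the more careful way to do it.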