I'm running a competition on Kaggle.com about HIV-1 progression. Check it out if you're interested; there's a 500 USD prize in it for the winner! A number of machine-learning researchers with no biology background have been looking for a resource that can extract information from a nucleotide (NT) sequence (or a batch of sequences) for use as feature sets in their machine-learning algorithms.
So far I've suggested k-mers, multiple alignments, and known resistance mutations. I've even provided code for counting all k-mers in a sequence. Does anyone have any other suggestions? I'm especially interested in tools that can do the feature extraction.
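To make the k-mer idea concrete for people without a biology background, here is a minimal sketch in plain Python. It is not necessarily the code provided for the competition; the function names `kmer_counts` and `kmer_feature_vector` are hypothetical, and it assumes an unambiguous A/C/G/T alphabet. The first function counts every overlapping k-mer in a sequence, and the second turns those counts into a fixed-length vector that can be fed to a learning algorithm.

```python
from collections import Counter
from itertools import product


def kmer_counts(seq, k):
    """Count every overlapping k-mer (substring of length k) in seq."""
    seq = seq.upper()
    # A sequence shorter than k yields an empty Counter.
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_feature_vector(seq, k, alphabet="ACGT"):
    """Return counts for all len(alphabet)**k possible k-mers, in a fixed order.

    Assumption: k-mers containing characters outside `alphabet`
    (for example the ambiguity code N) are simply left out of the vector.
    """
    counts = kmer_counts(seq, k)
    return [counts["".join(p)] for p in product(alphabet, repeat=k)]


if __name__ == "__main__":
    example = "ACGTAC"  # toy sequence, not real HIV-1 data
    print(kmer_counts(example, 2))
    print(kmer_feature_vector(example, 1))
```

Because the vector always has 4**k entries in the same order, sequences of different lengths end up with comparable feature sets; dividing each count by the total number of k-mers is one option if sequence length should not influence the features.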
Thanks a bunch,