Hello, I’m looking to extract features from sets of sequences, with a minimum of assumptions, so that the sets can be compared with one another downstream. For example, given ATGAGGA and TTGGCGTA for category 1, and GGTTGGTT and CCTTAAT for category 2, determine which category AGGAAGEA belongs to.
What are the usual ways to go about this sort of thing?
How would you extract features that don’t necessarily conform to fixed-size k-mers? For example, an n-mer at one location might be related to an m-mer some distance away.
I’ve looked at strategies such as ‘bag of words’, but they seem unsuited to the problem because, among other things, you don’t know in advance the dictionary to use for breaking the string into words.
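To make the question concrete, here is a minimal sketch of the kind of bag-of-words baseline I mean. It sidesteps the unknown dictionary by counting every substring of length 1 to max_k, which is my own assumption rather than an established recipe; the names kmer_counts and max_k are placeholders I made up. It still says nothing about relationships between subsequences at a distance, which is the part I am really asking about.

```python
from collections import Counter

def kmer_counts(seq, max_k=3):
    """Count every substring of length 1..max_k in seq (hypothetical helper)."""
    counts = Counter()
    for k in range(1, max_k + 1):
        # Slide a window of width k across the sequence
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

# Example sequences from the question
category_1 = ["ATGAGGA", "TTGGCGTA"]
category_2 = ["GGTTGGTT", "CCTTAAT"]

# Pool the counts of all sequences in a category into one profile
profile_1 = sum((kmer_counts(s) for s in category_1), Counter())
profile_2 = sum((kmer_counts(s) for s in category_2), Counter())
```

A new sequence could then be run through kmer_counts and compared against each profile with whatever similarity measure seems appropriate, but that choice is itself another assumption.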