Question: Feature/Motif Extraction from Sequences?
3.0 years ago by
United States
ad30 wrote:

Hello, I’m looking to do feature extraction from sets of sequences, with a minimum of assumptions, for downstream comparison with other sets. For example, given ATGAGGA and TTGGCGTA for category 1 and GGTTGGTT and CCTTAAT for category 2, determine which category AGGAAGEA belongs to.

What are the usual ways to go about this sort of thing?

How would you extract features that don’t necessarily conform to fixed-size k-mers? For example, an n-mer at one location might be related to an m-mer at some distance.

I’ve looked at strategies such as ‘bag of words’, but they seem unsuited to the problem because, among other things, you don’t know in advance which dictionary to use to break the string into words.


I would say it wholly depends on the nature of your subsequent downstream comparison. What is the question and with what purpose do you want to do the analysis? Can you clarify? There's a whole body of work on k-mer/motif analysis, and you might not want/need to re-invent the wheel.

For instance, if you want to work with k-mers, there is the kmer-SVM software, which works very well for classification. It is based on a gapped k-mer model. However, due to the black-box nature of an SVM, interpretability can be a problem. If you are interested in motif analysis (i.e., transcription factor binding sites), you can use de novo motif finders. Some work with k-mer-based models, others use other approaches. My own software, GimmeMotifs, is an example. Widely used programs are Homer and MEME.
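To make the gapped k-mer idea a bit more concrete, here is a minimal sketch of counting gapped k-mers as features. It is only an illustration in plain Python, not the kmer-SVM implementation; the function name `gapped_kmers` and the parameters `l` (window length) and `k` (number of informative positions) are assumptions made for this example.

```python
from collections import Counter
from itertools import combinations


def gapped_kmers(seq, l=4, k=3):
    """Count gapped k-mers in seq.

    Every window of length l is expanded into all patterns that keep
    k positions and replace the other l - k positions with '.'.
    (Hypothetical helper for illustration only.)
    """
    counts = Counter()
    for start in range(len(seq) - l + 1):
        window = seq[start:start + l]
        for keep in combinations(range(l), k):
            masked = "".join(
                base if i in keep else "." for i, base in enumerate(window)
            )
            counts[masked] += 1
    return counts


# One sequence from the question, turned into a feature dictionary.
features = gapped_kmers("ATGAGGA", l=4, k=3)
```

Because the masked positions can match anything, two short words separated by a variable base end up sharing a feature, which is one way to capture the "n-mer related to another n-mer at some distance" case without fixing a dictionary up front.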

modified 3.0 years ago • written 3.0 years ago by simon.vanheeringen200

I have a set of ChIP peaks. I want to extract the sequences and then use some of them to predict other sequences; the rest of the peaks will be the evaluation set. I'd like two solutions, or just one if it can meet both criteria:
1. Something that is at least partially interpretable.
2. The best performance (accuracy).

written 3.0 years ago by ad30
3.0 years ago by
simon.vanheeringen200 wrote:

Given your clarification, I think kmer-SVM does exactly what you want. You can train a model and evaluate its performance using cross-validation. The SVM model can then be used to classify new sequences. The k-mers will have associated weights that you can cluster, match to known motifs, etc.
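As a rough sketch of that workflow (train, cross-validate, predict, inspect k-mer weights), the snippet below uses plain k-mer counts and a simple perceptron standing in for the SVM. It is not kmer-SVM itself and makes no claim about its interface; all function names here are hypothetical, and the toy sequences are the ones from the question.

```python
from collections import Counter


def kmer_counts(seq, k=3):
    """Count all overlapping k-mers in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def score(weights, bias, feats):
    """Linear decision value: bias plus weighted k-mer counts."""
    return bias + sum(weights.get(kmer, 0.0) * n for kmer, n in feats.items())


def train_perceptron(seqs, labels, k=3, max_epochs=100):
    """Fit a linear model with perceptron updates; labels are +1 or -1."""
    weights, bias = {}, 0.0
    data = [(kmer_counts(s, k), y) for s, y in zip(seqs, labels)]
    for _ in range(max_epochs):
        mistakes = 0
        for feats, y in data:
            if y * score(weights, bias, feats) <= 0:  # misclassified
                for kmer, n in feats.items():
                    weights[kmer] = weights.get(kmer, 0.0) + y * n
                bias += y
                mistakes += 1
        if mistakes == 0:  # every training sequence is classified correctly
            break
    return weights, bias


def predict(model, seq, k=3):
    """Return +1 or -1 for a new sequence."""
    weights, bias = model
    return 1 if score(weights, bias, kmer_counts(seq, k)) > 0 else -1


def leave_one_out_accuracy(seqs, labels, k=3):
    """Cross-validation: hold out each sequence once, train on the rest."""
    hits = 0
    for i in range(len(seqs)):
        model = train_perceptron(seqs[:i] + seqs[i + 1:],
                                 labels[:i] + labels[i + 1:], k)
        hits += predict(model, seqs[i], k) == labels[i]
    return hits / len(seqs)


seqs = ["ATGAGGA", "TTGGCGTA", "GGTTGGTT", "CCTTAAT"]
labels = [1, 1, -1, -1]  # +1 = category 1, -1 = category 2
model = train_perceptron(seqs, labels)

# k-mers sorted by weight: most category-2-like first, category-1-like last.
ranked = sorted(model[0].items(), key=lambda kv: kv[1])
```

With only four sequences this merely exercises the code; on real peak sets the ranked k-mer weights are the part you would cluster or compare against known motifs.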


