Question

Minimize Number Of Motifs Describing A Peptide Set

2


13.6 years ago

julieN ▴ 250

I have a collection of peptide sequences that were pulled down by polyclonal antibodies generated against a (longish) specific antigen. They have been hand-curated to segregate the sequences into expected positives and expected negatives. As I have found to be typical for this type of experiment, it looks like different antibodies within the polyclonals are recognizing different parts of the antigen.

We've defined about 20 submotifs from the antigen that cover most of the expected positives with low or no occurrence in the expected negative set, but there is significant overlap among them. My goal is to cover the largest number of expected positives with the fewest motifs. Literature and Google searches on protein motif/antigen/computation pull up tons of papers, but they are mostly on motif discovery. My question is whether this is a known class of problem and, if so, what it is called.

motif • 2.9k views

ADD COMMENT • link updated 10.8 years ago by karlpersius.manlimos • 0 • written 13.6 years ago by julieN ▴ 250

score 2 · Answer 1 · 2011-11-30

I'm not sure what the general class of problem would be; it sounds a bit like a rule-learning problem.

However, you could try solving it with LASSO-regularized logistic regression via the R package glmnet.

Make each motif an input feature (i.e. an independent variable); the value of that feature (i.e. x_i) for a particular sequence is the number of occurrences of the motif in the sequence.
Make the target value 1 for a positive sequence and 0 for a negative one.
The motifs whose regression coefficients returned by glmnet are non-zero form the non-redundant set.
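
The encoding described above can be sketched in a few lines of Python. This is only an illustration: the motif patterns, motif names, and peptide sequences below are made up, and it assumes the motifs are written as regular expressions. Note that `re.finditer` counts non-overlapping matches, which may or may not be the counting you want.

```python
import re

# Hypothetical motifs (as regular expressions) and peptides, for illustration only.
motifs = {"m1": r"R.S", "m2": r"KK", "m3": r"P[ST]P"}
positives = ["ARKSRAS", "GKKPSP", "KKAKK"]
negatives = ["AAAGGG", "PLPAAA"]


def count_occurrences(pattern, sequence):
    """Count non-overlapping matches of pattern in sequence."""
    return sum(1 for _ in re.finditer(pattern, sequence))


sequences = positives + negatives

# One row per sequence, one column per motif (in dict insertion order).
X = [[count_occurrences(p, seq) for p in motifs.values()] for seq in sequences]

# Target: 1 for an expected positive, 0 for an expected negative.
y = [1] * len(positives) + [0] * len(negatives)

for seq, row, label in zip(sequences, X, y):
    print(seq, row, label)
```

The resulting `X` and `y` could then be written to a file and passed to whatever L1-regularized logistic regression implementation you prefer.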

glmnet gives you the entire regularization path, which means that for each motif set size it provides a guess at the best non-redundant set.
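
To show what a regularization path looks like, here is a rough, self-contained Python sketch. It is not glmnet and does not use glmnet's algorithm: it fits L1-penalized logistic regression with a plain proximal-gradient loop, on a made-up motif-count matrix with hypothetical motif names. As the penalty `lam` shrinks, more motifs get non-zero coefficients.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def soft_threshold(v, t):
    """Proximal operator of the L1 penalty: shrink v toward zero by t."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def fit_l1_logistic(X, y, lam, step=0.1, iters=2000):
    """L1-penalized logistic regression (unpenalized intercept), proximal gradient."""
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(iters):
        grad_w, grad_b = [0.0] * p, 0.0
        for row, target in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, row))) - target
            grad_b += err
            for j in range(p):
                grad_w[j] += err * row[j]
        b -= step * grad_b / n
        # Gradient step on the log loss, then shrinkage from the L1 penalty.
        w = [soft_threshold(wj - step * gj / n, step * lam)
             for wj, gj in zip(w, grad_w)]
    return w, b


# Toy data (hypothetical): rows are sequences, columns are motif counts.
names = ["motif_A", "motif_B", "motif_C"]
X = [[1, 0, 1], [1, 1, 0], [0, 1, 0],   # expected positives
     [0, 0, 0], [0, 0, 1], [0, 0, 0]]   # expected negatives
y = [1, 1, 1, 0, 0, 0]

# Walk the path from a strong penalty to a weak one.
path = {}
for lam in (10.0, 0.1, 0.01):
    w, _ = fit_l1_logistic(X, y, lam)
    path[lam] = [name for name, wj in zip(names, w) if wj != 0.0]
    print(f"lambda={lam}: selected motifs = {path[lam]}")
```

In practice you would pick the point on the path that gives acceptable coverage of the positives with the fewest selected motifs, for example by cross-validation.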

score 2 · Answer 2 · 2011-11-30


13.6 years ago

Niallhaslam 2.3k

You could try CompariMotif. It lets you compare two sets of motifs (I'm guessing you've defined them as regular expressions?). If you search your query motif against your database of motifs (i.e. all your motifs), it will compute the overlaps, matches, degeneracy, variants, parents, etc.

I've no idea what the overall class of problem is called though.


1


Thanks @niallhaslam--this is a nicer implementation of what I have produced with my own scripts, and I will use it in the future.

ADD REPLY • link 13.6 years ago by julieN ▴ 250

score 0 · Answer 3 · 2014-09-08


10.8 years ago

karlpersius.manlimos • 0

Is your data available? I really need this for my thesis. Please help me. Thank you.
