Question: Minimize Number Of Motifs Describing A Peptide Set
2
9.1 years ago by
julieN250
Cambridge, MA USA
julieN250 wrote:

I have a collection of peptide sequences that were pulled down by polyclonal antibodies generated against a (longish) specific antigen. They have been hand-curated to segregate the sequences into expected positives and expected negatives. As I have found to be typical for this type of experiment, it looks like different antibodies within the polyclonals are recognizing different parts of the antigen.

We've defined about 20 submotifs from the antigen that will cover most of the expected positives with low/no occurrence in the expected negative set, but there is significant overlap among them. My goal is to cover the largest number of expected positives with the fewest motifs. Literature and Google searches on protein motif/antigen/computation pull up tons of papers, but they are mostly on motif discovery. My question is whether this is a known class of problem, and, if so, what is it called?

motif • 1.5k views
modified 6.4 years ago by karlpersius.manlimos0 • written 9.1 years ago by julieN250
2
9.1 years ago by
Qdjm1.9k
Toronto
Qdjm1.9k wrote:

I'm not sure what the general class of problem would be; it sounds a bit like a rule-learning problem.

However, you could try solving it with LASSO-regularized logistic regression using the glmnet package in R.

• Make each motif an input feature (i.e. an independent variable); the value of that feature (i.e. x_i) for a particular sequence is the number of occurrences of the motif in that sequence.
• Make the target values 1 for a positive sequence and 0 for a negative.
• The motifs with non-zero regression coefficients returned by glmnet form the non-redundant set.

glmnet gives you the entire regularization path, which means that for each motif set size it will provide a guess at the best non-redundant set.
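
As a rough illustration of the setup described in the bullets above, the feature matrix and targets could be built along these lines in Python. This is only a sketch: it assumes the motifs are written as regular expressions, counts non-overlapping matches, and uses made-up motifs, sequences, and function names that are not from the original data.

```python
import re

def count_occurrences(motif, sequence):
    """Count non-overlapping regex matches of `motif` in `sequence`."""
    return len(re.findall(motif, sequence))

def build_design_matrix(motifs, positives, negatives):
    """Return (X, y): one row per sequence, one column per motif.

    X[i][j] is the number of occurrences of motif j in sequence i;
    y[i] is 1 for an expected positive and 0 for an expected negative.
    """
    sequences = list(positives) + list(negatives)
    X = [[count_occurrences(m, s) for m in motifs] for s in sequences]
    y = [1] * len(positives) + [0] * len(negatives)
    return X, y

# Hypothetical toy data, for illustration only.
motifs = ["KR.E", "GG", "P[ST]P"]
positives = ["AKRDEGG", "PSPKRAE", "GGAGG"]
negatives = ["AAAAAA", "LLPLL"]

X, y = build_design_matrix(motifs, positives, negatives)
for row, label in zip(X, y):
    print(label, row)
```

X and y would then be handed to an L1-penalized logistic regression such as glmnet; the fitting step itself is not shown here.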

1

Thank you @qdjm! This is what I was looking for, and would never have found otherwise. I'm still not sure of the name of the problem ("subset selection" maybe), but I've been able to pull out similar literature, and that's what I'm after.

2
9.1 years ago by
Niallhaslam2.3k
Dublin
Niallhaslam2.3k wrote:

You could try CompariMotif. It will allow you to compare the two sets of motifs (I'm guessing you've defined them as regular expressions?). If you search your query motif against your database of motifs (i.e. all your motifs), it will compute the overlaps, matches, degeneracy, variants, parents, etc.
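
For a much cruder look at motif overlap than a dedicated tool gives, one could compare which sequences each motif hits. The sketch below is not CompariMotif's algorithm, which compares the motif definitions themselves. It assumes regular-expression motifs and uses hypothetical motifs, sequences, and function names, and it just reports the Jaccard overlap of the matched-sequence sets for each motif pair.

```python
import re
from itertools import combinations

def matched_set(motif, sequences):
    """Indices of sequences containing at least one match of `motif`."""
    return {i for i, s in enumerate(sequences) if re.search(motif, s)}

def pairwise_jaccard(motifs, sequences):
    """Jaccard overlap of matched-sequence sets for every motif pair."""
    hits = {m: matched_set(m, sequences) for m in motifs}
    overlaps = {}
    for a, b in combinations(motifs, 2):
        union = hits[a] | hits[b]
        # Two motifs that match nothing are treated as non-overlapping.
        overlaps[(a, b)] = len(hits[a] & hits[b]) / len(union) if union else 0.0
    return overlaps

# Hypothetical toy data, for illustration only.
motifs = ["KR.E", "GG", "P[ST]P"]
sequences = ["AKRDEGG", "PSPKRAE", "GGAGG"]

overlaps = pairwise_jaccard(motifs, sequences)
for pair, score in overlaps.items():
    print(pair, round(score, 3))
```

Pairs with a score near 1 hit almost the same sequences, so one of the two is a candidate for removal.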

I've no idea what the overall class of problem is called though.

1

Thanks @niallhaslam, this is a nicer implementation of what I have produced with my own scripts, and I will use it in the future.

0
6.4 years ago by
Philippines
karlpersius.manlimos0 wrote:

Is your data available? I really need this for my thesis. Please help me. Thank you.