Minimize Number Of Motifs Describing A Peptide Set
12.4 years ago
julieN ▴ 250

I have a collection of peptide sequences that were pulled down by polyclonal antibodies generated against a (longish) specific antigen. They have been hand-curated to segregate the sequences into expected positives and expected negatives. As I have found to be typical for this type of experiment, it looks like different antibodies within the polyclonals are recognizing different parts of the antigen.

We've defined about 20 submotifs from the antigen that will cover most of the expected positives with low/no occurrence in the expected negative set, but there is significant overlap among them. My goal is to cover the largest number of expected positives with the fewest motifs. Literature and Google searches on protein motif/antigen/computation pull up tons of papers, but they are mostly on motif discovery. My question is whether this is a known class of problem, and, if so, what is it called?

motif • 2.3k views
12.4 years ago
Qdjm 1.9k

Not sure what the general class of problem would be; it sounds a bit like a rule learning problem.

However, you could try to solve it using LASSO-regularized logistic regression via the glmnet package in R.

  • Make each motif an input feature (i.e. an independent variable); the value of that feature (i.e. x_i) for a particular sequence is the number of occurrences of the motif in the sequence.
  • Make the target values 1 for a positive sequence and 0 for a negative.
  • The motifs whose regression coefficients returned by glmnet are non-zero form the non-redundant set.

glmnet gives you the entire regularization path, which means that for each motif set size it will provide a guess at the best non-redundant set.
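
To make the recipe above concrete, here is a minimal sketch in plain Python rather than R. It does not call glmnet; it fits the same kind of L1-penalized logistic regression with a simple proximal gradient loop, and every name in it (count_matrix, lasso_logistic, selected, the toy peptides and motifs) is a hypothetical stand-in chosen for illustration.

```python
import math
import re


def count_matrix(peptides, motifs):
    """One row per peptide, one column per motif (a regex);
    each entry is the number of non-overlapping matches."""
    return [[len(re.findall(m, p)) for m in motifs] for p in peptides]


def soft_threshold(v, t):
    """Proximal operator of the L1 penalty: shrink v toward zero by t."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def lasso_logistic(X, y, lam, step=0.1, iters=2000):
    """L1-penalized logistic regression via proximal gradient descent.
    The intercept b is not penalized. Returns (weights, intercept)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(iters):
        # Predicted probability of "positive" for every peptide.
        p = [1.0 / (1.0 + math.exp(-(b + sum(wj * xj for wj, xj in zip(w, row)))))
             for row in X]
        # Gradient of the mean logistic loss.
        gb = sum(pi - yi for pi, yi in zip(p, y)) / n
        gw = [sum((pi - yi) * row[j] for pi, yi, row in zip(p, y, X)) / n
              for j in range(d)]
        b -= step * gb
        # Gradient step followed by soft-thresholding gives exact zeros.
        w = [soft_threshold(wj - step * gj, step * lam) for wj, gj in zip(w, gw)]
    return w, b


def selected(motifs, w):
    """Motifs whose coefficient is non-zero."""
    return [m for m, wj in zip(motifs, w) if wj != 0.0]


if __name__ == "__main__":
    # Toy data (assumed for illustration): first four are expected positives.
    peptides = ["XABX", "ABAB", "XCDX", "CDAB", "XEFX", "XXXX", "EFEF", "GGGG"]
    labels = [1, 1, 1, 1, 0, 0, 0, 0]
    motifs = ["AB", "CD", "EF"]
    X = count_matrix(peptides, motifs)
    # A crude stand-in for the regularization path: strong to weak penalty.
    for lam in (10.0, 0.2, 0.05, 0.01):
        w, _ = lasso_logistic(X, labels, lam)
        print(lam, selected(motifs, w), [round(wj, 3) for wj in w])
```

Sweeping the penalty from strong to weak mimics walking along the regularization path: motifs enter the model one after another as the penalty relaxes. Since the goal is to cover positives, motifs with positive coefficients are the ones of interest; a negative coefficient marks a motif associated with the negative set.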

Thank you @qdjm! This is what I was looking for, and would never have found otherwise. I'm still not sure of the name of the problem ("subset selection" maybe), but I've been able to pull out similar literature, and that's what I'm after.

12.4 years ago

You could try CompariMotif - this will allow you to compare two sets of motifs (I'm guessing you've defined them as regular expressions?). If you search your query motif against your database of motifs (i.e. all your motifs), it will compute the overlaps, matches, degeneracy, variants, parents, etc.

I've no idea what the overall class of problem is called though.
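
For a rough feel of motif-versus-motif overlap without the tool, the sketch below compares regular-expression motifs by the peptides they match. This is not how CompariMotif works internally (it compares the motif definitions themselves); it is only an assumed, data-driven approximation, and the function names and toy sequences are hypothetical.

```python
import re
from itertools import combinations


def matched_sets(peptides, motifs):
    """Map each motif (a regex) to the set of peptides it matches."""
    return {m: {p for p in peptides if re.search(m, p)} for m in motifs}


def pairwise_overlap(peptides, motifs):
    """Jaccard overlap of matched peptide sets for every motif pair.
    1.0 means the two motifs hit exactly the same peptides."""
    hits = matched_sets(peptides, motifs)
    overlap = {}
    for a, b in combinations(motifs, 2):
        union = hits[a] | hits[b]
        overlap[(a, b)] = len(hits[a] & hits[b]) / len(union) if union else 0.0
    return overlap


if __name__ == "__main__":
    # Toy peptides and motifs, assumed for illustration only.
    peptides = ["XABX", "ABAB", "XCDX", "CDAB"]
    motifs = ["AB", "CD", "A[BC]"]
    for pair, score in pairwise_overlap(peptides, motifs).items():
        print(pair, score)
```

Pairs with an overlap close to 1.0 on the positive set are candidates for redundancy: one of the two motifs can probably be dropped without losing coverage.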

Thanks @niallhaslam--this is a nicer implementation of what I have produced with my own scripts, and I will use it in the future.

9.6 years ago

Is your data available? I really need this for my thesis. Please help me. Thank you.
