Question: Statistical Significance Of Sequence-Based Descriptor Data
Asked 7.9 years ago by Spyros:

Hello Biostar community,

I have a sequence analysis question. I am using a database of protein fingerprints (a fingerprint is a set of several distinct sequence motifs excised from multiple sequence alignments). I am investigating functionally important areas (ligand binding, protein-protein and protein-ion interactions) and how they relate to these motifs, which are function/structure-agnostic sequence descriptors: whether functional residues lie within motifs and, if so, how frequently. Some interesting correlations are appearing, e.g. 60% of protein-ion interaction residues fall within motifs, but I need a statistical significance test to determine whether the coincidence of motifs with functional residues is meaningful or due to chance.

In practice the reasoning would be: if 30% of a sequence is covered by motifs, then any functional residue (like any residue picked at random from the primary sequence of the polypeptide) has a 30% probability of lying within a motif by chance alone. Is there a probabilistic model (a probability distribution with an associated test, e.g. the binomial test) that can be used for statistical significance testing here?

Many thanks and apologies for the lengthy text!

sequence statistics • 1.5k views
Answered 7.9 years ago by Michael Kuhn (EMBL Heidelberg):

I'm not sure I understand your approach, but if in doubt: shuffle the data and observe the background distribution. From this you can calculate empirical p-values.

You have to be careful about what you shuffle, though: if there are hidden correlations in the data and the shuffling destroys them, you will get low p-values that may reflect the destroyed correlations rather than the effect you are testing.
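A minimal sketch of the shuffling idea for the motif question, under stated assumptions: a single protein, motifs given as 0-based, end-exclusive spans, and a null model that circularly shifts the whole motif layout along the sequence. The circular shift is only one possible shuffle, chosen here because it keeps motif lengths and spacing intact; the function name `empirical_p_value` and its parameters are hypothetical and not from the thread.

```python
import random

def empirical_p_value(seq_len, motif_spans, functional_pos,
                      n_shuffles=10_000, seed=0):
    """Empirical one-sided p-value for 'functional residues fall in motifs'.

    Null model (an assumption): the motif layout is circularly shifted by a
    random offset, which preserves motif lengths and their relative spacing.
    """
    rng = random.Random(seed)

    # Boolean mask: True where a residue belongs to any motif.
    in_motif = [False] * seq_len
    for start, end in motif_spans:
        for i in range(start, end):
            in_motif[i] = True

    def hits(offset):
        # Number of functional residues covered after shifting motifs by `offset`.
        return sum(in_motif[(p - offset) % seq_len] for p in functional_pos)

    observed = hits(0)
    at_least = sum(hits(rng.randrange(seq_len)) >= observed
                   for _ in range(n_shuffles))
    # Add-one correction so the estimate is never exactly zero.
    return observed, (1 + at_least) / (1 + n_shuffles)


# Made-up example: three 20-residue motifs cover 30% of a 200-residue protein.
motifs = [(20, 40), (90, 110), (150, 170)]
functional = [25, 30, 35, 95, 100, 105, 155, 160, 60, 120]
observed, p = empirical_p_value(200, motifs, functional)
print(observed, p)
```

With circular shifts there are only `seq_len` distinct layouts, so the smallest attainable p-value is roughly `1 / seq_len`; pooling several proteins or choosing a different shuffle changes that limit.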


@Michael: Thank you for the input. When you say "shuffle" the data, do you mean taking a multiple sequence alignment (from which the motifs are excised) and randomly redistributing the amino acid residues? And does "observing the background distribution" refer to consensus sequences/sequence logos? I'm not sure how I would go about doing that programmatically... Thanks again for your ideas!

Reply written 7.9 years ago by Spyros

I don't understand your approach well enough to suggest how to shuffle correctly. Perhaps you could use a random set of starting sequences and create the multiple alignment from those (this would of course only work if you start with a certain subset of proteins).

Reply written 7.9 years ago by Michael Kuhn

The approach is not really a formal one. I simplify amino acid residue occurrence so that it "fits" a binomial distribution: if the fingerprint motifs cover 20% of my protein sequence, then each functional residue has a 20% chance of coinciding with a fingerprint residue purely by chance. I then assume independent trials, i.e. that the positions of motifs and functional residues are statistically independent, and apply the binomial test. Thank you for your input, I'll think about how to programmatically implement such a randomized shuffling of my alignment sequences.

Reply written 7.8 years ago by Spyros
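The binomial reasoning described in the comment above can be sketched as a one-sided tail probability. This is an illustration under the stated independence assumption, not a validated analysis; the function name `binomial_enrichment_p` and the example numbers are hypothetical.

```python
from math import comb

def binomial_enrichment_p(k, n, coverage):
    """P(X >= k) for X ~ Binomial(n, coverage).

    k        : functional residues observed inside motifs
    n        : total functional residues
    coverage : fraction of the sequence covered by motifs, used as the
               per-residue chance of landing in a motif (assumes the
               residues behave as independent trials)
    """
    return sum(comb(n, i) * coverage**i * (1 - coverage)**(n - i)
               for i in range(k, n + 1))


# Made-up example: 6 of 10 functional residues lie in motifs that cover
# 30% of the sequence.
p = binomial_enrichment_p(6, 10, 0.30)
print(round(p, 4))  # about 0.0473
```

Functional residues often cluster along the sequence, which violates the independent-trials assumption and tends to make this p-value too optimistic; comparing it against a shuffle-based empirical p-value is one way to check.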

