Question: Statistical Significance Of Sequence-Based Descriptor Data
7.9 years ago by
Spyros10
Spyros10 wrote:

Hello Biostar community,

I have a sequence analysis question. I am using a database of protein fingerprints (a fingerprint is a set of several distinct sequence motifs excised from multiple sequence alignments). I am investigating functionally important areas (ligand binding, protein-protein, and protein-ion interactions) and how they relate to these motifs, which are function/structure-agnostic sequence descriptors: whether functional residues lie within motifs, and how frequently. Although some interesting correlations are appearing, e.g. 60% of protein-ion interaction residues fall within motifs, I need a statistical significance test to determine whether the overlap between motifs and functional residues is meaningful or due to chance.

In practice: if motifs cover 30% of a sequence, then any functional residue (like any residue picked at random from the primary sequence of the polypeptide) has a 30% probability of lying within a motif by chance alone. Is there a probabilistic model (a probability distribution, e.g. a binomial test) that can be used for statistical significance testing?
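To make the binomial idea concrete, here is a minimal sketch of a one-sided binomial test using only the Python standard library. The counts (20 functional residues, 12 of them inside motifs) are made-up illustration values, and the function name `binomial_upper_tail` is hypothetical. The test assumes each functional residue falls inside a motif independently with probability equal to the motif coverage, which may not hold if functional residues cluster along the sequence.

```python
from math import comb

def binomial_upper_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Hypothetical illustration values, not real data:
n_functional = 20      # functional residues in the protein
n_in_motif = 12        # of those, how many lie inside a motif (60%)
motif_fraction = 0.30  # fraction of the sequence covered by motifs

# Probability of seeing at least this many functional residues
# inside motifs if placement were random with p = motif_fraction.
p_value = binomial_upper_tail(n_in_motif, n_functional, motif_fraction)
print(f"one-sided binomial p-value: {p_value:.4g}")
```

A small p-value here would suggest that functional residues fall inside motifs more often than the 30% coverage alone predicts, subject to the independence assumption above.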

Many thanks and apologies for the lengthy text!

sequence statistics • 1.5k views
7.9 years ago by
Michael Kuhn5.0k
EMBL Heidelberg
Michael Kuhn5.0k wrote:

I'm not sure if I understand your approach, but if in doubt: Shuffle the data, and observe the background distribution. From this you can calculate empirical p-values.

You have to be careful, though, what you shuffle: If there are hidden correlations in the data and you destroy those, you would get low p-values, but these might be caused by the destroyed correlations.
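One possible way to sketch the shuffling idea, assuming the motifs are represented as a 0/1 mask over the sequence positions: rotate the mask by a random offset (a circular shift) many times and count how often the shuffled overlap with the functional residues is at least as large as the observed one. The circular shift is only one choice of shuffle, picked here because it keeps motif lengths and the clustering of functional residues intact. The toy mask, the positions, and the function names are all hypothetical.

```python
import random

def count_overlap(motif_mask, functional_positions):
    """Number of functional positions that fall inside a motif."""
    return sum(motif_mask[i] for i in functional_positions)

def empirical_p_value(motif_mask, functional_positions, n_shuffles=10000, seed=0):
    """Empirical one-sided p-value from random circular shifts of the motif mask."""
    rng = random.Random(seed)
    length = len(motif_mask)
    observed = count_overlap(motif_mask, functional_positions)
    hits = 0
    for _ in range(n_shuffles):
        offset = rng.randrange(length)
        # Rotate the mask: motif lengths and spacing are preserved.
        shifted = motif_mask[offset:] + motif_mask[:offset]
        if count_overlap(shifted, functional_positions) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (hits + 1) / (n_shuffles + 1)

# Toy example (made-up data): a 100-residue sequence with two motifs
# covering positions 10-24 and 60-74, i.e. 30% of the sequence.
mask = [1 if 10 <= i <= 24 or 60 <= i <= 74 else 0 for i in range(100)]
functional = [12, 13, 15, 18, 20, 40, 62, 65, 70, 90]

observed, p_value = empirical_p_value(mask, functional)
print(f"observed overlap: {observed}, empirical p-value: {p_value:.4g}")
```

The collection of shuffled overlap counts is the background distribution; the p-value is the fraction of shuffles that match or exceed the observed overlap.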

@Michael: Thank you for the input. When you say "shuffle" the data, do you mean taking a multiple sequence alignment (from which the motifs are excised) and randomly redistributing the amino acid residues? And does "observing the background distribution" refer to consensus sequences/sequence logos? I'm not sure how I would go about doing that programmatically... Thanks again for your ideas!

I don't understand your approach well enough to suggest how to shuffle correctly. Perhaps you could use a random set of starting sequences and create the multiple alignment from those (this would of course only work if you start with a certain subset of proteins).