Question: Ideal sequence % identity for profile construction
5.2 years ago, Anand Rao (United States) wrote:

Any multiple sequence alignment (MSA) can be converted to a profile HMM (pHMM), and I do understand that the mathematical model of the diversity at each alignment position can be used to score matches with tools such as HMMER2, HMMER3, or HHpred.

However, I am curious whether there are established guidelines for the ideal % identity among the sequences in an MSA, so that signal and noise are balanced in the pHMM and both the sensitivity and the specificity of homolog detection are as high as possible.

I could argue that an MSA composed of sequences with < 20% pairwise identity would be hard to justify without solid evidence of structural or functional equivalence despite the poor sequence conservation. So how much sequence diversity should I allow during MSA inference if I am going to build pHMMs from these MSAs?

Links to any published literature on this topic would be much appreciated. Thanks folks!

modified 5.2 years ago by 5heikki • written 5.2 years ago by Anand Rao
5.2 years ago, 5heikki wrote:

Pfam has been around for quite some time, so perhaps it would be a good idea to read up on their methodology. My guess is that there is no universal optimal value.

modified 5.2 years ago • written 5.2 years ago by 5heikki

I've analyzed the seed sequences for 14,831 Pfam profiles and, as you suspected, there is no uniform average pairwise % identity across these profiles. Some of the averages are really low (< 20%). How can you infer an accurate MSA when % identity is so low? False positive rates in such cases are very high, so I question the validity of these MSAs and of the pHMMs inferred from them, regardless of whether Pfam builds them or I do.

At least that is my current stance. But I would love for someone to correct me or educate me on this aspect. Thanks for your reply.
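
For anyone who wants to repeat this kind of check on their own alignments, here is a minimal sketch of the "average pairwise % identity" statistic. It is not necessarily how the numbers above were computed. It assumes the aligned sequences are already loaded as equal-length strings with `-` as the gap character, and it assumes identity is counted over columns where neither sequence has a gap (other denominators are in use and give different values). The function names and the toy alignment are made up for illustration.

```python
from itertools import combinations


def pairwise_identity(a, b, gap="-"):
    """Percent identity over columns where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == gap or y == gap:
            continue  # assumption: gapped columns are excluded from the denominator
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def mean_pairwise_identity(msa):
    """Average percent identity over all unordered pairs of aligned sequences."""
    pairs = list(combinations(msa, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(pairwise_identity(a, b) for a, b in pairs) / len(pairs)


# Toy alignment, invented for illustration only.
toy_msa = [
    "MKV-LAG",
    "MKVELAG",
    "MRV-LSG",
]
print(round(mean_pairwise_identity(toy_msa), 1))
```

Because the result depends on how gaps and the denominator are treated, any cutoff such as 20% is only comparable between alignments scored with the same definition.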

modified 14 months ago by Ram • written 5.2 years ago by Anand Rao

If I remember correctly, they control the false-positive rate with family-specific score cutoffs, i.e., each pHMM has its own curated threshold associated with it. Search for "gathering threshold".
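
To make the idea concrete, here is a minimal sketch of filtering hits against per-family cutoffs. The family names, scores, and threshold values are hypothetical; real gathering thresholds are stored with each curated profile, and as far as I know HMMER3's `hmmsearch --cut_ga` applies them directly, so you would rarely need to do this by hand.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    family: str
    target: str
    bit_score: float


# Hypothetical per-family cutoffs (bit scores); real values come from the
# curated profile for each family, not from this table.
GATHERING_THRESHOLDS = {"FAM_A": 25.0, "FAM_B": 40.0}


def passes_threshold(hit, thresholds=GATHERING_THRESHOLDS):
    """Accept a hit only if it reaches its own family's cutoff."""
    return hit.bit_score >= thresholds[hit.family]


# Invented hits for illustration.
hits = [
    Hit("FAM_A", "seq1", 30.0),
    Hit("FAM_B", "seq2", 30.0),  # same score as seq1, but FAM_B is stricter
    Hit("FAM_B", "seq3", 55.5),
]
accepted = [h for h in hits if passes_threshold(h)]
print([h.target for h in accepted])
```

The point of the design is that the same raw score can be accepted for one family and rejected for another, which is how a divergent seed alignment can still yield a usable profile.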

However, mathematical modelling in general does not need good overall similarity. It is enough to identify the features that are unique to a particular family or group. Hypothetically, imagine that a particular stretch of 10 amino acids in a 1000 AA protein is shared by all proteins carrying out a specific function, while no other protein happens to have this stretch. Then you need to train your profile to target exactly these 10 AA, no more and no less. The rest of the sequence does not matter, so in this example an overall sequence similarity of only 1% would be enough.
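
A toy sketch of that scenario: the sequences below are invented, share only a short motif, and are otherwise unrelated, yet a simple per-column conservation score picks out the motif columns. This is only an illustration under the assumption of an ungapped alignment of equal-length strings. A pHMM does something more refined with position-specific emission probabilities, and the function names here are hypothetical.

```python
from collections import Counter


def column_conservation(msa):
    """Fraction of sequences sharing the most common residue in each column."""
    length = len(msa[0])
    if any(len(seq) != length for seq in msa):
        raise ValueError("aligned sequences must have equal length")
    scores = []
    for column in zip(*msa):
        _, count = Counter(column).most_common(1)[0]
        scores.append(count / len(msa))
    return scores


def conserved_columns(msa, min_fraction=1.0):
    """Indices (0-based) of columns whose conservation reaches min_fraction."""
    return [i for i, s in enumerate(column_conservation(msa)) if s >= min_fraction]


# Invented family: only the central motif "HCWGC" is shared.
toy_family = [
    "AKDEHCWGCLTR",
    "LMNPHCWGCQSV",
    "TYRFHCWGCIKE",
    "GSVQHCWGCDNA",
]
print(conserved_columns(toy_family))
```

The flanking columns contribute nothing to recognising the family, so the low overall identity says little about how well the informative positions discriminate members from non-members.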

By the way, this kind of dimensionality reduction happens all over bioinformatics, from biomarker discovery (ignore genes that do not differ between control and experiment) to sequence classification (remove uninformative parts of the sequence).

modified 5.2 years ago • written 5.2 years ago by Manuel Landesfeind