Categorizing Protein Families?
2
3
10.8 years ago
ilovepython ▴ 130

Following on from my previous question about discovering protein homology: after finding sequences of interest by searching against a profile of a family, I want to determine whether these sequences can be assigned to that family or not. How can one score proteins against each other so that they can be grouped this way?

Originally, this "family" was determined via simple statistics (pairwise z-scores computed from alignments against shuffled versions of these sequences), although I'm not convinced this is sophisticated enough to determine membership. I'm therefore looking for a more rigorous scoring method. There are important secondary structures that I am adding to my scoring function, but beyond this, I can't find much on Google regarding this type of scoring.
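For readers unfamiliar with the shuffle-based z-score described above, here is a minimal sketch of the general idea. It is not the exact procedure used for this family: it assumes a simple identity-based global alignment score (no substitution matrix), and the function names `global_alignment_score` and `shuffle_zscore` and all parameter values are hypothetical.

```python
import random
import statistics


def global_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score with linear gap penalties."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def shuffle_zscore(query, target, n_shuffles=200, seed=0):
    """Z-score of the real alignment score against a shuffled-target null.

    Shuffling preserves the target's length and composition, so the null
    distribution reflects what composition alone would score.
    """
    rng = random.Random(seed)
    observed = global_alignment_score(query, target)
    residues = list(target)
    null_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)
        null_scores.append(global_alignment_score(query, "".join(residues)))
    mean = statistics.mean(null_scores)
    sd = statistics.stdev(null_scores)
    # Assumption: a degenerate null (no variance) is reported as 0.0.
    return (observed - mean) / sd if sd > 0 else 0.0
```

A high z-score only says the pair scores better than composition-matched random sequences; it does not by itself establish family membership, which is the limitation raised in the question.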

protein homology scoring scoring • 2.3k views
5
10.8 years ago

If you are interested in including secondary and tertiary structure for categorisation, I strongly suggest you look at the methods used by SUPERFAMILY, which is SCOP-based, and Gene3D, which is CATH-based.

1

Well, if you look at the Superfamily database, you'll find that it is in fact a collection of HMMs just like Pfam, SMART, and InterPro. The difference lies in how they made the multiple sequence alignments.

1

Knowing how the MSAs are created is indeed critical; many people overlook these two, as they are derived from structure rather than function. Note that although Gene3D and SUPERFAMILY are InterPro member databases, you need to check InterPro's release notes to see how many of the HMMs have been integrated (only about half of Gene3D's have been so far). Hence I recommend going to each site directly to get the latest data.

4
10.8 years ago

Looking at the accepted answer to your previous question, I wonder why simply running hmmsearch with your custom HMM would not do the job? Building an HMM based on a manually checked multiple sequence alignment and then using it for searching would be the standard way to identify members of a protein family.
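As a rough illustration of that standard workflow, the sketch below wraps the HMMER command-line tools `hmmbuild` and `hmmsearch` from Python and reads the per-sequence table written by `--tblout`. It assumes HMMER 3 is installed and on the PATH; the file names, the E-value cutoff of 1e-5, and the helper function names are hypothetical choices, not recommendations.

```python
import subprocess


def hmmbuild_cmd(hmm_path, alignment_path):
    """Command that builds a profile HMM from a curated alignment."""
    return ["hmmbuild", hmm_path, alignment_path]


def hmmsearch_cmd(hmm_path, fasta_path, tblout_path, evalue=1e-5):
    """Command that searches candidate sequences with the profile."""
    return ["hmmsearch", "--tblout", tblout_path, "-E", str(evalue),
            hmm_path, fasta_path]


def parse_tblout(text):
    """Return (target name, full-sequence E-value, bit score) per hit.

    In hmmsearch --tblout output, column 1 is the target name, column 5
    the full-sequence E-value, and column 6 the full-sequence bit score.
    """
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split()
        hits.append((fields[0], float(fields[4]), float(fields[5])))
    return hits


def search_family(alignment_path, fasta_path,
                  hmm_path="family.hmm", tblout_path="hits.tbl"):
    """Build the profile, run the search, and return the parsed hits."""
    subprocess.run(hmmbuild_cmd(hmm_path, alignment_path),
                   check=True, stdout=subprocess.DEVNULL)
    subprocess.run(hmmsearch_cmd(hmm_path, fasta_path, tblout_path),
                   check=True, stdout=subprocess.DEVNULL)
    with open(tblout_path) as handle:
        return parse_tblout(handle.read())
```

Something like `search_family("family.sto", "candidates.fasta")` would then return the candidate sequences that pass the cutoff, with both input file names being placeholders for your own data.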

Perhaps more important: if you make your own scoring scheme, how will you check whether it works better than just using hmmsearch? Surely, if you choose not to use the well-tested approach, someone will (and should!) ask you to present evidence that your solution is an improvement.
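One simple way to produce such evidence is to run both methods on a benchmark set whose true family members are already known and compare precision and recall. The sketch below assumes such a labelled set exists; the function names and the method labels in the usage note are hypothetical.

```python
def precision_recall(predicted, true_members):
    """Precision and recall of a predicted member set against known members."""
    predicted, true_members = set(predicted), set(true_members)
    true_positives = len(predicted & true_members)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(true_members) if true_members else 0.0
    return precision, recall


def compare_methods(predictions_by_method, true_members):
    """Map each method name to its (precision, recall) on the same benchmark."""
    return {
        name: precision_recall(predicted, true_members)
        for name, predicted in predictions_by_method.items()
    }
```

For example, `compare_methods({"hmmsearch": hmm_hits, "custom": custom_hits}, known_members)` puts the two approaches side by side, where each value is a collection of sequence identifiers you supply.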

0

Makes sense! I thought more rigorous tests were needed to determine membership, but it seems that HMMER's profile HMMs already provide a statistically sophisticated way to determine this.