Tool for Protein Alignment/Superimposition
8 days ago
antoniaa ▴ 10

Hi everybody,

I have about 2000 protein structures that need to be classified based on their structural and sequence similarities. I need to sort them into several groups and then pick one representative for each group. Can you suggest a reliable and useful tool for that? I also need to record the RMSD values for each. I would appreciate any help. Thank you!

7 days ago
Mensur Dlakic ★ 11k

There are many tools for structural alignment of protein structures - see here. Here are a couple of examples that I have used and that I know will provide the information you want.

However, grouping protein structures by structural similarity is a non-trivial task, especially selecting a representative structure. You may want to check out the existing classifications of protein structures and compare yours against the known grouping.

I say this because RMSD values and sequence identity of aligned structures are often not enough to unambiguously establish the relatedness between structures. Here is an example of two proteins that share less than 20% identity and have a fairly high RMSD value:

Yet they are definitely related, which is fairly obvious when one looks only at the alignment of their catalytic residues (the view is from the top of the previous image):

Thank you so much! I checked the sites for existing classifications, but I could not find what I was looking for. Since I am a real newcomer to this field, what would you suggest?

In my opinion, this is not a project for a beginner to do without supervision, especially on such a large number of proteins.

As to the classification, let's say that a PDB structure 4pze is one of those you are interested in. Simply go to the ECOD site listed above, choose search by PDB ID and enter 4pze. You will get this output:

http://prodata.swmed.edu/ecod/complete/search?kw=4pze&type=pdbid

If you do the same exercise in the CATH database, this will be the output:

http://www.cathdb.info/search?q=4pze

That particular protein has two domains, as you will see from the classification. I don't know which one would be interesting to you - that you will have to figure out by yourself. ECOD is updated weekly, so you may want to stick with that database. CATH is also updated fairly frequently, but I think not weekly. The SCOP database is updated least often, so you may want to skip it.

Now, you don't want to copy and paste a PDB code 2000 times into the search box, but all of these databases have downloadable files that can be parsed locally for matches to your group of proteins. Once you find your domain of interest, it is pretty straightforward to match your IDs to the existing classification, though it may not be so for you if you have never done it. My point still stands: it would be much faster and more accurate to use the classification from existing databases than to classify on your own based on RMSD comparisons.
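To give an idea of what the local parsing could look like, here is a minimal Python sketch. It assumes the downloaded file is tab-delimited, that header and comment lines start with `#`, and that each row describes one domain. The column positions, the function names and the example rows are all hypothetical, so check the header of the file you actually download and adjust accordingly.

```python
import io
from collections import defaultdict


def load_classification(handle, pdb_col, class_col, domain_col):
    """Map lowercase PDB ID -> list of (domain_id, classification) tuples.

    Column positions are 0-based. They are assumptions here and must be
    set to match the real layout of the downloaded file.
    """
    table = defaultdict(list)
    needed = max(pdb_col, class_col, domain_col)
    for line in handle:
        line = line.rstrip("\n")
        # Skip blank lines and header/comment lines starting with '#'.
        if not line.strip() or line.startswith("#"):
            continue
        row = line.split("\t")
        if len(row) <= needed:
            continue  # truncated or malformed line
        table[row[pdb_col].lower()].append((row[domain_col], row[class_col]))
    return table


def match_ids(pdb_ids, table):
    """Return (found, missing) for a list of PDB IDs of interest."""
    found, missing = {}, []
    for pdb_id in pdb_ids:
        key = pdb_id.strip().lower()
        if key in table:
            found[key] = table[key]
        else:
            missing.append(key)
    return found, missing


# Made-up rows for illustration only; these are not real database entries.
EXAMPLE = (
    "#domain_id\tpdb\tclassification\n"
    "d_0001\t1abc\tgroupA.1\n"
    "d_0002\t1abc\tgroupB.7\n"
    "d_0003\t2xyz\tgroupA.1\n"
)

table = load_classification(io.StringIO(EXAMPLE), pdb_col=1, class_col=2, domain_col=0)
found, missing = match_ids(["1ABC", "2xyz", "9zzz"], table)
```

For a real file, replace `io.StringIO(EXAMPLE)` with `open(path)`. A PDB ID with several domains gets several entries in `found`, so you still have to decide which domain is the one you care about, and anything in `missing` has to be looked at by hand.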

