Question

Evaluating protein structure predictions

4.8 years ago

Anand Rao ▴ 630

I am new to protein structure prediction. A protein structure informatics expert at my institution advised me to check the quality of structure predictions before any downstream use. My 2-step pipeline is:

Step 1. Structure prediction with LOMETS or I-TASSER
Step 2. Structure evaluation with ProQ or QMEAN

I am most interested in the F-box domain. The PDB-RCSB database holds crystal structures for more than 10 proteins that contain this domain.

As a practice run, I predicted structures for 2 F-box domain sequences:

One sequence from the PF00646 seed alignment (LALTKLPPELLVQVLSHVPPRALVTRCRPVCRAWRDLVDGPSIWLLQLA)
Another sequence from 1FQV-A of PDB-RCSB (VSWDSLPDELLLGIFSCLCLPELLKVSGVCKRWYRLASDESLWQTLD)

Given the source of these sequences, they should be bona fide F-box domains, so they serve as 2 positive controls for my pipeline.
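A minimal sketch of how the two control sequences could be checked and written out as FASTA before submitting them to a prediction server. The record names (`PF00646_seed`, `1FQV_A_Fbox`) and the helper `to_fasta` are illustrative assumptions, not part of LOMETS or I-TASSER; only the sequences themselves come from the post.

```python
# Control sequences copied from the post; the record names are hypothetical labels.
controls = {
    "PF00646_seed": "LALTKLPPELLVQVLSHVPPRALVTRCRPVCRAWRDLVDGPSIWLLQLA",
    "1FQV_A_Fbox": "VSWDSLPDELLLGIFSCLCLPELLKVSGVCKRWYRLASDESLWQTLD",
}

# The 20 standard amino acid one-letter codes.
STANDARD_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")


def to_fasta(records, width=60):
    """Return FASTA text for a {name: sequence} mapping, rejecting non-standard residues."""
    lines = []
    for name, seq in records.items():
        unexpected = set(seq) - STANDARD_RESIDUES
        if unexpected:
            raise ValueError(f"{name}: unexpected characters {sorted(unexpected)}")
        # Put the length in the header so short inputs are easy to spot.
        lines.append(f">{name} length={len(seq)}")
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


fasta_text = to_fasta(controls)
print(fasta_text)
```

Recording the length in the header makes it easy to confirm that both inputs are short, single-domain sequences before interpreting any quality score.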

I make these inferences from my ProQ evaluation results (please see screenshot of results in the image below):

For both LOMETS and I-TASSER, and for both sequences, the ProQ LGscore rates the models as "very good models"
For both LOMETS and I-TASSER, for the seed sequence, the MaxSub score rates the models as not even "fairly good"
For both LOMETS and I-TASSER, for the sequence from the solved PDB structure, the MaxSub score rates the models as only "fairly good"

My questions are as follows:

1. Is it valid to evaluate predicted protein structures for short sequences? If yes, is there still a minimum length limit?

2. Why are my MaxSub scores so poor?

3. Can I use only the LGscore results to decide whether to accept or reject a predicted structure? If yes, how should I set the cutoff? Please note that I am using predicted secondary structures for the ProQ evaluations.

4. The same questions, but about my QMEAN results.

THANKS!

protein structure evaluation ProQ QMEAN • 1.3k views

updated 4.8 years ago by jgreener ▴ 390 • written 4.8 years ago by Anand Rao ▴ 630

score 3 · Accepted Answer · 2019-07-01

Your sequences are ~50 residues, which is fine for protein structure prediction. The minimum length limit is a grey area and depends on things like available templates, secondary structure vs. disorder propensity, and whether enough homologous sequences are available to derive coevolution constraints. But you can certainly do meaningful prediction for 50-residue proteins - see the CASP targets for some examples.

To get a feel for why a model has a certain score, you should look at the paper describing the method and see what it uses in its assessment. The authors might also suggest a cutoff for a good/bad model. I would consider using a more modern model quality assessment method than the original ProQ (whose outputs are the predicted LGscore and MaxSub), such as ProQ3, SVMQA or SBROD. See recent CASP publications on estimation of model accuracy if you want rankings of these methods and ideas on what thresholds to use.
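As a rough illustration of applying published cutoffs, the sketch below maps a score to a quality label. The cutoff values are an assumption: they are the ones I recall from the ProQ server documentation and should be checked against the ProQ paper or server page before use. The example scores are hypothetical and are not taken from the screenshot.

```python
from bisect import bisect_left

# ASSUMED cutoffs (verify against the ProQ paper/server before relying on them).
LGSCORE_CUTOFFS = (1.5, 2.5, 4.0)
MAXSUB_CUTOFFS = (0.1, 0.5, 0.8)
LABELS = ("below fairly good", "fairly good", "very good", "extremely good")


def classify(score, cutoffs):
    """Return the label for a score; a score must strictly exceed a cutoff to pass it."""
    # bisect_left counts the cutoffs that are strictly smaller than the score.
    return LABELS[bisect_left(cutoffs, score)]


# Hypothetical scores for one model, for illustration only.
example = {"LGscore": 3.1, "MaxSub": 0.08}
lg_label = classify(example["LGscore"], LGSCORE_CUTOFFS)
maxsub_label = classify(example["MaxSub"], MAXSUB_CUTOFFS)
print(f"LGscore {example['LGscore']}: {lg_label}")
print(f"MaxSub {example['MaxSub']}: {maxsub_label}")
```

Keeping the cutoffs in named tuples makes it easy to swap in thresholds from a different method's paper; the two scores can disagree, as in this hypothetical example, because they measure different things.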

In general, though, homology models, particularly of close homologs, tend to be very accurate at the fold level. You should consider what you will be using the models for downstream. For getting a structure at the fold level they should be fine. For virtual screening of a binding site they are probably less suitable, since predicted structures are less reliable at the fine-grained, side-chain level that such applications depend on.