Tool: New method for protein secondary structure prediction
4.6 years ago
vytarasov ▴ 180

I have already announced here a new program for molecular biologists who use a Mac, BioLabDonkey: https://apps.apple.com/us/app/biolabdonkey/id1470827582?ls=1&mt=12

Now I would like to discuss the new method for protein secondary structure prediction implemented in this program. Any questions are welcome, as are suggestions on what could be improved or changed. A description of the method can be found on this website: https://molbiolinfo.home.blog/biolabdonkey-features

Update: I see better, more specific predictions with a new amino acid grouping; see the updated post: https://molbiolinfo.home.blog/biolabdonkey-features

protein secondary-structure • 1.4k views
ADD COMMENT
4.6 years ago
Mensur Dlakic ★ 27k

It appears that your solution is predicated on simplicity and speed of calculation rather than accuracy, but still: what is the Q3 accuracy on a proper validation dataset?
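For reference, Q3 is the fraction of residues whose three-state label (H, E, or C) is predicted correctly. A minimal sketch follows; the function name and the toy strings are made up for illustration and are not taken from any real protein or from the method under discussion.

```python
def q3_accuracy(observed: str, predicted: str) -> float:
    """Fraction of positions where the predicted state matches the observed one."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted strings must have the same length")
    if not observed:
        raise ValueError("strings must not be empty")
    matches = sum(o == p for o, p in zip(observed, predicted))
    return matches / len(observed)


# Toy strings (hypothetical, for illustration only)
observed = "CCHHHHHHCCEEEEECC"
predicted = "CCHHHHCCCCEEEEEEC"
print(f"Q3 = {q3_accuracy(observed, predicted):.3f}")
```

On a real validation set the same calculation would be run per protein against DSSP assignments reduced to three states, and then averaged or pooled.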

There are other speedy approaches that are based on average residue properties, such as Chou-Fasman. Why not implement something like that? Most people already know what kind of prediction to expect from Chou-Fasman. Also, a simple average of your method with Chou-Fasman (or any other, for that matter) is almost guaranteed to give a better prediction.

https://github.com/jseidel5/chou-fasman-algorithm

JPred API may be useful as well:

https://github.com/MoseleyBioinformaticsLab/jpredapi

ADD COMMENT

The Chou-Fasman method is based on the relative frequencies of each amino acid in alpha helices, beta sheets, and turns in PDB structures. My method is better than Chou-Fasman because it takes into account combinations of amino acids in a pattern, i.e. the relative positions of amino acids within the patterns.

The Q3 accuracy on a proper validation dataset can and should be measured.

See my website: I have added some random examples of accuracy evaluation tests to the post. The method gives results similar to JPred 4 for proteins with no similarity to sequences with known PDB structures, and it is much faster.

Averaging is fine when the two results agree, but it is problematic when they differ: which one should be chosen, given that one method can be wrong and the other correct? What is the mechanism of averaging?
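To make the "relative frequencies" concrete: a Chou-Fasman-style propensity for a residue in a given state is the fraction of that residue's occurrences found in the state, divided by the fraction of all residues in that state. The sketch below shows only this per-residue table, not the full Chou-Fasman nucleation and extension rules, and not the pattern-based method described here. The function name and the toy sequence are hypothetical.

```python
from collections import Counter


def propensities(sequence: str, states: str, state: str) -> dict[str, float]:
    """Per-residue propensity for `state`: (n_aa_in_state / n_aa) / (n_state / n_total)."""
    if len(sequence) != len(states):
        raise ValueError("sequence and states must have the same length")
    state_fraction = states.count(state) / len(states)
    if state_fraction == 0:
        raise ValueError(f"state {state!r} does not occur in the data")
    total_per_aa = Counter(sequence)
    in_state_per_aa = Counter(aa for aa, s in zip(sequence, states) if s == state)
    return {
        aa: (in_state_per_aa[aa] / n) / state_fraction
        for aa, n in total_per_aa.items()
    }


# Toy data (hypothetical); a real table would be built from many PDB chains
sequence = "AEAKLGPGAEVV"
states = "HHHHHCCCHHEE"
helix = propensities(sequence, states, "H")
print({aa: round(p, 2) for aa, p in sorted(helix.items())})
```

A value above 1 means the residue is found in that state more often than average, and a value below 1 means less often.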

ADD REPLY

It is difficult to judge the quality of a predictive model based on a few examples, but the examples you provided lead me to believe that this is not a good model. This is not based only on absolute accuracy, but also on types of errors.

Most good secondary structure models mistake helices for coils, or strands for coils. They do not mistake helices for strands, at least not in long stretches. See the alignment below for what I mean (the dssp line is the real structure).

>dssp
CEEEEEECCCHHHCHHHHHHHHHHHCCCEEEEECCCCCHHHCCHHHHHHCCCEEEEEEECCEEEEEEEECCCCCEEEECCCCCCHHHHCCEEEEEEEECCEEEEEEEEECCCCCCCCCCCHHHHHHHHHHHHHHHHHHHCCCCCCEEEEEECCCCCCHHHCCCCHHHHHHHHHHCCCCCCHHHHHHHHHHHHCCEEEHHHHHCCCCCCCCCCCCCCCCHHHHCCCCCCEEEEEEHHHHCCEEEEEECHHHHCCCCCCCCCCEEEEECC
>predicted
CEEEEEECCCCCCCHHHHHHHHHCCCCCEEEEECCCCCCCCCCHHHHHHCCCEEEEECCCCCCEEEEEECCCCHHHEECCCCCCCCCCCCEEEEEEECCCCCEEEEEEECCCCCCCCCCHHHHHHHHHHHHHHHHHHHHHCCCCCEEEECCCCCCCCCCCCCCCCCCHHHCCCCCCCCCCHHHHHHHHHHHHCCCEEHHHHHCCCCCCCCCCCCCCCCCCCCCCCCEEEEEEECHHHHHHHHHCCCCHHHCCCCCCCCCCEEEEEEEC

In your first sequence alone (SufSE96A) I count five stretches of good size where helices are mistaken for strands or vice versa. Not only will that bring down the prediction accuracy, but it also lowers the segment overlap value (see here).
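Such stretches can also be counted automatically. The following is a minimal sketch; the function name, the `min_len` threshold of 3, and the toy strings are assumptions for illustration, since "good size" is not defined numerically above. Adjacent H-to-E and E-to-H errors are merged into one run.

```python
from itertools import groupby


def helix_strand_swaps(observed: str, predicted: str, min_len: int = 3) -> list[tuple[int, int]]:
    """Return (start, length) for each run where H is predicted as E or E as H."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted strings must have the same length")
    # True at positions where one string has H and the other has E
    swapped = [{o, p} == {"H", "E"} for o, p in zip(observed, predicted)]
    runs = []
    pos = 0
    for is_swap, group in groupby(swapped):
        length = len(list(group))
        if is_swap and length >= min_len:
            runs.append((pos, length))
        pos += length
    return runs


# Toy strings (hypothetical, not the SufSE96A data)
observed = "CCHHHHHHCCEEEEECC"
predicted = "CCEEEEHHCCEEHHHCC"
print(helix_strand_swaps(observed, predicted))
```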

What is the mechanism of averaging?

If your predictions are absolute, meaning your method only spits out a single state per position, there will be no way to average predictions. What is the average of C and H? However, most SS predictive models output probabilities for each possible outcome, and those can be averaged.
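As a minimal sketch of what averaging probabilities could look like: each method is assumed to output, for every residue, a probability for H, E, and C. The data layout (a list of dicts), the function name, and the numbers are hypothetical and do not correspond to the output format of any particular server.

```python
STATES = "HEC"


def average_predictions(probs_a: list[dict[str, float]],
                        probs_b: list[dict[str, float]]) -> str:
    """Average two per-residue probability tables and pick the most likely state."""
    if len(probs_a) != len(probs_b):
        raise ValueError("both predictions must cover the same number of residues")
    consensus = []
    for pa, pb in zip(probs_a, probs_b):
        mean = {s: (pa[s] + pb[s]) / 2 for s in STATES}
        # On an exact tie, max() keeps the first state in STATES order
        consensus.append(max(STATES, key=mean.get))
    return "".join(consensus)


# Hypothetical probabilities for three residues from two methods
method_a = [{"H": 0.7, "E": 0.1, "C": 0.2},
            {"H": 0.5, "E": 0.1, "C": 0.4},
            {"H": 0.2, "E": 0.1, "C": 0.7}]
method_b = [{"H": 0.6, "E": 0.1, "C": 0.3},
            {"H": 0.2, "E": 0.1, "C": 0.7},
            {"H": 0.1, "E": 0.2, "C": 0.7}]
print(average_predictions(method_a, method_b))
```

At the second residue the two methods disagree (one favors H, the other C); the average sides with whichever method is more confident, which is how disagreements get resolved without having to pick one method outright.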

ADD REPLY

Which protein is in your alignment example, and which prediction method was used there? I would like to compare it with my method.

Where does this rule about good secondary structure models come from? Could you give a reference?

I have doubts about the strength of this rule, since, according to Wikipedia: "Limitations are also imposed by secondary structure prediction's inability to account for tertiary structure; for example, a sequence predicted as a likely helix may still be able to adopt a beta-strand conformation if it is located within a beta-sheet region of the protein and its side chains pack well with their neighbors."

The comparison of the PDB structure and the prediction for SufSE96A is in line with this statement: helices can be mistaken for strands as a result of tertiary structure constraints.

ADD REPLY

The sequence from the example I used previously:

>1ako_A
MKFVSFNINGLRARPHQLEAIVEKHQPDVIGLQETKVHDDMFPLEEVAKLGYNVFYHGQK
GHYGVALLTKETPIAVRRGFPGDDEEAQRRIIMAEIPSLLGNVTVINGYFPQGESRDHPI
KFPAKAQFYQNLQNYLETELKRDNPVLIMGDMNISPTDLDIGIGEENRKRWLRTGKCSFL
PEEREWMDRLMSWGLVDTFRHANPQTADRFSWFDYRSKGFDDNRGLRIDLLLASQPLAEC
CVETGIDYEIRSMEKPSDHAPVWATFRR

It was predicted using this server, though it will probably save you some time if you go directly to results.

To understand types of errors in secondary structure, I suggest you read classic neural network papers by Rost and Sander, including this review and this paper.

As to why predictions rarely mistake helix for strand and vice versa, it has to do with their unique structural properties. Below is a plot of the distributions of phi and psi angles of protein residues from high-quality PDB structures. In the upper left corner all residues are plotted; going clockwise from there, the panels show helix, strand, and coil residues.

[Figure: phi/psi angle distributions for all residues (upper left) and, clockwise, helix, strand, and coil residues]

You will see that there is almost no overlap in distributions of helix and strand residues, while coil residues overlap with both categories.

If we repeat the same exercise with kappa and alpha angles, the result will be qualitatively the same.

[Figure: the corresponding distributions of kappa and alpha angles]

The point is that helices and strands are so distinct structurally that good predictive programs have no problem distinguishing between them, even though they are not using structural information.

As to other predictive servers, there are literally hundreds of them. I list several that I have used and know to be of good quality.

http://bioinf.cs.ucl.ac.uk/psipred/

https://zhanglab.ccmb.med.umich.edu/PSSpred/

http://www.compbio.dundee.ac.uk/jpred/

http://distilldeep.ucd.ie/porter/

http://scratch.proteomics.ics.uci.edu

ADD REPLY

Thank you for the information.

As we can see from the plots for helix and strand, there are:

1. an overlapping spot of the same intensity at around 50 (y) : 50 (x);
2. an overlapping spot of different intensity and size at around -50 (y) : -100 (x);
3. an overlapping spot of very different intensity and size at around 120 (y) : -50 (x).

From the plots it is clear that it is incorrect to say there is no probability at all for a sequence to adopt either helix or strand.
Of course, it is correct that there are preferred phi and psi angles for strand and helix.

Since the random coil is not a true secondary structure, but rather a set of conformations with no regular secondary structure, it is not surprising that helix-coil or strand-coil mistakes are the more acceptable ones: a predicted coil can actually adopt any regular secondary structure in the real tertiary structure.

Overall my method over-predicts helices, and I probably need to apply an additional filter to the predicted helices. I also see that predicted helix regions are often longer than in the PDB structure, which indicates that tertiary structure constraints can play a role at both ends.
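One simple form such a filter could take is to relabel very short predicted helices as coil. This is only a sketch of one possible post-processing step, not the filter used in the program; the function name and the `min_len` default of 4 (roughly one helical turn of 3.6 residues) are assumptions. It does not address helices that are predicted too long.

```python
import re


def drop_short_helices(predicted: str, min_len: int = 4) -> str:
    """Relabel helix runs shorter than `min_len` residues as coil (C)."""
    def relabel(match: re.Match) -> str:
        run = match.group(0)
        return run if len(run) >= min_len else "C" * len(run)

    # Each maximal run of H is either kept as-is or replaced with coil
    return re.sub(r"H+", relabel, predicted)


# Toy prediction string (hypothetical)
print(drop_short_helices("CCHHCCHHHHHCC"))
```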
