As far as I know, amino acids E, M, A, and L are commonly found in helices and V, I, Y, F, W, and L in sheets, but the Biopython package assigns E, M, A, and L to sheets and V, I, Y, F, W, and L to helices. Which one is correct?
To be specific, I'm working on a research project on enzymes and want to use the Biopython package to compute the secondary structure fraction. I looked at the source code (https://github.com/biopython/biopython/blob/master/Bio/SeqUtils/ProtParam.py), and here is how Biopython defines the secondary structure fraction:
def secondary_structure_fraction(self):
    """
    Calculate fraction of helix, turn and sheet.
    Returns a list of the fraction of amino acids which tend to be in Helix, Turn or Sheet.
    Amino acids in helix: V, I, Y, F, W, L.
    Amino acids in Turn: N, P, G, S.
    Amino acids in sheet: E, M, A, L.
    Returns a tuple of three floats (Helix, Turn, Sheet).
    """
    aa_percentages = self.get_amino_acids_percent()
    helix = sum(aa_percentages[r] for r in "VIYFWL")
    turn = sum(aa_percentages[r] for r in "NPGS")
    sheet = sum(aa_percentages[r] for r in "EMAL")
    return helix, turn, sheet
The documentation at https://biopython.org/docs/1.76/api/Bio.SeqUtils.ProtParam.html lists no reference to support this definition. So I searched the literature and textbooks and, to my surprise, found that amino acids E, M, A, and L are commonly found in helices, while V, I, Y, F, W, and L are prevalent in sheets. If that is right, I assume the code should be modified as follows:
def secondary_structure_fraction(self):
    aa_percentages = self.get_amino_acids_percent()
    helix = sum(aa_percentages[r] for r in "EMAL")
    turn = sum(aa_percentages[r] for r in "NPGS")
    sheet = sum(aa_percentages[r] for r in "VIYFWL")
    return helix, turn, sheet
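To see how much the labelling changes the result without patching Biopython, here is a minimal standalone sketch that computes both versions for a plain sequence string. It does not use Biopython at all; the names `group_fraction`, `fractions_as_quoted`, and `fractions_as_proposed` are my own hypothetical helpers, and the residue groups are simply copied from the two snippets above. Note that L appears in both groups, so the helix and sheet fractions can overlap.

```python
from collections import Counter

# Residue groups copied from the snippets above. Which group should be
# labelled "helix" and which "sheet" is exactly the open question.
HYDROPHOBIC_GROUP = "VIYFWL"
TURN_GROUP = "NPGS"
EMAL_GROUP = "EMAL"


def group_fraction(sequence, residues):
    """Return the fraction of `sequence` made up of residues in `residues`."""
    sequence = sequence.upper()
    if not sequence:
        raise ValueError("sequence must not be empty")
    counts = Counter(sequence)
    return sum(counts[r] for r in residues) / len(sequence)


def fractions_as_quoted(sequence):
    """(helix, turn, sheet) using the labelling in the quoted source."""
    return (
        group_fraction(sequence, HYDROPHOBIC_GROUP),
        group_fraction(sequence, TURN_GROUP),
        group_fraction(sequence, EMAL_GROUP),
    )


def fractions_as_proposed(sequence):
    """(helix, turn, sheet) using the swapped labelling proposed above."""
    return (
        group_fraction(sequence, EMAL_GROUP),
        group_fraction(sequence, TURN_GROUP),
        group_fraction(sequence, HYDROPHOBIC_GROUP),
    )


# Made-up toy sequence, for illustration only.
example = "AEVLGN"
print("as quoted:  ", fractions_as_quoted(example))
print("as proposed:", fractions_as_proposed(example))
```

The two functions return the same three numbers with the first and last swapped, so the only thing at stake is which label each residue group gets.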
I'm new to protein science and not sure which one is correct. Does anyone have an idea?
Thank you for your reply! The website is also helpful.