I am trying to generate contact predictions from PFAM MSAs, but I need to reliably map a given protein family sequence (a specific sequence from the PFAM MSA) to its corresponding PDB sequence.
Take as an example PF00011:
The PFAM reference sequence is: ['D' 'W' 'K' 'E' 'T' 'P' 'E' 'A' 'H' 'V' 'F' 'K' 'A' 'D' 'L' 'P' 'G' 'V' 'K' 'K' 'E' 'E' 'V' 'K' 'V' 'E' 'V' 'E' 'D' 'G' 'N' 'v' 'L' 'V' 'V' 'S' 'G' 'E' 'R' 'T' 'k' 'e' 'K' 'E' 'D' 'K' 'N' 'D' 'K' 'W' 'H' 'R' 'V' 'E' 'R' 'S' 'S' 'G' 'K' 'F' 'V' 'R' 'R' 'F' 'R' 'L' 'L' 'E' 'D' 'A' 'K' 'V' 'E' 'E' 'V' 'K' 'A' 'G' 'L' 'E' 'N' 'G' 'V' 'L' 'T' 'V' 'T' 'V' 'P' 'K' 'A' 'E' 'V' 'K' 'K' 'P' 'E' 'V' 'K' 'A' 'I' 'Q' 'I' 'S']
... and loading the PDB sequence using the PFAM-provided PDB-id '2BYU' I get the following sequence: ['N', 'A', 'R', 'M', 'D', 'W', 'K', 'E', 'T', 'P', 'E', 'A', 'H', 'V', 'F', 'K', 'A', 'D', 'L', 'P', 'G', 'V', 'K', 'K', 'E', 'E', 'V', 'K', 'V', 'E', 'V', 'E', 'D', 'G', 'N', 'V', 'L', 'V', 'V', 'S', 'G', 'E', 'R', 'T', 'K', 'E', 'K', 'E', 'D', 'K', 'N', 'D', 'K', 'W', 'H', 'R', 'V', 'E', 'R', 'S', 'S', 'G', 'K', 'F', 'V', 'R', 'R', 'F', 'R', 'L', 'L', 'E', 'D', 'A', 'K', 'V', 'E', 'E', 'V', 'K', 'A', 'G', 'L', 'E', 'N', 'G', 'V', 'L', 'T', 'V', 'T', 'V', 'P', 'K', 'A', 'A', 'I', 'Q', 'I', 'S', 'G']
Both sequences are nearly identical, except for the additional 'N', 'A', 'R', 'M' at the beginning of the PDB sequence. Is there some reference that allows us to extract the exactly matching sequence from the PDB database?
Thanks in advance, Evan
I don't know what MSA you plan to use - Pfam has several of them for each family - but they may not be diverse enough for reliable contact prediction.
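On the mapping itself: if no ready-made residue-level mapping is at hand, one rough option is to match the two sequences directly and keep an index map. The following is only a minimal sketch using Python's standard-library difflib, which finds exact matching blocks and is not a biological alignment algorithm. The function name map_pfam_to_pdb is made up for illustration, and the sketch assumes the Pfam row uses lowercase for insert states and '-' or '.' for gaps.

```python
from difflib import SequenceMatcher


def map_pfam_to_pdb(pfam_row, pdb_seq):
    """Return {index in pfam_row: index in pdb_seq} for residues that fall
    inside exactly matching blocks. Hypothetical helper, not a Pfam/PDB API."""
    # Keep the original column index of every non-gap character and
    # compare in uppercase (assumption: lowercase marks insert states).
    kept = [(i, c.upper()) for i, c in enumerate(pfam_row) if c not in "-."]
    pfam = "".join(c for _, c in kept)
    pdb = "".join(pdb_seq).upper()

    # autojunk=False disables difflib's heuristic for "popular" elements,
    # which is unhelpful for a 20-letter alphabet.
    matcher = SequenceMatcher(None, pfam, pdb, autojunk=False)

    mapping = {}
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            mapping[kept[block.a + k][0]] = block.b + k
    return mapping


# Short excerpt of the sequences from the question.
pfam_excerpt = "DWKETPEAHvF"
pdb_excerpt = "NARMDWKETPEAHVFKA"
mapping = map_pfam_to_pdb(pfam_excerpt, pdb_excerpt)
print(mapping[0])  # PDB index of the first Pfam residue
```

Residues that do not fall in any matching block (extra residues in the PDB entry, or stretches present in only one sequence) are simply absent from the returned dictionary. For sequences that differ by substitutions, a real pairwise aligner would be a better fit than exact block matching.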