Question

PDBePISA calls monomers heteromers: how can one programmatically correct this downstream?

2.7 years ago

plberry ▴ 30

I have been tasked with creating a data table that records, for as many proteins in the human proteome as this information is known, the quaternary structure, whether the complex the protein forms is a homomer or a heteromer, and the binding energy of the complex. (I.e., any solution that involves me manually looking through PDBe entries is a non-starter.)

Leaving aside the difficulty of nailing down a non-redundant PDB proteome, I have encountered a consistent problem with how the PDB, and therefore every tool built on it, classifies "homomers" vs. "heteromers."

For example, 4HHB - human deoxyhemoglobin, is identified as a heterotetramer. It contains four chains in total: two pairs of identical chains. This is the classification I would want in my data table.

However, 5DB7 - human DNA polymerase beta is also identified as a heterotetramer by PDBe. It contains only ONE amino acid chain and is annotated in UniProt as a monomer, so I would want it to be a monomer in my data table. It appears that ANY additional sequence in the PDB file, even one that is not an amino acid sequence, forces a "heteromer" classification, and the non-protein components of the biological assembly are counted as further subunits in the assembly composition.

Is there a way to programmatically determine which proteins are true heteromers for my purposes, like 4HHB, and which are actually monomers or homomers, like 5DB7? I have tried working with PDBePISA queries, but even with ONLY Protein or P interactions checked in the FILTER section it reports 5DB7 as a tetramer, and as far as I can see the table returned by a database search has no information on whether a complex is homo- or heteromeric.
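To make the desired rule concrete, here is a minimal sketch of the classification, assuming the per-entity composition of an assembly is already available as `(entity_id, molecule_type, copies)` records. The function name, the record layout, and the toy inputs are hypothetical and are modeled only on the descriptions of 4HHB and 5DB7 above; the molecule-type strings follow the mmCIF `_entity_poly.type` vocabulary. It does not query PDBe or PISA.

```python
from collections import Counter

# Assumption: protein entities are labelled with mmCIF-style polymer types.
PROTEIN_TYPES = {"polypeptide(L)", "polypeptide(D)"}


def classify_assembly(entities):
    """Classify an assembly from its protein entities only.

    entities: iterable of (entity_id, molecule_type, copies) tuples
              (hypothetical record layout, not a PDBe response format).
    Returns (label, number_of_protein_chains).
    """
    protein_copies = Counter()
    for entity_id, molecule_type, copies in entities:
        # Nucleic acids, ligands, etc. are ignored entirely.
        if molecule_type in PROTEIN_TYPES:
            protein_copies[entity_id] += copies

    total = sum(protein_copies.values())
    if total == 0:
        return ("no protein", 0)
    if total == 1:
        return ("monomer", 1)
    # More than one chain: one distinct entity means homomer, else heteromer.
    label = "homomer" if len(protein_copies) == 1 else "heteromer"
    return (label, total)


# Toy input shaped like the 4HHB description: two protein entities, two copies each.
hemoglobin_like = [
    (1, "polypeptide(L)", 2),
    (2, "polypeptide(L)", 2),
]

# Toy input shaped like the 5DB7 description: one protein chain plus
# three non-protein chains (illustrative stand-ins).
polymerase_like = [
    (1, "polypeptide(L)", 1),
    (2, "polydeoxyribonucleotide", 1),
    (3, "polydeoxyribonucleotide", 1),
    (4, "polydeoxyribonucleotide", 1),
]

print(classify_assembly(hemoglobin_like))
print(classify_assembly(polymerase_like))
```

Counting distinct entity IDs rather than chain IDs is what separates a homomer from a heteromer here; it relies on identical chains sharing one entity, which is an assumption about the upstream data.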

Right now the only option I can see for moving forward is writing a web scraper that, once I have a list of PDB complexes, goes through each https://www.ebi.ac.uk/pdbe/entry/pdb/XXX/analysis page, checks whether the different sections contain proteins or other molecules, and does some simple arithmetic to find out what is left over once the non-protein chains are taken out. The results will then have to be checked against each other, since each UniProt accession generally has more than one PDB ID associated with it. This is obviously not ideal, because it means a lot of sequential hits on the PDBe website and will take a long time.
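The cross-checking step could look roughly like the sketch below, which groups per-entry calls by UniProt accession and flags accessions whose PDB entries disagree. The function name, the record layout, and the accession and PDB identifiers are all hypothetical placeholders; how to resolve a disagreement (majority vote, preferring a particular entry, manual review) is left open.

```python
from collections import Counter, defaultdict


def consolidate(calls):
    """Group per-entry classifications by UniProt accession.

    calls: iterable of (accession, pdb_id, label) tuples
           (hypothetical record layout).
    Returns {accession: {"labels": {...}, "consistent": bool, "pdb_ids": [...]}}.
    """
    by_accession = defaultdict(list)
    for accession, pdb_id, label in calls:
        by_accession[accession].append((pdb_id, label))

    summary = {}
    for accession, entries in by_accession.items():
        counts = Counter(label for _, label in entries)
        summary[accession] = {
            "labels": dict(counts),
            # True only if every PDB entry for this accession agrees.
            "consistent": len(counts) == 1,
            "pdb_ids": sorted(pdb_id for pdb_id, _ in entries),
        }
    return summary


# Placeholder identifiers, not real accessions or PDB entries.
example_calls = [
    ("ACC_A", "XXX1", "monomer"),
    ("ACC_A", "XXX2", "monomer"),
    ("ACC_B", "XXX3", "heteromer"),
    ("ACC_B", "XXX4", "homomer"),
]

result = consolidate(example_calls)
print(result["ACC_A"]["consistent"])
print(result["ACC_B"]["labels"])
```

Keeping the full label counts, rather than collapsing to a single answer immediately, leaves the inconsistent accessions available for whatever tie-breaking rule is chosen later.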

PDBe quaternary PISA structure • 482 views
