Question

Getting The Atom And Strands From A Pdb Xml File

3

13.9 years ago

Will 4.6k

I'm trying to extract some information from a set of PDB XML files. The difficulty I'm having is associating ATOMs with CHAINs. In the "flat-file" format the chain ID is on the same line as the ATOM information, so it's pretty easy. However, in the XML version it seems that this info is separated between the PDBx:pdbx_poly_seq_scheme and PDBx:atom_site records. I'm having trouble making sure that I'm associating the correct atom_site objects with the correct pdbx_poly_seq_scheme records. I've included some examples of each; these come from pdb:1BIS.

The simple question: what IDs should I use to 'join' these records? It's not obvious from the documentation.

atom_sites:

  <PDBx:atom_site id="1015">
     <PDBx:B_iso_or_equiv>24.26</PDBx:B_iso_or_equiv>
     <PDBx:B_iso_or_equiv_esd xsi:nil="true" />
     <PDBx:Cartn_x>-4.448</PDBx:Cartn_x>
     <PDBx:Cartn_x_esd xsi:nil="true" />
     <PDBx:Cartn_y>-14.262</PDBx:Cartn_y>
     <PDBx:Cartn_y_esd xsi:nil="true" />
     <PDBx:Cartn_z>4.417</PDBx:Cartn_z>
     <PDBx:Cartn_z_esd xsi:nil="true" />
     <PDBx:auth_asym_id>A</PDBx:auth_asym_id>
     <PDBx:auth_atom_id>O</PDBx:auth_atom_id>
     <PDBx:auth_comp_id>LEU</PDBx:auth_comp_id>
     <PDBx:auth_seq_id>172</PDBx:auth_seq_id>
     <PDBx:group_PDB>ATOM</PDBx:group_PDB>
     <PDBx:label_alt_id></PDBx:label_alt_id>
     <PDBx:label_asym_id>A</PDBx:label_asym_id>
     <PDBx:label_atom_id>O</PDBx:label_atom_id>
     <PDBx:label_comp_id>LEU</PDBx:label_comp_id>
     <PDBx:label_entity_id>1</PDBx:label_entity_id>
     <PDBx:label_seq_id>126</PDBx:label_seq_id>
     <PDBx:occupancy>1.00</PDBx:occupancy>
     <PDBx:occupancy_esd xsi:nil="true" />
     <PDBx:pdbx_PDB_ins_code xsi:nil="true" />
     <PDBx:pdbx_PDB_model_num>1</PDBx:pdbx_PDB_model_num>
     <PDBx:pdbx_formal_charge xsi:nil="true" />
     <PDBx:type_symbol>O</PDBx:type_symbol>
  </PDBx:atom_site>
  <PDBx:atom_site id="1016">
     <PDBx:B_iso_or_equiv>25.89</PDBx:B_iso_or_equiv>
     <PDBx:B_iso_or_equiv_esd xsi:nil="true" />
     <PDBx:Cartn_x>-3.267</PDBx:Cartn_x>
     <PDBx:Cartn_x_esd xsi:nil="true" />
     <PDBx:Cartn_y>-16.870</PDBx:Cartn_y>
     <PDBx:Cartn_y_esd xsi:nil="true" />
     <PDBx:Cartn_z>6.060</PDBx:Cartn_z>
     <PDBx:Cartn_z_esd xsi:nil="true" />
     <PDBx:auth_asym_id>A</PDBx:auth_asym_id>
     <PDBx:auth_atom_id>CB</PDBx:auth_atom_id>
     <PDBx:auth_comp_id>LEU</PDBx:auth_comp_id>
     <PDBx:auth_seq_id>172</PDBx:auth_seq_id>
     <PDBx:group_PDB>ATOM</PDBx:group_PDB>
     <PDBx:label_alt_id></PDBx:label_alt_id>
     <PDBx:label_asym_id>A</PDBx:label_asym_id>
     <PDBx:label_atom_id>CB</PDBx:label_atom_id>
     <PDBx:label_comp_id>LEU</PDBx:label_comp_id>
     <PDBx:label_entity_id>1</PDBx:label_entity_id>
     <PDBx:label_seq_id>126</PDBx:label_seq_id>
     <PDBx:occupancy>1.00</PDBx:occupancy>
     <PDBx:occupancy_esd xsi:nil="true" />
     <PDBx:pdbx_PDB_ins_code xsi:nil="true" />
     <PDBx:pdbx_PDB_model_num>1</PDBx:pdbx_PDB_model_num>
     <PDBx:pdbx_formal_charge xsi:nil="true" />
     <PDBx:type_symbol>C</PDBx:type_symbol>
  </PDBx:atom_site>

poly_seqs:

  <PDBx:pdbx_poly_seq_scheme asym_id="A" entity_id="1" mon_id="SER" seq_id="11">
     <PDBx:auth_mon_id>SER</PDBx:auth_mon_id>
     <PDBx:auth_seq_num>57</PDBx:auth_seq_num>
     <PDBx:hetero>n</PDBx:hetero>
     <PDBx:ndb_seq_num>11</PDBx:ndb_seq_num>
     <PDBx:pdb_ins_code></PDBx:pdb_ins_code>
     <PDBx:pdb_mon_id>SER</PDBx:pdb_mon_id>
     <PDBx:pdb_seq_num>57</PDBx:pdb_seq_num>
     <PDBx:pdb_strand_id>A</PDBx:pdb_strand_id>
  </PDBx:pdbx_poly_seq_scheme>
  <PDBx:pdbx_poly_seq_scheme asym_id="A" entity_id="1" mon_id="PRO" seq_id="12">
     <PDBx:auth_mon_id>PRO</PDBx:auth_mon_id>
     <PDBx:auth_seq_num>58</PDBx:auth_seq_num>
     <PDBx:hetero>n</PDBx:hetero>
     <PDBx:ndb_seq_num>12</PDBx:ndb_seq_num>
     <PDBx:pdb_ins_code></PDBx:pdb_ins_code>
     <PDBx:pdb_mon_id>PRO</PDBx:pdb_mon_id>
     <PDBx:pdb_seq_num>58</PDBx:pdb_seq_num>
     <PDBx:pdb_strand_id>A</PDBx:pdb_strand_id>
  </PDBx:pdbx_poly_seq_scheme>
  <PDBx:pdbx_poly_seq_scheme asym_id="A" entity_id="1" mon_id="GLY" seq_id="13">
     <PDBx:auth_mon_id>GLY</PDBx:auth_mon_id>
     <PDBx:auth_seq_num>59</PDBx:auth_seq_num>
     <PDBx:hetero>n</PDBx:hetero>
     <PDBx:ndb_seq_num>13</PDBx:ndb_seq_num>
     <PDBx:pdb_ins_code></PDBx:pdb_ins_code>
     <PDBx:pdb_mon_id>GLY</PDBx:pdb_mon_id>
     <PDBx:pdb_seq_num>59</PDBx:pdb_seq_num>
     <PDBx:pdb_strand_id>A</PDBx:pdb_strand_id>
  </PDBx:pdbx_poly_seq_scheme>

If you're asking why I don't just use the flat-files ... it's because I have some other structures which have only been generated in the XML format and cannot be regenerated in the flat-file format.

Thanks a bunch, Will

pdb parsing xml • 6.0k views

ADD COMMENT • link updated 13.9 years ago by Aleksandr Levchuk 3.2k • written 13.9 years ago by Will 4.6k

1

How exactly are you parsing this?

Taking the flat file entry for atom site 1015:

ATOM   1015  O   LEU A 172      -4.448 -14.262   4.417  1.00 24.26           O

I will freely admit that I know nothing about PDB files, but is the 'A' in this entry a reference to the chain? So isn't this the same info that is in [?]A[?] in the atom_sites? It seems to represent the same information. I may have completely missed your point, however, so feel free to point it out if I have.

ADD REPLY • link updated 5.1 years ago by Ram 44k • written 13.8 years ago by User 59 13k

1

No, it does not seem to be the chain ID: when I visually scan through records with multiple chains, the label_asym_id doesn't change.

I'm parsing this with Python's xml library. I'm not having any trouble getting the info out, just linking these two pieces of information.
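Since the parsing is already in Python, one quick way to test that observation is to tally the (label_asym_id, auth_asym_id) pairs over all atom_site records rather than scanning by eye. The following is only a sketch: the sample XML is an invented stand-in (not real 1BIS content), the namespace URI is a placeholder, and the helper names `local` and `asym_id_pairs` are made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented stand-in for a PDBML file; the namespace URI is a placeholder.
SAMPLE = """<PDBx:datablock xmlns:PDBx="http://example.org/pdbx">
  <PDBx:atom_siteCategory>
    <PDBx:atom_site id="1">
      <PDBx:auth_asym_id>A</PDBx:auth_asym_id>
      <PDBx:label_asym_id>A</PDBx:label_asym_id>
    </PDBx:atom_site>
    <PDBx:atom_site id="2">
      <PDBx:auth_asym_id>A</PDBx:auth_asym_id>
      <PDBx:label_asym_id>A</PDBx:label_asym_id>
    </PDBx:atom_site>
    <PDBx:atom_site id="3">
      <PDBx:auth_asym_id>B</PDBx:auth_asym_id>
      <PDBx:label_asym_id>B</PDBx:label_asym_id>
    </PDBx:atom_site>
  </PDBx:atom_siteCategory>
</PDBx:datablock>"""


def local(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def asym_id_pairs(root):
    """Count atoms per (label_asym_id, auth_asym_id) pair."""
    counts = Counter()
    for elem in root.iter():
        if local(elem.tag) != "atom_site":
            continue
        fields = {local(child.tag): child.text for child in elem}
        counts[(fields.get("label_asym_id"), fields.get("auth_asym_id"))] += 1
    return counts


for (label_id, auth_id), n in sorted(asym_id_pairs(ET.fromstring(SAMPLE)).items()):
    print(label_id, auth_id, n)
```

On a real file, replace `ET.fromstring(SAMPLE)` with `ET.parse(path).getroot()`. If every pair that comes out has the same label_asym_id, the observation holds for that file; if not, label_asym_id does vary with the chain.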

ADD REPLY • link 13.8 years ago by Will 4.6k

0

Also, I've noticed that the atom_sites and poly_seq_schemes are not in the same order, so I can't just match based on residue identities.

ADD REPLY • link 13.9 years ago by Will 4.6k

0

If it's not clear what I'm asking for, leave a comment and I'll try to clear it up :)

ADD REPLY • link 13.8 years ago by Will 4.6k

0

Is there a 1-to-1 match between ATOMs and CHAINs? Can you provide a link to complete XML files?

ADD REPLY • link 13.8 years ago by Aleksandr Levchuk 3.2k

0

When you say "in flat-file format the chain ID is on the same line as the ATOM", how exactly does the data look? For example, in http://www.pdb.org/pdb/files/1BIS.pdb, would the chain ID of the atom with id=1015 simply be "A"?

ADD REPLY • link 13.8 years ago by Aleksandr Levchuk 3.2k

Ram · Answer 1 · 2010-12-30

Use the PDBx:label_seq_id tag value in atom_sites, e.g.:

<PDBx:atom_site id="1015">
  <...>
  <PDBx:label_seq_id>126</PDBx:label_seq_id>
  <...>
</PDBx:atom_site>

and match that to the seq_id attribute in poly_seqs, e.g.:

<PDBx:pdbx_poly_seq_scheme asym_id="B" ... seq_id="126">
   <PDBx:auth_mon_id>LEU</PDBx:auth_mon_id>
   <...>
</PDBx:pdbx_poly_seq_scheme>

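Now you can join the atom_site items with the poly_seq_scheme items. If seq_id values repeat across chains, as the asym_id="B" record above with seq_id="126" suggests, match PDBx:label_asym_id against the asym_id attribute as well.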

Take for example: http://www.pdb.org/pdb/files/1BIS.xml - There are 3274 atom sites, and in those PDBx:label_seq_id ranges from 10 to 163.

And the poly_seq_schemes have all the ids from 1 to 166. So it's one poly_seq_scheme to many atom sites.

The other candidate (PDBx:auth_seq_id) does not work because it ranges from 56 to 1078 in PDBx:atom_siteCategory.
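A minimal sketch of this join in Python (the asker mentions using Python's xml library). It is not tested against the real 1BIS.xml: the sample below is a cut-down stand-in with a placeholder namespace URI, the child values of the chain B record are invented, and the helper names `local`, `children` and `join_atoms_to_residues` are made up for illustration. The lookup key is (asym_id, seq_id) rather than seq_id alone, on the assumption that seq_id values can repeat from one chain to the next.

```python
import xml.etree.ElementTree as ET

# Cut-down stand-in for a PDBML file; the namespace URI is a placeholder
# and the chain B record's child values are invented for illustration.
SAMPLE = """<PDBx:datablock xmlns:PDBx="http://example.org/pdbx">
  <PDBx:atom_siteCategory>
    <PDBx:atom_site id="1015">
      <PDBx:auth_asym_id>A</PDBx:auth_asym_id>
      <PDBx:label_asym_id>A</PDBx:label_asym_id>
      <PDBx:label_atom_id>O</PDBx:label_atom_id>
      <PDBx:label_seq_id>126</PDBx:label_seq_id>
    </PDBx:atom_site>
  </PDBx:atom_siteCategory>
  <PDBx:pdbx_poly_seq_schemeCategory>
    <PDBx:pdbx_poly_seq_scheme asym_id="A" entity_id="1" mon_id="LEU" seq_id="126">
      <PDBx:pdb_seq_num>172</PDBx:pdb_seq_num>
      <PDBx:pdb_strand_id>A</PDBx:pdb_strand_id>
    </PDBx:pdbx_poly_seq_scheme>
    <PDBx:pdbx_poly_seq_scheme asym_id="B" entity_id="1" mon_id="LEU" seq_id="126">
      <PDBx:pdb_seq_num>172</PDBx:pdb_seq_num>
      <PDBx:pdb_strand_id>B</PDBx:pdb_strand_id>
    </PDBx:pdbx_poly_seq_scheme>
  </PDBx:pdbx_poly_seq_schemeCategory>
</PDBx:datablock>"""


def local(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def children(elem):
    """Map each child's local tag name to its text for one record."""
    return {local(child.tag): (child.text or "") for child in elem}


def join_atoms_to_residues(root):
    """Return (atom id, atom fields, residue fields or None) per atom_site."""
    # Index residues by (asym_id, seq_id) so document order does not matter.
    residues = {}
    for elem in root.iter():
        if local(elem.tag) == "pdbx_poly_seq_scheme":
            residues[(elem.get("asym_id"), elem.get("seq_id"))] = children(elem)

    joined = []
    for elem in root.iter():
        if local(elem.tag) == "atom_site":
            atom = children(elem)
            key = (atom.get("label_asym_id"), atom.get("label_seq_id"))
            joined.append((elem.get("id"), atom, residues.get(key)))
    return joined


root = ET.fromstring(SAMPLE)
for atom_id, atom, residue in join_atoms_to_residues(root):
    chain = residue["pdb_strand_id"] if residue else None
    print(atom_id, atom["label_atom_id"], chain)
```

Atoms with no matching poly_seq_scheme record (waters or ligands, for instance) come back with `None` as the residue, so they can be filtered or handled separately.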

score 1 · Answer 2 · 2010-12-26

1

13.8 years ago

Hanif Khalak ★ 1.3k

This might work: try converting to "Sequence-Coordinates Correspondence" (SC) format using xml2pdb, and then you can directly correlate what you see in the XML with what you end up with in the SC-formatted chains.

From the authors: "Users who wish additional records in their PDB files are urged to edit the source files and recompile the program."

ADD COMMENT • link 13.8 years ago by Hanif Khalak ★ 1.3k

0

It's just silly to convert them to another format just to get a small piece of data out when the data I need is right there.

ADD REPLY • link 13.8 years ago by Will 4.6k

0

I was thinking that the code which generates the output in xml2pdb could point to the link between chain and atom.

ADD REPLY • link 13.8 years ago by Hanif Khalak ★ 1.3k

score 1 · Answer 3 · 2010-12-26

I found this explanation on the ProtBuD website.

Here is the interesting part:

Parsing XML files.
PDBML (Westbrook, et al., 2005) is part of the uniformity project (Bhat, et al., 2001) of the PDB. The PDB XML data files preserve the logical data model of the PDB Exchange Data Dictionary (Westbrook and Fitzgerald, 2003). Data can be retrieved quickly from XML files, and most software development environments provide libraries to read and write XML files. From the XML files, we retrieve the following data:

- the entity_id and name for each type of molecule in the structure,
- for each entity_id, the asymmetric unit contents in terms of asym_ids; there may be several asym_ids for a given entity_id
- the biological unit contents consisting of symmetry operators applied to asym_ids
- for protein and nucleic acid polymers, the author chain IDs for each asym_id molecule to provide links with other databases such as PQS and SCOP that use the author chain IDs; the XML files provide the information that the author chain ID’s may be blank, but this information is only provided for polypeptide entities.
- information on covalent attachments and modified residues, defined in terms of asym_ids and residue numbers, and atom names
- structural determination data such as experiment type, space group, transformation matrices for converting to unit cell coordinates, missing residues, resolution, and R-factors

We use the asym_ids in the XML files to link the required information for asymmetric and biological units. Since the biological units are defined in terms of the asym_ids and symmetry operators of the space group, the asym_ids are sufficient for defining both asymmetric and biological units in the XML files. However, ligands are not always assigned properly to specific biological units. Often when an asymmetric unit is broken up into more than one biological unit, all of the non-polymer ligands are assigned to the first unit. This is a limitation of the current state of the PDB and may be resolved in future releases of the PDB (J. Westbrook and H. Berman, personal communication).

Covalent attachments and modified residues are identified uniquely in the XML files, and are connected to other data fields by asym_ids, residue numbers, and atom names. The categories used in our database are described in Table 1.
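The link the passage describes, from each asym_id to an author chain ID, can apparently be read off the pdbx_poly_seq_scheme records shown in the question, where pdb_strand_id holds the chain ID. A rough Python sketch under that assumption follows; the sample XML is a trimmed stand-in with a placeholder namespace URI and an invented chain B record, and `asym_to_author_chain` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Trimmed stand-in for a PDBML file; the namespace URI is a placeholder
# and the chain B record is invented for illustration.
SAMPLE = """<PDBx:datablock xmlns:PDBx="http://example.org/pdbx">
  <PDBx:pdbx_poly_seq_schemeCategory>
    <PDBx:pdbx_poly_seq_scheme asym_id="A" entity_id="1" mon_id="SER" seq_id="11">
      <PDBx:pdb_strand_id>A</PDBx:pdb_strand_id>
    </PDBx:pdbx_poly_seq_scheme>
    <PDBx:pdbx_poly_seq_scheme asym_id="A" entity_id="1" mon_id="PRO" seq_id="12">
      <PDBx:pdb_strand_id>A</PDBx:pdb_strand_id>
    </PDBx:pdbx_poly_seq_scheme>
    <PDBx:pdbx_poly_seq_scheme asym_id="B" entity_id="1" mon_id="SER" seq_id="11">
      <PDBx:pdb_strand_id>B</PDBx:pdb_strand_id>
    </PDBx:pdbx_poly_seq_scheme>
  </PDBx:pdbx_poly_seq_schemeCategory>
</PDBx:datablock>"""


def asym_to_author_chain(root):
    """Map each asym_id to the first pdb_strand_id seen for it."""
    mapping = {}
    for elem in root.iter():
        # Compare local names so the namespace URI does not matter.
        if elem.tag.rsplit("}", 1)[-1] != "pdbx_poly_seq_scheme":
            continue
        for child in elem:
            if child.tag.rsplit("}", 1)[-1] == "pdb_strand_id":
                mapping.setdefault(elem.get("asym_id"), child.text)
    return mapping


print(asym_to_author_chain(ET.fromstring(SAMPLE)))
```

With such a mapping in hand, the label_asym_id of any atom_site can be translated to a chain ID with a single dictionary lookup. As the passage notes, this information is only provided for polymer entities, so non-polymer asym_ids will be missing from the mapping.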

Ram · Answer 4 · 2010-12-29

0

13.8 years ago

Payal ▴ 150

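use strict;
use warnings;

# Copy every ATOM record of chain I from 1atp.pdb into ichain.pdb.
open(my $in, '<', '1atp.pdb') or die "Cannot open 1atp.pdb: $!";
# Open the output file (append mode) before the loop, not after printing.
open(my $out, '>>', 'ichain.pdb') or die "Cannot open ichain.pdb: $!";
while (my $line = <$in>) {
    next unless $line =~ /^ATOM/;
    # The chain ID is column 22 of a PDB record (0-based offset 21).
    my $chain = substr($line, 21, 1);
    print {$out} $line if $chain eq 'I';
}
close($in);
close($out);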

ADD COMMENT • link updated 5.1 years ago by Ram 44k • written 13.8 years ago by Payal ▴ 150

0

Can you please add some context... this is Perl, right?

ADD REPLY • link 13.8 years ago by Egon Willighagen 5.4k

0

-1 Looks fake to me...

ADD REPLY • link 13.8 years ago by Aleksandr Levchuk 3.2k

0

This is for the flat-file PDB format. @Will - I wish you had not lost those.

ADD REPLY • link 13.8 years ago by Aleksandr Levchuk 3.2k

0

-1 for code that does not work

$ echo 'open(FH,"1BIS.pdb"); while($a=) { if($a=~/^ATOM/) { $s=substr($a,21,1); if($s eq "I") { print FH1 $a; open(FH1,>>"ichain.pdb") } }}' | perl
syntax error at - line 1, near "=) "
syntax error at - line 1, near ",>>"
syntax error at - line 1, near "} }"

ADD REPLY • link updated 5.1 years ago by Ram 44k • written 13.8 years ago by Aleksandr Levchuk 3.2k

0

I'll change my rating to an Up vote once the code is fixed. This is what I initially wanted to do before I checked if your Perl code actually works. Please consider looking at other languages like R, Python, or Ruby.
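For comparison, here is roughly what the same flat-file chain filter could look like in Python. This is only a sketch: `extract_chain` is a made-up helper name, and the demo writes a throwaway file instead of reading 1atp.pdb. It relies on the chain ID sitting in column 22 of an ATOM record, the same offset the Perl snippet reads with substr($a,21,1).

```python
import os
import tempfile


def extract_chain(pdb_path, chain_id, out_path):
    """Copy ATOM records whose chain ID (column 22) equals chain_id.

    Returns the number of records written to out_path.
    """
    kept = 0
    with open(pdb_path) as src, open(out_path, "w") as dst:
        for line in src:
            # Column 22 is index 21 when counting from zero.
            if line.startswith("ATOM") and len(line) > 21 and line[21] == chain_id:
                dst.write(line)
                kept += 1
    return kept


# Demo on a throwaway file: the first record is the 1BIS line quoted earlier
# in the thread, the second is a copy with the chain ID changed to "I".
line_a = "ATOM   1015  O   LEU A 172      -4.448 -14.262   4.417  1.00 24.26           O\n"
line_i = line_a[:21] + "I" + line_a[22:]
with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "demo.pdb")
    dst_path = os.path.join(tmp, "ichain.pdb")
    with open(src_path, "w") as handle:
        handle.write(line_a + line_i)
    print(extract_chain(src_path, "I", dst_path))  # number of records copied
```

For the files named in the answer, the call would be `extract_chain("1atp.pdb", "I", "ichain.pdb")`. Unlike the Perl version, this overwrites the output file rather than appending to it.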