Question: Getting The Atom And Strands From A Pdb Xml File
3
8.2 years ago by
Will4.5k
United States
Will4.5k wrote:

I'm trying to extract some information from a set of PDB XML files. The difficulty I'm having is associating ATOMs with CHAINs. In the "flat-file" format the chain ID is on the same line as the ATOM information, so it's pretty easy. However, in the XML version this info seems to be separated between PDBx:pdbx_poly_seq_scheme and PDBx:atom_site. I'm having trouble making sure that I'm associating the correct atom_site objects with the correct pdbx_poly_seq_scheme records. I've included some examples from each; these come from pdb:1BIS.

The simple question: what IDs should I use to 'join' these records? It's not obvious from the documentation.

atom_sites:

  <PDBx:atom_site id="1015">
     <PDBx:B_iso_or_equiv>24.26</PDBx:B_iso_or_equiv>
     <PDBx:B_iso_or_equiv_esd xsi:nil="true" />
     <PDBx:Cartn_x>-4.448</PDBx:Cartn_x>
     <PDBx:Cartn_x_esd xsi:nil="true" />
     <PDBx:Cartn_y>-14.262</PDBx:Cartn_y>
     <PDBx:Cartn_y_esd xsi:nil="true" />
     <PDBx:Cartn_z>4.417</PDBx:Cartn_z>
     <PDBx:Cartn_z_esd xsi:nil="true" />
     <PDBx:auth_asym_id>A</PDBx:auth_asym_id>
     <PDBx:auth_atom_id>O</PDBx:auth_atom_id>
     <PDBx:auth_comp_id>LEU</PDBx:auth_comp_id>
     <PDBx:auth_seq_id>172</PDBx:auth_seq_id>
     <PDBx:group_PDB>ATOM</PDBx:group_PDB>
     <PDBx:label_alt_id></PDBx:label_alt_id>
     <PDBx:label_asym_id>A</PDBx:label_asym_id>
     <PDBx:label_atom_id>O</PDBx:label_atom_id>
     <PDBx:label_comp_id>LEU</PDBx:label_comp_id>
     <PDBx:label_entity_id>1</PDBx:label_entity_id>
     <PDBx:label_seq_id>126</PDBx:label_seq_id>
     <PDBx:occupancy>1.00</PDBx:occupancy>
     <PDBx:occupancy_esd xsi:nil="true" />
     <PDBx:pdbx_PDB_ins_code xsi:nil="true" />
     <PDBx:pdbx_PDB_model_num>1</PDBx:pdbx_PDB_model_num>
     <PDBx:pdbx_formal_charge xsi:nil="true" />
     <PDBx:type_symbol>O</PDBx:type_symbol>
  </PDBx:atom_site>
  <PDBx:atom_site id="1016">
     <PDBx:B_iso_or_equiv>25.89</PDBx:B_iso_or_equiv>
     <PDBx:B_iso_or_equiv_esd xsi:nil="true" />
     <PDBx:Cartn_x>-3.267</PDBx:Cartn_x>
     <PDBx:Cartn_x_esd xsi:nil="true" />
     <PDBx:Cartn_y>-16.870</PDBx:Cartn_y>
     <PDBx:Cartn_y_esd xsi:nil="true" />
     <PDBx:Cartn_z>6.060</PDBx:Cartn_z>
     <PDBx:Cartn_z_esd xsi:nil="true" />
     <PDBx:auth_asym_id>A</PDBx:auth_asym_id>
     <PDBx:auth_atom_id>CB</PDBx:auth_atom_id>
     <PDBx:auth_comp_id>LEU</PDBx:auth_comp_id>
     <PDBx:auth_seq_id>172</PDBx:auth_seq_id>
     <PDBx:group_PDB>ATOM</PDBx:group_PDB>
     <PDBx:label_alt_id></PDBx:label_alt_id>
     <PDBx:label_asym_id>A</PDBx:label_asym_id>
     <PDBx:label_atom_id>CB</PDBx:label_atom_id>
     <PDBx:label_comp_id>LEU</PDBx:label_comp_id>
     <PDBx:label_entity_id>1</PDBx:label_entity_id>
     <PDBx:label_seq_id>126</PDBx:label_seq_id>
     <PDBx:occupancy>1.00</PDBx:occupancy>
     <PDBx:occupancy_esd xsi:nil="true" />
     <PDBx:pdbx_PDB_ins_code xsi:nil="true" />
     <PDBx:pdbx_PDB_model_num>1</PDBx:pdbx_PDB_model_num>
     <PDBx:pdbx_formal_charge xsi:nil="true" />
     <PDBx:type_symbol>C</PDBx:type_symbol>
  </PDBx:atom_site>

poly_seqs:

  <PDBx:pdbx_poly_seq_scheme asym_id="A" entity_id="1" mon_id="SER" seq_id="11">
     <PDBx:auth_mon_id>SER</PDBx:auth_mon_id>
     <PDBx:auth_seq_num>57</PDBx:auth_seq_num>
     <PDBx:hetero>n</PDBx:hetero>
     <PDBx:ndb_seq_num>11</PDBx:ndb_seq_num>
     <PDBx:pdb_ins_code></PDBx:pdb_ins_code>
     <PDBx:pdb_mon_id>SER</PDBx:pdb_mon_id>
     <PDBx:pdb_seq_num>57</PDBx:pdb_seq_num>
     <PDBx:pdb_strand_id>A</PDBx:pdb_strand_id>
  </PDBx:pdbx_poly_seq_scheme>
  <PDBx:pdbx_poly_seq_scheme asym_id="A" entity_id="1" mon_id="PRO" seq_id="12">
     <PDBx:auth_mon_id>PRO</PDBx:auth_mon_id>
     <PDBx:auth_seq_num>58</PDBx:auth_seq_num>
     <PDBx:hetero>n</PDBx:hetero>
     <PDBx:ndb_seq_num>12</PDBx:ndb_seq_num>
     <PDBx:pdb_ins_code></PDBx:pdb_ins_code>
     <PDBx:pdb_mon_id>PRO</PDBx:pdb_mon_id>
     <PDBx:pdb_seq_num>58</PDBx:pdb_seq_num>
     <PDBx:pdb_strand_id>A</PDBx:pdb_strand_id>
  </PDBx:pdbx_poly_seq_scheme>
  <PDBx:pdbx_poly_seq_scheme asym_id="A" entity_id="1" mon_id="GLY" seq_id="13">
     <PDBx:auth_mon_id>GLY</PDBx:auth_mon_id>
     <PDBx:auth_seq_num>59</PDBx:auth_seq_num>
     <PDBx:hetero>n</PDBx:hetero>
     <PDBx:ndb_seq_num>13</PDBx:ndb_seq_num>
     <PDBx:pdb_ins_code></PDBx:pdb_ins_code>
     <PDBx:pdb_mon_id>GLY</PDBx:pdb_mon_id>
     <PDBx:pdb_seq_num>59</PDBx:pdb_seq_num>
     <PDBx:pdb_strand_id>A</PDBx:pdb_strand_id>
  </PDBx:pdbx_poly_seq_scheme>

If you're asking why I don't just use the flat files: it's because I have some other structures which have only been generated in the XML format and cannot be regenerated in the flat-file format.

Thanks a bunch, Will

pdb xml parsing • 2.5k views
ADD COMMENTlink modified 8.2 years ago by Aleksandr Levchuk3.1k • written 8.2 years ago by Will4.5k
1

How exactly are you parsing this? Taking the flat-file entry for atom site 1015: "ATOM 1015 O LEU A 172 -4.448 -14.262 4.417 1.00 24.26 O". I will freely admit that I know nothing about PDB files, but is the 'A' in this entry a reference to the chain? If so, isn't that the same 'A' that appears in the atom_sites? It seems to represent the same information. I may have completely missed your point, however, so feel free to point it out if I have.

ADD REPLYlink written 8.2 years ago by Daniel Swan13k
1

No, it does not seem to be the chain ID: when I visually scan through records with multiple chains, the label_asym_id doesn't change.

I'm parsing this with Python's xml library. I'm not having any trouble getting the info out, just linking these two pieces of information.

ADD REPLYlink written 8.2 years ago by Will4.5k

And I've noticed that the atom_sites and poly_schemes are not in the same order, so I can't just match based on residue identities.

ADD REPLYlink written 8.2 years ago by Will4.5k

If it's not clear what I'm asking for, then leave a comment and I'll try to clear it up :)

ADD REPLYlink written 8.2 years ago by Will4.5k

Is there a 1-to-1 match between ATOMs and CHAINs? Can you provide a link to complete XML files?

ADD REPLYlink written 8.2 years ago by Aleksandr Levchuk3.1k

When you say "in flat-file format the chain ID is on the same line as the ATOM", how specifically does the data look? For example, in http://www.pdb.org/pdb/files/1BIS.pdb, would the chain ID for the atom with id=1015 simply be "A"?

ADD REPLYlink written 8.2 years ago by Aleksandr Levchuk3.1k
4
8.2 years ago by
United States
Aleksandr Levchuk3.1k wrote:

Use the PDBx:label_seq_id tag value in atom_sites, e.g.:

<PDBx:atom_site id="1015">
  <...>
  <PDBx:label_seq_id>126</PDBx:label_seq_id>
  <...>
</PDBx:atom_site>

and match that to the seq_id attribute in poly_seqs, e.g.:

<PDBx:pdbx_poly_seq_scheme asym_id="B" ... seq_id="126">
   <PDBx:auth_mon_id>LEU</PDBx:auth_mon_id>
   <...>
</PDBx:pdbx_poly_seq_scheme>

Now you can join the atom_site items with the poly_seq_scheme items.

Take for example: http://www.pdb.org/pdb/files/1BIS.xml - There are 3274 atom sites, and in those PDBx:label_seq_id ranges from 10 to 163. And the poly_seq_schemes have all the ids from 1 to 166. So it's one poly_seq_scheme to many atom sites.

The other candidate (PDBx:auth_seq_id) does not fly because it ranges from 56 to 1078 in PDBx:atom_siteCategory.
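
One caveat worth noting: seq_id appears to restart for each chain (the example above shows seq_id="126" under asym_id="B", while atom 1015 carries label_seq_id 126 under label_asym_id A), so joining on the pair (label_asym_id, label_seq_id) against (asym_id, seq_id) is probably safer than joining on seq_id alone. Here is a minimal Python sketch of that join, assuming the standard library's xml.etree.ElementTree. The sample XML is hand-made for illustration, its namespace URI is a placeholder, and the helper names are hypothetical:

```python
import xml.etree.ElementTree as ET

# Hand-made sample, NOT a real PDB entry. The namespace URI is a placeholder;
# real PDBML files declare a schema-specific URI on the root element.
SAMPLE = """\
<PDBx:datablock xmlns:PDBx="urn:example:pdbx">
  <PDBx:atom_siteCategory>
    <PDBx:atom_site id="1015">
      <PDBx:label_asym_id>A</PDBx:label_asym_id>
      <PDBx:label_seq_id>126</PDBx:label_seq_id>
    </PDBx:atom_site>
    <PDBx:atom_site id="2500">
      <PDBx:label_asym_id>B</PDBx:label_asym_id>
      <PDBx:label_seq_id>126</PDBx:label_seq_id>
    </PDBx:atom_site>
  </PDBx:atom_siteCategory>
  <PDBx:pdbx_poly_seq_schemeCategory>
    <PDBx:pdbx_poly_seq_scheme asym_id="A" entity_id="1" mon_id="LEU" seq_id="126">
      <PDBx:pdb_strand_id>A</PDBx:pdb_strand_id>
    </PDBx:pdbx_poly_seq_scheme>
    <PDBx:pdbx_poly_seq_scheme asym_id="B" entity_id="1" mon_id="LEU" seq_id="126">
      <PDBx:pdb_strand_id>B</PDBx:pdb_strand_id>
    </PDBx:pdbx_poly_seq_scheme>
  </PDBx:pdbx_poly_seq_schemeCategory>
</PDBx:datablock>
"""


def local(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def child_text(elem, name):
    """Return the text of the first direct child called `name`, or None."""
    for child in elem:
        if local(child.tag) == name:
            return child.text
    return None


def join_atoms_to_residues(root):
    """Pair each atom_site with its pdbx_poly_seq_scheme record."""
    # Index residues by (asym_id, seq_id); seq_id alone repeats across chains.
    residues = {}
    for elem in root.iter():
        if local(elem.tag) == "pdbx_poly_seq_scheme":
            residues[(elem.get("asym_id"), elem.get("seq_id"))] = elem

    rows = []
    for elem in root.iter():
        if local(elem.tag) == "atom_site":
            asym = child_text(elem, "label_asym_id")
            seq = child_text(elem, "label_seq_id")
            residue = residues.get((asym, seq))  # None for unmatched atoms
            strand = child_text(residue, "pdb_strand_id") if residue is not None else None
            rows.append((elem.get("id"), asym, seq, strand))
    return rows


rows = join_atoms_to_residues(ET.fromstring(SAMPLE))
for row in rows:
    print(row)
```

For a real file, ET.parse(path).getroot() would replace ET.fromstring(SAMPLE). Atoms without a matching polymer record (waters or ligands, for instance) come back with strand set to None rather than raising an error.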

ADD COMMENTlink modified 8.2 years ago • written 8.2 years ago by Aleksandr Levchuk3.1k

perfect. Exactly what I was looking for!

ADD REPLYlink written 8.2 years ago by Will4.5k
1
8.2 years ago by
Hanif Khalak1.2k
Doha, QA
Hanif Khalak1.2k wrote:

This might work: try converting to "Sequence-Coordinates Correspondence" (SC) format using xml2pdb, and then you can directly correlate what you see in the XML with what you end up with in the SC-formatted chains.

From the authors: "Users who wish additional records in their PDB files are urged to edit the source files and recompile the program."

ADD COMMENTlink written 8.2 years ago by Hanif Khalak1.2k

It's just silly to convert them to another format just to get a small piece of data out when the data I need is right there.

ADD REPLYlink written 8.2 years ago by Will4.5k

I was thinking the code which generates the output in xml2pdb could point to the link between chain/atom

ADD REPLYlink written 8.2 years ago by Hanif Khalak1.2k
1
8.2 years ago by
Bilouweb1.1k
Saclay, France
Bilouweb1.1k wrote:

I found this explanation on the web site : ProtBuD

Here is the interesting part :

Parsing XML files.

PDBML (Westbrook, et al., 2005) is part of the uniformity project (Bhat, et al., 2001) of the PDB. The PDB XML data files preserve the logical data model of the PDB Exchange Data Dictionary (Westbrook and Fitzgerald, 2003). Data can be retrieved quickly from XML files, and most software development environments provide libraries to read and write XML files. From the XML files, we retrieve the following data:

- the entity_id and name for each type of molecule in the structure,
- for each entity_id, the asymmetric unit contents in terms of asym_ids; there may be several asym_ids for a given entity_id
- the biological unit contents consisting of symmetry operators applied to asym_ids
- for protein and nucleic acid polymers, the author chain IDs for each asym_id molecule to provide links with other databases such as PQS and SCOP that use the author chain IDs; the XML files provide the information that the author chain ID’s may be blank, but this information is only provided for polypeptide entities.
- information on covalent attachments and modified residues, defined in terms of asym_ids and residue numbers, and atom names
- structural determination data such as experiment type, space group, transformation matrices for converting to unit cell coordinates, missing residues, resolution, and R-factors

We use the asym_ids in the XML files to link the required information for asymmetric and biological units. Since the biological units are defined in terms of the asym_ids and symmetry operators of the space group, the asym_ids are sufficient for defining both asymmetric and biological units in the XML files. However, ligands are not always assigned properly to specific biological units. Often when an asymmetric unit is broken up into more than one biological unit, all of the non-polymer ligands are assigned to the first unit. This is a limitation of the current state of the PDB and may be resolved in future releases of the PDB (J. Westbrook and H. Berman, personal communication). Covalent attachments and modified residues are identified uniquely in the XML files, and are connected to other data fields by asym_ids, residue numbers, and atom names. The categories used in our database are described in Table 1.
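
The link between asym_ids and author chain IDs mentioned in the quote can be read from the pdbx_poly_seq_scheme records shown in the question, where asym_id is an attribute and pdb_strand_id holds the author chain ID. A rough Python sketch, assuming xml.etree.ElementTree with an explicit namespace map. The sample data and the namespace URI are made up, the function name is hypothetical, and the URI passed in would have to match the one declared in the real file:

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Placeholder namespace URI (an assumption); copy the real one from the
# xmlns:PDBx declaration at the top of the PDBML file being read.
NS = {"PDBx": "urn:example:pdbx"}

# Hand-made sample, not taken from any PDB entry.
SAMPLE = """\
<PDBx:datablock xmlns:PDBx="urn:example:pdbx">
  <PDBx:pdbx_poly_seq_schemeCategory>
    <PDBx:pdbx_poly_seq_scheme asym_id="A" entity_id="1" mon_id="SER" seq_id="11">
      <PDBx:pdb_strand_id>A</PDBx:pdb_strand_id>
    </PDBx:pdbx_poly_seq_scheme>
    <PDBx:pdbx_poly_seq_scheme asym_id="A" entity_id="1" mon_id="PRO" seq_id="12">
      <PDBx:pdb_strand_id>A</PDBx:pdb_strand_id>
    </PDBx:pdbx_poly_seq_scheme>
    <PDBx:pdbx_poly_seq_scheme asym_id="B" entity_id="1" mon_id="SER" seq_id="11">
      <PDBx:pdb_strand_id>H</PDBx:pdb_strand_id>
    </PDBx:pdbx_poly_seq_scheme>
  </PDBx:pdbx_poly_seq_schemeCategory>
</PDBx:datablock>
"""


def asym_to_author_chains(root, ns):
    """Map each asym_id to the set of author chain IDs (pdb_strand_id) seen."""
    mapping = defaultdict(set)
    for scheme in root.findall(".//PDBx:pdbx_poly_seq_scheme", ns):
        strand = scheme.findtext("PDBx:pdb_strand_id", namespaces=ns)
        if strand:  # skip records with a blank or missing author chain ID
            mapping[scheme.get("asym_id")].add(strand)
    return dict(mapping)


chains = asym_to_author_chains(ET.fromstring(SAMPLE), NS)
print(chains)
```

Keeping a set per asym_id makes it easy to spot a file where one asym_id unexpectedly maps to more than one author chain.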

ADD COMMENTlink written 8.2 years ago by Bilouweb1.1k
0
8.2 years ago by
Payal150
Payal150 wrote:

open(FH,"1atp.pdb"); while($a=<FH>) { if($a=~/^ATOM/) { $s=substr($a,21,1); if($s eq "I") { print FH1 $a; open(FH1,>>"ichain.pdb") } }}
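
As posted, the Perl one-liner does not compile (the >> append mode is not quoted) and it prints to FH1 before opening it. Below is a minimal Python sketch of what it appears to be aiming for: copying the ATOM records of one chain out of a flat file. It relies on the chain ID sitting in column 22 of a fixed-width PDB record (index 21, the same offset as the substr call). The function name is hypothetical and the second sample line uses made-up coordinates:

```python
import io


def atom_lines_for_chain(lines, chain_id):
    """Yield ATOM records whose chain ID (column 22, index 21) matches."""
    for line in lines:
        if line.startswith("ATOM") and len(line) > 21 and line[21] == chain_id:
            yield line


# Illustrative fixed-width records. The first mirrors atom 1015 quoted in the
# thread; the second is invented to give the filter something to reject.
SAMPLE = (
    "ATOM   1015  O   LEU A 172      -4.448 -14.262   4.417  1.00 24.26           O\n"
    "ATOM   2500  O   LEU B 172       1.000   2.000   3.000  1.00 20.00           O\n"
)

kept = list(atom_lines_for_chain(io.StringIO(SAMPLE), "A"))
print("".join(kept), end="")
```

With a real file, an open file handle can be passed in place of the io.StringIO object, and the kept lines written out with writelines. As noted in the comments, this only helps with flat-file PDBs, not the XML format the question is about.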

ADD COMMENTlink written 8.2 years ago by Payal150

Can you please add some context... this is Perl, right?

ADD REPLYlink written 8.2 years ago by Egon Willighagen5.2k

-1 Looks fake to me...

ADD REPLYlink written 8.2 years ago by Aleksandr Levchuk3.1k

This is for the flat-file PDB format. @Will - I wish you did not lose those.

ADD REPLYlink written 8.2 years ago by Aleksandr Levchuk3.1k

-1 for code that does not work

$ echo 'open(FH,"1BIS.pdb"); while($a=) { if($a=~/^ATOM/) { $s=substr($a,21,1); if($s eq "I") { print FH1 $a; open(FH1,>>"ichain.pdb") } }}' | perl
syntax error at - line 1, near "=) "
syntax error at - line 1, near ",>>"
syntax error at - line 1, near "} }"
ADD REPLYlink written 8.2 years ago by Aleksandr Levchuk3.1k

-1 for code that does not work: (1) syntax error at test.pl line 1, near "=) "; (2) syntax error at test.pl line 1, near ",>>"; (3) syntax error at test.pl line 1, near "} }".

ADD REPLYlink written 8.2 years ago by Aleksandr Levchuk3.1k

-1 for code that does not compile. Syntax error near "=)" and near ",>>".

ADD REPLYlink written 8.2 years ago by Aleksandr Levchuk3.1k

I'll change my rating to an Up vote once the code is fixed. This is what I initially wanted to do before I checked if your Perl code actually works. Please consider looking at other languages like R, Python, or Ruby.

ADD REPLYlink written 8.2 years ago by Aleksandr Levchuk3.1k

If this code worked - it would be only for the flat-file PDBs.

ADD REPLYlink written 8.2 years ago by Aleksandr Levchuk3.1k