Question

Parsing residue numbers from PDB SIFTS XML files

8.3 years ago

bhadra4282 • 0

Dear All,

I have a list of UniProt IDs. Based on these IDs, I would like to parse the ordering of amino acid residues in the ATOM field of PDB structures. However, the residue numbers in the ATOM field do not always match the order of residues in the corresponding ResSeq numbers. After searching Biostars, I found a post about the SIFTS database.

However, the residue number information in the SIFTS database is stored in xml.gz files, and I don't know how to read these files using either R or Python.

I tried some solutions from Biostars itself, but they don't work in my case. I would like to supply UniProt IDs (or PDB IDs) one by one and parse the XML files to get the residue numbers in the PDB and the corresponding residue numbers in the ResSeq field.

I would appreciate suggestions from both R and Python experts, because I would like to learn both approaches.

Link to the SIFTS database: https://www.ebi.ac.uk/pdbe/docs/sifts/quick.html

The XML file repository is here: ftp://ftp.ebi.ac.uk/pub/databases/msd/sifts/split_xml/

Thank you in advance

PDB R SFITS XML PYTHON • 3.1k views

ADD COMMENT • link updated 21 months ago by Ram 43k • written 8.3 years ago by bhadra4282 • 0

Please give us an example.

ADD REPLY • link 8.3 years ago by Pierre Lindenbaum 161k

"I tried some solutions from Biostars itself .But they don't work in my case.": what have you tried?

ADD REPLY • link 8.3 years ago by Pierre Lindenbaum 161k

tmpdir <- tempdir()

url <- 'ftp://ftp.ebi.ac.uk/pub/databases/msd/sifts/split_xml/cr/1crn.xml.gz'
file <- basename(url)
download.file(url, file)

untar(file, compressed = 'gzip', exdir = tmpdir )
list.files(tmpdir)

This is my R code, but it gives an error.
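One thing worth checking: `1crn.xml.gz` appears to be a single gzip-compressed XML file rather than a tar archive, and `untar()` expects a tar archive, which may be why the call fails. As a rough sketch of the same decompress-into-a-directory step in Python, assuming the file has already been downloaded (the helper name `gunzip_to_dir` is hypothetical, and the demo uses a tiny stand-in file rather than real SIFTS data):

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def gunzip_to_dir(gz_path, out_dir):
    """Decompress a single .gz file into out_dir and return the new path."""
    gz_path = Path(gz_path)
    # Path.stem drops only the last suffix: "1crn.xml.gz" -> "1crn.xml"
    out_path = Path(out_dir) / gz_path.stem
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return out_path


# Demo on a stand-in file (NOT real SIFTS content).
tmpdir = Path(tempfile.mkdtemp())
demo_gz = tmpdir / "demo.xml.gz"
with gzip.open(demo_gz, "wb") as fh:
    fh.write(b"<entry/>")

xml_path = gunzip_to_dir(demo_gz, tmpdir)
print(xml_path.name, xml_path.read_text())
```

Decompressing to disk is optional; a gzip file can also be read directly without writing the uncompressed copy.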

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.3 years ago by bhadra4282 • 0

As for Python, I don't know which module would suit me. I have never parsed XML files. I found Beautiful Soup and started working through its examples to learn it, but first of all I don't know how to read these files one by one. Help with that would be great.
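For reading the files one by one, the Python standard library may already be enough: `gzip.open` returns a file object that `xml.etree.ElementTree` can parse directly. The sketch below assumes the `.xml.gz` files have already been downloaded into one local directory and are named `<pdb id>.xml.gz`; the function names are hypothetical. Because the real files may declare an XML namespace, tags are compared by their local name only.

```python
import gzip
import xml.etree.ElementTree as ET
from pathlib import Path


def local_name(tag):
    """Drop a leading '{namespace}' from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def iter_sifts_roots(pdb_ids, directory):
    """Yield (pdb_id, root element) for each locally stored <id>.xml.gz file."""
    for pdb_id in pdb_ids:
        path = Path(directory) / f"{pdb_id.lower()}.xml.gz"
        if not path.exists():
            print(f"skipping {pdb_id}: {path} not found")
            continue
        # Parse straight from the compressed stream; no temporary file needed.
        with gzip.open(path, "rb") as handle:
            yield pdb_id, ET.parse(handle).getroot()


def count_residues(root):
    """Count <residue> elements regardless of XML namespace."""
    return sum(1 for el in root.iter() if local_name(el.tag) == "residue")
```

Looping over `iter_sifts_roots(["1crn", "1gk9"], "sifts_xml")` would then yield one parsed tree per file found, where `sifts_xml` is a placeholder folder name.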

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.3 years ago by bhadra4282 • 0

For example, I have two structures, 1GK9 and 1CRN, in my file (in fact I have around 1200 structures).

In the SIFTS database, the data for an individual PDB entry can be found in a path like this:

ftp://ftp.ebi.ac.uk/pub/databases/msd/sifts/xml/1xyz.xml.gz - where 1xyz is the PDB code (so here 1xyz can be 1crn, 1gk9, etc.),

or in a path like this:

ftp://ftp.ebi.ac.uk/pub/databases/msd/sifts/split_xml/xy/1xyz.xml.gz - where 'xy' are the second and third characters of the PDB code and 1xyz is the PDB code itself (here xy will be cr, gk, etc.).
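The split_xml path rule described above can be sketched in Python as a small helper. The function name `sifts_split_url` is hypothetical, and the sketch assumes classic four-character PDB codes.

```python
BASE = "ftp://ftp.ebi.ac.uk/pub/databases/msd/sifts/split_xml"


def sifts_split_url(pdb_id, base=BASE):
    """Build the split_xml URL for a four-character PDB code."""
    code = pdb_id.strip().lower()
    if len(code) != 4:
        raise ValueError(f"expected a 4-character PDB code, got {pdb_id!r}")
    middle = code[1:3]  # second and third characters, e.g. "cr" for 1crn
    return f"{base}/{middle}/{code}.xml.gz"


for pdb_id in ["1CRN", "1gk9"]:
    print(sifts_split_url(pdb_id))
```

With a list of around 1200 codes, the same function can be applied in a loop to produce the download list.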

Hope this is what you asked for.

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.3 years ago by bhadra4282 • 0

I meant: what would be the expected output for "I would like to parse the ordering of amino acid residues in the ATOM field of PDB structures." for ftp://ftp.ebi.ac.uk/pub/databases/msd/sifts/split_xml/xy/1xyz.xml.gz

ADD REPLY • link 8.3 years ago by Pierre Lindenbaum 161k

The XML file contains information like the following:

<entity type="protein" entityId="A">
    <segment segId="1crn_A_1_46" start="1" end="46">
      <listResidue>
        <residue dbSource="PDBe" dbCoordSys="PDBe" dbResNum="1" dbResName="THR">
          <crossRefDb dbSource="PDB" dbCoordSys="PDBresnum" dbAccessionId="1crn" dbResNum="1" dbResName="THR" dbChainId="A"/>
          <crossRefDb dbSource="UniProt" dbCoordSys="UniProt" dbAccessionId="P01542" dbResNum="1" dbResName="T"/>
          <crossRefDb dbSource="CATH" dbCoordSys="PDBresnum" dbAccessionId="3.30.1350.10" dbResNum="1" dbResName="THR" dbChainId="A"/>
          <crossRefDb dbSource="SCOP" dbCoordSys="PDBresnum" dbAccessionId="44622" dbResNum="1" dbResName="THR" dbChainId="A"/>
          <crossRefDb dbSource="NCBI" dbCoordSys="UniProt" dbAccessionId="3721" dbResNum="1" dbResName="T"/>
          <residueDetail dbSource="PDBe" property="codeSecondaryStructure">T</residueDetail>
          <residueDetail dbSource="PDBe" property="nameSecondaryStructure">loop</residueDetail>
        </residue>

Expected output: I would like to have this in tab- or comma-separated format (columns).

The row name should show the segId (this will be the same for a single chain in a structure), followed by dbResNum, dbResName, etc. (all attributes in the crossRefDb fields) as columns, in order. If any of these does not exist, the value should be "NA".

I think I have now answered your question.
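A minimal Python sketch of the table described above, using only the standard library. It takes segId as the first column, then the residue's own dbResNum and dbResName, then the crossRefDb attributes for each chosen source, with "NA" where an attribute is absent. The choice of sources (PDB and UniProt), the `<source>_<attribute>` column names, and the function names are assumptions for illustration. `SAMPLE` is a trimmed copy of the excerpt above with closing tags added so that it parses.

```python
import csv
import io
import xml.etree.ElementTree as ET

SOURCES = ["PDB", "UniProt"]  # assumed selection; extend with CATH, SCOP, ...
FIELDS = ["dbAccessionId", "dbResNum", "dbResName", "dbChainId"]

SAMPLE = """<entity type="protein" entityId="A">
  <segment segId="1crn_A_1_46" start="1" end="46">
    <listResidue>
      <residue dbSource="PDBe" dbCoordSys="PDBe" dbResNum="1" dbResName="THR">
        <crossRefDb dbSource="PDB" dbCoordSys="PDBresnum" dbAccessionId="1crn" dbResNum="1" dbResName="THR" dbChainId="A"/>
        <crossRefDb dbSource="UniProt" dbCoordSys="UniProt" dbAccessionId="P01542" dbResNum="1" dbResName="T"/>
      </residue>
    </listResidue>
  </segment>
</entity>"""


def local_name(tag):
    """Drop a leading '{namespace}' from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def residue_rows(root):
    """Yield one flat dict per <residue>, using 'NA' for missing values."""
    for segment in root.iter():
        if local_name(segment.tag) != "segment":
            continue
        for residue in segment.iter():
            if local_name(residue.tag) != "residue":
                continue
            row = {"segId": segment.get("segId", "NA"),
                   "dbResNum": residue.get("dbResNum", "NA"),
                   "dbResName": residue.get("dbResName", "NA")}
            # If a source occurs more than once, the last occurrence wins.
            refs = {child.get("dbSource"): child for child in residue
                    if local_name(child.tag) == "crossRefDb"}
            for source in SOURCES:
                ref = refs.get(source)
                for field in FIELDS:
                    # "is not None" matters: childless Elements are falsy.
                    value = ref.get(field) if ref is not None else None
                    row[f"{source}_{field}"] = "NA" if value is None else value
            yield row


def to_tsv(rows):
    """Render the row dicts as tab-separated text with a header line."""
    rows = list(rows)
    out = io.StringIO()
    if rows:
        writer = csv.DictWriter(out, fieldnames=list(rows[0]),
                                delimiter="\t", lineterminator="\n")
        writer.writeheader()
        writer.writerows(rows)
    return out.getvalue()


print(to_tsv(residue_rows(ET.fromstring(SAMPLE))), end="")
```

Passing `delimiter=","` instead would give comma-separated output. In the sample, the UniProt cross-reference has no dbChainId, so that column is filled with "NA".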

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.3 years ago by bhadra4282 • 0