Question: Parsing residue numbers from PDB SIFTS XML files
3.5 years ago, bhadra4282 wrote:

Dear All,

I have a list of UniProt IDs. Based on these IDs, I would like to parse the ordering of amino acid residues in the ATOM field of PDB structures. However, the residue numbers in the ATOM field do not always match the order of residues in the corresponding ResSeq numbers. After searching Biostars, I found a post about the SIFTS database.

But the residue number information in the SIFTS database is stored in xml.gz files, and I don't know how to read these files using either R or Python.

I tried some solutions from Biostars itself, but they don't work in my case. I would like to give UniProt IDs (or PDB IDs) one by one and parse the XML files to get the residue numbers in the PDB and the corresponding residue numbers in the ResSeq field.

I would appreciate suggestions from both R and Python experts, because I would like to know both approaches.

Link to the SIFTS database: https://www.ebi.ac.uk/pdbe/docs/sifts/quick.html

The XML file repository is at: ftp://ftp.ebi.ac.uk/pub/databases/msd/sifts/split_xml/

Thank you in advance.

 

python R sfits xml pdb • 1.5k views
modified 2.8 years ago by Biostar • written 3.5 years ago by bhadra4282

Give us an example, please.

written 3.5 years ago by Pierre Lindenbaum

"I tried some solutions from Biostars itself. But they don't work in my case.": what have you tried?

written 3.5 years ago by Pierre Lindenbaum

tmpdir <- tempdir()

url <- 'ftp://ftp.ebi.ac.uk/pub/databases/msd/sifts/split_xml/cr/1crn.xml.gz'
file <- basename(url)
download.file(url, file)

untar(file, compressed = 'gzip', exdir = tmpdir )
list.files(tmpdir)

This is the R code I tried, but it gives an error.

 

written 3.5 years ago by bhadra4282

As for Python, I don't know which module would suit me. I have never parsed XML files. I found Beautiful Soup and first tried its examples to learn it. But above all, I don't know how to read these files one by one. Any help would be great.

written 3.5 years ago by bhadra4282
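
A minimal Python sketch for reading one of these files, assuming it has already been downloaded to disk. The .xml.gz files appear to be single gzip-compressed XML files rather than tar archives (which would presumably explain why untar() fails in the R attempt above), so the standard-library gzip and xml.etree.ElementTree modules should be enough. The names read_sifts_xml and local_name are hypothetical helpers, not part of any library.

```python
import gzip
import xml.etree.ElementTree as ET


def read_sifts_xml(path):
    """Parse a gzip-compressed XML file and return its root element."""
    # gzip.open decompresses on the fly; ET.parse accepts a binary file object
    with gzip.open(path, "rb") as handle:
        return ET.parse(handle).getroot()


def local_name(tag):
    """Drop a '{namespace}' prefix from an element tag, if one is present."""
    return tag.rsplit("}", 1)[-1]
```

Usage would look something like root = read_sifts_xml("1crn.xml.gz"), followed by [local_name(child.tag) for child in root] to see the top-level elements. Stripping the namespace prefix is a design choice that keeps the tag comparisons short if the real files declare an XML namespace.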

For example, I have two structures, 1GK9 and 1CRN, in my file (in fact I have around 1200 structures).

In the SIFTS database they can be found as follows.

Individual PDB entry data can be found either in a path like this:
ftp://ftp.ebi.ac.uk/pub/databases/msd/sifts/xml/1xyz.xml.gz - where 1xyz is the PDB code (so here 1xyz can be 1crn, 1gk9, etc.)
or in a path like this:
ftp://ftp.ebi.ac.uk/pub/databases/msd/sifts/split_xml/xy/1xyz.xml.gz - where 'xy' are the second and third characters of the PDB code and 1xyz is the PDB code itself (here xy will be cr, gk, etc.).

Hope this is what you asked for.

written 3.5 years ago by bhadra4282
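
The split_xml path rule described above can be sketched in Python as follows. This is only an illustration: sifts_url is a hypothetical helper name, and lowercasing the PDB code is an assumption based on the lowercase examples in this thread.

```python
BASE_URL = "ftp://ftp.ebi.ac.uk/pub/databases/msd/sifts/split_xml"


def sifts_url(pdb_id):
    """Build the split_xml URL for a four-character PDB code."""
    # assumption: file names on the server use lowercase PDB codes
    code = pdb_id.strip().lower()
    if len(code) != 4:
        raise ValueError(f"expected a 4-character PDB code, got {pdb_id!r}")
    # the subdirectory is the second and third characters, e.g. 'cr' for 1crn
    return f"{BASE_URL}/{code[1:3]}/{code}.xml.gz"


for pdb_id in ["1GK9", "1CRN"]:
    print(sifts_url(pdb_id))
```

Looping over the full list of around 1200 codes would work the same way, with each URL handed to whatever download tool is preferred.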

I meant: what would be the expected output for "I would like to parse the ordering of amino acid residues in the ATOM field of PDB structures." for ftp://ftp.ebi.ac.uk/pub/databases/msd/sifts/split_xml/xy/1xyz.xml.gz?

modified 3.5 years ago • written 3.5 years ago by Pierre Lindenbaum

The XML file contains information like the following:

<entity type="protein" entityId="A">
    <segment segId="1crn_A_1_46" start="1" end="46">
      <listResidue>
        <residue dbSource="PDBe" dbCoordSys="PDBe" dbResNum="1" dbResName="THR">
          <crossRefDb dbSource="PDB" dbCoordSys="PDBresnum" dbAccessionId="1crn" dbResNum="1" dbResName="THR" dbChainId="A"/>
          <crossRefDb dbSource="UniProt" dbCoordSys="UniProt" dbAccessionId="P01542" dbResNum="1" dbResName="T"/>
          <crossRefDb dbSource="CATH" dbCoordSys="PDBresnum" dbAccessionId="3.30.1350.10" dbResNum="1" dbResName="THR" dbChainId="A"/>
          <crossRefDb dbSource="SCOP" dbCoordSys="PDBresnum" dbAccessionId="44622" dbResNum="1" dbResName="THR" dbChainId="A"/>
          <crossRefDb dbSource="NCBI" dbCoordSys="UniProt" dbAccessionId="3721" dbResNum="1" dbResName="T"/>
          <residueDetail dbSource="PDBe" property="codeSecondaryStructure">T</residueDetail>
          <residueDetail dbSource="PDBe" property="nameSecondaryStructure">loop</residueDetail>
        </residue>

Expected output: I would like to have them in tab- or comma-separated format (columns).

The row name should show the segId (this will be the same for a single chain in a structure), followed by dbResNum, dbResName, etc. (all attributes in the crossRefDb fields) as columns, in order. If any of these does not exist, the value should be "NA".

I think I have now answered your question.

written 3.5 years ago by bhadra4282
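
One possible way to produce the table described above, sketched in Python with the standard library only. The sample is a shortened copy of the 1crn excerpt, closed so that it is well-formed. The column naming scheme (source name plus attribute name), the choice of PDB and UniProt as sources, and the helper names are all assumptions made for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Shortened copy of the 1crn excerpt, with closing tags added
SAMPLE = """<entity type="protein" entityId="A">
  <segment segId="1crn_A_1_46" start="1" end="46">
    <listResidue>
      <residue dbSource="PDBe" dbCoordSys="PDBe" dbResNum="1" dbResName="THR">
        <crossRefDb dbSource="PDB" dbCoordSys="PDBresnum" dbAccessionId="1crn" dbResNum="1" dbResName="THR" dbChainId="A"/>
        <crossRefDb dbSource="UniProt" dbCoordSys="UniProt" dbAccessionId="P01542" dbResNum="1" dbResName="T"/>
      </residue>
    </listResidue>
  </segment>
</entity>"""

SOURCES = ["PDB", "UniProt"]  # which crossRefDb entries become columns
FIELDS = ["dbAccessionId", "dbResNum", "dbResName", "dbChainId"]


def local_name(tag):
    """Drop a '{namespace}' prefix from an element tag, if one is present."""
    return tag.rsplit("}", 1)[-1]


def residue_rows(root):
    """Yield one dict per <residue>, using 'NA' for anything missing."""
    for segment in root.iter():
        if local_name(segment.tag) != "segment":
            continue
        for residue in segment.iter():
            if local_name(residue.tag) != "residue":
                continue
            row = {"segId": segment.get("segId", "NA"),
                   "PDBe_dbResNum": residue.get("dbResNum", "NA"),
                   "PDBe_dbResName": residue.get("dbResName", "NA")}
            # if a source occurs more than once, the last occurrence wins
            refs = {child.get("dbSource"): child for child in residue
                    if local_name(child.tag) == "crossRefDb"}
            for source in SOURCES:
                ref = refs.get(source)
                for field in FIELDS:
                    value = "NA" if ref is None else ref.get(field, "NA")
                    row[f"{source}_{field}"] = value
            yield row


rows = list(residue_rows(ET.fromstring(SAMPLE)))
out = io.StringIO()
# delimiter="\t" gives tab-separated output; use "," for CSV instead
writer = csv.DictWriter(out, fieldnames=list(rows[0]), delimiter="\t")
writer.writeheader()
writer.writerows(rows)
print(out.getvalue())
```

For the sample, the UniProt cross-reference has no dbChainId attribute, so that column holds "NA". Adding more entries to SOURCES (for example "CATH" or "SCOP") would add the corresponding columns.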