Question: Parsing PDB To Find Taxonomy Information Of Chains
Will (United States) wrote 8.1 years ago:

I'm trying to find examples of crystal structures that represent "cross-species" interactions. A few examples are 2R03, 3DCG, and 1ZLA. My plan is to parse the PDB files and find entries where the "Source" differs between polymers. I know this will produce plenty of false positives, but it should hopefully produce few false negatives.

The problem I'm having is that I can't find the "Source" information in any of the downloadable file formats (in the ftp-data directory).

Does anyone know where this information is stored? I would prefer not to resort to HTML scraping of the search page ... but I will if needed.

Thanks

Khader Shameer (Manhattan, NY) wrote 8.1 years ago:

Taxonomy information for individual protein chains is available from SIFTS. See the file pdb_chain_taxonomy.lst, which starts like this:

PDB   CHAIN  TAX_ID  MOLECULE_TYPE  SCIENTIFIC_NAME
101m  A      9755    PROTEIN        Physeter catodon
102l  A      10665   PROTEIN        Enterobacteria phage T4
102m  A      9755    PROTEIN        Physeter catodon
103d  A              NUCLEIC ACID
103d  B              NUCLEIC ACID
103l  A      10665   PROTEIN        Enterobacteria phage T4
103m  A      9755    PROTEIN        Physeter catodon
104d  A              UNKNOWN
104d  B              UNKNOWN
104l  A      10665   PROTEIN        Enterobacteria phage T4
104l  B      10665   PROTEIN        Enterobacteria phage T4
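
A minimal Python sketch of how a listing like this could be used to flag candidate cross-species entries: group chains by PDB ID and keep the entries whose chains carry more than one distinct TAX_ID. It assumes the file is tab-separated with the header shown above, which may not match the actual SIFTS file exactly. The 3dcg rows are illustrative: the taxonomy IDs are the ones quoted elsewhere in this thread, but the chain letters are placeholders.

```python
import csv
import io
from collections import defaultdict

# Illustrative sample in the layout shown above. Tab separation is an
# assumption; the chain letters in the 3dcg rows are placeholders.
SAMPLE = (
    "PDB\tCHAIN\tTAX_ID\tMOLECULE_TYPE\tSCIENTIFIC_NAME\n"
    "101m\tA\t9755\tPROTEIN\tPhyseter catodon\n"
    "103d\tA\t\tNUCLEIC ACID\t\n"
    "104l\tA\t10665\tPROTEIN\tEnterobacteria phage T4\n"
    "104l\tB\t10665\tPROTEIN\tEnterobacteria phage T4\n"
    "3dcg\tA\t9606\tPROTEIN\tHomo sapiens\n"
    "3dcg\tB\t11698\tPROTEIN\tHuman immunodeficiency virus type 1\n"
)


def taxa_per_entry(handle):
    """Map each PDB ID to the set of taxonomy IDs seen across its chains."""
    taxa = defaultdict(set)
    for row in csv.DictReader(handle, delimiter="\t"):
        tax_id = (row["TAX_ID"] or "").strip()
        if tax_id:  # skip chains with no taxonomy assigned
            taxa[row["PDB"].strip().lower()].add(tax_id)
    return taxa


def cross_species(taxa):
    """Return the PDB IDs whose chains span more than one taxonomy ID."""
    return sorted(pdb for pdb, ids in taxa.items() if len(ids) > 1)


taxa = taxa_per_entry(io.StringIO(SAMPLE))
candidates = cross_species(taxa)
print(candidates)
```

For a real file, `taxa_per_entry(open(path, newline=""))` would take the place of the `io.StringIO` wrapper. Chains without a TAX_ID are ignored, so protein-nucleic acid complexes are not flagged merely because the nucleic acid has no taxonomy.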
Leonor Palmeira (Liège, Belgium) wrote 8.1 years ago:

After briefly searching the PDB FTP site, I found that the gzipped files in the following directory contain SOURCE lines giving the organism:

ftp://ftp.wwpdb.org/pub/pdb/data/structures/divided/pdb/

These gzipped files can also be found here: ftp://ftp.wwpdb.org/pub/pdb/data/structures/all/pdb/
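
A rough Python sketch of reading those SOURCE lines, assuming the standard PDB layout in which the text starts at column 11 and consists of `KEY: value;` pairs grouped under `MOL_ID` tokens. The sample records are illustrative, built from the organism names and taxonomy IDs quoted elsewhere in this thread, not copied from a real file.

```python
# Illustrative SOURCE records in PDB format (not copied from a real entry).
SAMPLE_PDB = """\
SOURCE    MOL_ID: 1;
SOURCE   2 ORGANISM_SCIENTIFIC: HOMO SAPIENS;
SOURCE   3 ORGANISM_TAXID: 9606;
SOURCE   4 MOL_ID: 2;
SOURCE   5 ORGANISM_SCIENTIFIC: HUMAN IMMUNODEFICIENCY VIRUS TYPE 1;
SOURCE   6 ORGANISM_TAXID: 11698;
"""


def parse_source(lines):
    """Collect the 'KEY: value' pairs of SOURCE records, grouped by MOL_ID."""
    # Columns 1-10 hold the record name and continuation number.
    text = " ".join(
        line[10:].strip() for line in lines if line.startswith("SOURCE")
    )
    molecules = {}
    current = None
    # Simplification: escaped semicolons inside values are not handled.
    for token in text.split(";"):
        key, sep, value = token.partition(":")
        if not sep:
            continue  # empty or malformed token
        key, value = key.strip(), value.strip()
        if key == "MOL_ID":
            current = molecules.setdefault(value, {})
        elif current is not None:
            current[key] = value
    return molecules


molecules = parse_source(SAMPLE_PDB.splitlines())
tax_ids = {m["ORGANISM_TAXID"] for m in molecules.values() if "ORGANISM_TAXID" in m}
is_cross_species = len(tax_ids) > 1
print(molecules, is_cross_species)
```

For the gzipped files themselves, the lines could be read with `gzip.open(path, "rt")` and passed to `parse_source` directly. Older entries may lack the `ORGANISM_TAXID` token, in which case comparing `ORGANISM_SCIENTIFIC` values is a possible fallback.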

Does this help?

Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote 8.1 years ago:

The following XSLT stylesheet extracts the genus, scientific name, and NCBI taxonomy ID of each source organism from a PDB XML record:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:PDBx="http://pdbml.pdb.org/schema/pdbx-v40.xsd"
    xmlns="http://pdbml.pdb.org/schema/pdbx-v40.xsd"
    >
<xsl:output method="text"/>

<xsl:template match="/">
<xsl:variable name="name" select="PDBx:datablock/@datablockName"/>
<!-- Header line; columns are tab-separated (&#9;) and rows end with a newline (&#10;) -->
<xsl:text>#name&#9;genus&#9;scientific_name&#9;taxonid&#10;</xsl:text>
<!-- One output row per entity_src_gen record -->
<xsl:for-each select="/PDBx:datablock/PDBx:entity_src_genCategory/PDBx:entity_src_gen">
    <xsl:value-of select="$name"/>
    <xsl:text>&#9;</xsl:text>
    <xsl:value-of select="PDBx:gene_src_genus"/>
    <xsl:text>&#9;</xsl:text>
    <xsl:value-of select="PDBx:pdbx_gene_src_scientific_name"/>
    <xsl:text>&#9;</xsl:text>
    <xsl:value-of select="PDBx:pdbx_gene_src_ncbi_taxonomy_id"/>
    <xsl:text>&#10;</xsl:text>
</xsl:for-each>
</xsl:template>

</xsl:stylesheet>

For example, with 3DCG:

$ xsltproc stylesheet.xsl "http://www.pdb.org/pdb/download/downloadFile.do?fileFormat=xml&compression=NO&structureId=3DCG"
#name   genus   scientific_name taxonid
3DCG        Homo sapiens    9606
3DCG        Homo sapiens    9606
3DCG        Human immunodeficiency virus type 1 (NEW YORK-5 ISOLATE)    11698
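
For anyone who would rather stay in a scripting language, the same extraction can be sketched in Python with the standard-library `xml.etree.ElementTree`. The namespace URI and element names are taken from the stylesheet above. The sample XML is a hand-trimmed illustration holding only the fields of interest, not a real PDBML download.

```python
import xml.etree.ElementTree as ET

# Namespace used by the stylesheet above.
NS = {"PDBx": "http://pdbml.pdb.org/schema/pdbx-v40.xsd"}

# Hand-trimmed illustration of a PDBML record (not a real download).
SAMPLE_XML = """\
<PDBx:datablock xmlns:PDBx="http://pdbml.pdb.org/schema/pdbx-v40.xsd"
                datablockName="3DCG">
  <PDBx:entity_src_genCategory>
    <PDBx:entity_src_gen>
      <PDBx:pdbx_gene_src_scientific_name>Homo sapiens</PDBx:pdbx_gene_src_scientific_name>
      <PDBx:pdbx_gene_src_ncbi_taxonomy_id>9606</PDBx:pdbx_gene_src_ncbi_taxonomy_id>
    </PDBx:entity_src_gen>
    <PDBx:entity_src_gen>
      <PDBx:pdbx_gene_src_scientific_name>Human immunodeficiency virus type 1 (NEW YORK-5 ISOLATE)</PDBx:pdbx_gene_src_scientific_name>
      <PDBx:pdbx_gene_src_ncbi_taxonomy_id>11698</PDBx:pdbx_gene_src_ncbi_taxonomy_id>
    </PDBx:entity_src_gen>
  </PDBx:entity_src_genCategory>
</PDBx:datablock>
"""


def source_rows(xml_text):
    """Return (entry name, scientific name, taxonomy ID) per source record."""
    root = ET.fromstring(xml_text)
    name = root.get("datablockName")
    rows = []
    path = "PDBx:entity_src_genCategory/PDBx:entity_src_gen"
    for src in root.findall(path, NS):
        sci = src.findtext("PDBx:pdbx_gene_src_scientific_name", default="", namespaces=NS)
        tax = src.findtext("PDBx:pdbx_gene_src_ncbi_taxonomy_id", default="", namespaces=NS)
        rows.append((name, sci, tax))
    return rows


rows = source_rows(SAMPLE_XML)
distinct_taxa = {tax for _, _, tax in rows if tax}
print(rows, len(distinct_taxa) > 1)
```

An entry is a cross-species candidate when `distinct_taxa` holds more than one ID. Like the stylesheet, this sketch only reads `entity_src_gen` records, so sources recorded under other categories would be missed.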