Parsing PDB To Find Taxonomy Information Of Chains
12.8 years ago
Will 4.5k

I'm trying to find examples of crystal structures where "cross-species" interactions are being represented. A few examples are 2R03, 3DCG, 1ZLA. My thought was to parse the PDB files and find instances where the "Source" of each polymer is different. I know this will produce plenty of false-positives but it should hopefully be good against false-negatives.

The problem I'm having is that I can't find the "Source" information in any of the downloadable file formats (in the ftp-data directory).

Does anyone know where this information is stored? I would prefer not to resort to HTML scraping of the search page ... but I will if needed.

Thanks

12.8 years ago

Chain-level taxonomy information for each PDB entry is available from SIFTS.

Please check the file pdb_chain_taxonomy.lst:

PDB   CHAIN  TAX_ID  MOLECULE_TYPE  SCIENTIFIC_NAME
101m  A      9755    PROTEIN        Physeter catodon
102l  A      10665   PROTEIN        Enterobacteria phage T4
102m  A      9755    PROTEIN        Physeter catodon
103d  A              NUCLEIC ACID
103d  B              NUCLEIC ACID
103l  A      10665   PROTEIN        Enterobacteria phage T4
103m  A      9755    PROTEIN        Physeter catodon
104d  A              UNKNOWN
104d  B              UNKNOWN
104l  A      10665   PROTEIN        Enterobacteria phage T4
104l  B      10665   PROTEIN        Enterobacteria phage T4
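
To get from this listing to candidate cross-species entries, one option is to group chains by PDB ID and keep the IDs with more than one distinct TAX_ID. Below is a minimal Python sketch of that idea. It assumes the file is tab-separated with the column names shown above and that any leading comment lines start with "#"; check both against the file you actually download. The "xxxx" entry in the sample is made up for illustration.

```python
import csv
import io
from collections import defaultdict

# Sample rows in the layout shown above, assumed to be tab-separated.
# The "xxxx" entry is a made-up placeholder, not a real PDB ID.
SAMPLE = (
    "PDB\tCHAIN\tTAX_ID\tMOLECULE_TYPE\tSCIENTIFIC_NAME\n"
    "101m\tA\t9755\tPROTEIN\tPhyseter catodon\n"
    "103d\tA\t\tNUCLEIC ACID\t\n"
    "104l\tA\t10665\tPROTEIN\tEnterobacteria phage T4\n"
    "104l\tB\t10665\tPROTEIN\tEnterobacteria phage T4\n"
    "xxxx\tA\t9606\tPROTEIN\tHomo sapiens\n"
    "xxxx\tB\t11698\tPROTEIN\tHuman immunodeficiency virus type 1\n"
)


def multi_species_entries(handle):
    """Return {pdb_id: set of tax IDs} for entries with more than one tax ID."""
    taxa = defaultdict(set)
    # Assumption: comment lines, if present, start with '#'.
    lines = (line for line in handle if not line.startswith("#"))
    for row in csv.DictReader(lines, delimiter="\t"):
        tax_id = (row.get("TAX_ID") or "").strip()
        if tax_id:  # chains without a tax ID (e.g. nucleic acids) are skipped
            taxa[row["PDB"].strip()].add(tax_id)
    return {pdb: ids for pdb, ids in taxa.items() if len(ids) > 1}


result = multi_species_entries(io.StringIO(SAMPLE))
print(result)
```

For the real file, open it with `open(path, newline="")` and pass the handle in place of the `io.StringIO` sample. As noted in the question, this will still report false positives (for example, expression tags or fusion partners from another organism).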
12.8 years ago

A brief search of the PDB FTP site shows that the gzipped files in the following directory contain SOURCE records that name the organism:

ftp://ftp.wwpdb.org/pub/pdb/data/structures/divided/pdb/

These gzipped files can also be found here: ftp://ftp.wwpdb.org/pub/pdb/data/structures/all/pdb/

Does this help?
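
If you go this route, here is a rough Python sketch for pulling the taxonomy IDs out of the SOURCE records. It assumes the legacy PDB layout in which SOURCE lines carry `ORGANISM_TAXID: <id>;` tokens; older or unremediated entries may not have them. The sample records are made up for illustration, and `taxids_from_gz` is a hypothetical helper name.

```python
import gzip
import re

# Made-up SOURCE records in the legacy PDB layout (not copied from a real entry).
SAMPLE = """\
SOURCE    MOL_ID: 1;
SOURCE   2 ORGANISM_SCIENTIFIC: HOMO SAPIENS;
SOURCE   3 ORGANISM_TAXID: 9606;
SOURCE   4 MOL_ID: 2;
SOURCE   5 ORGANISM_SCIENTIFIC: HUMAN IMMUNODEFICIENCY VIRUS 1;
SOURCE   6 ORGANISM_TAXID: 11698;
ATOM      1  N   MET A   1      0.000   0.000   0.000  1.00  0.00           N
"""

TAXID_RE = re.compile(r"ORGANISM_TAXID:\s*(\d+)")


def source_taxids(lines):
    """Collect the distinct ORGANISM_TAXID values found on SOURCE records."""
    taxids = set()
    for line in lines:
        if line.startswith("SOURCE"):
            match = TAXID_RE.search(line)
            if match:
                taxids.add(match.group(1))
    return taxids


def taxids_from_gz(path):
    """Hypothetical helper: same extraction, reading a gzipped PDB file."""
    with gzip.open(path, "rt") as handle:
        return source_taxids(handle)


taxids = source_taxids(SAMPLE.splitlines())
print(taxids, "cross-species candidate" if len(taxids) > 1 else "single source")
```

Note that this only tells you that an entry has more than one source organism; mapping each organism to specific chains would additionally need the COMPND records, which link MOL_ID to chain IDs.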

12.8 years ago

The following XSLT stylesheet extracts the source-organism fields (genus, scientific name and NCBI taxonomy ID) from a PDB XML record:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:PDBx="http://pdbml.pdb.org/schema/pdbx-v40.xsd"
    xmlns="http://pdbml.pdb.org/schema/pdbx-v40.xsd"
    >
<xsl:output method="text"/>

<xsl:template match="/">
<xsl:variable name="name" select="PDBx:datablock/@datablockName"/>
<xsl:text>#name genus   scientific_name taxonid
</xsl:text>
 <xsl:for-each select="/PDBx:datablock/PDBx:entity_src_genCategory/PDBx:entity_src_gen">
    <xsl:value-of select="$name"/>
        <xsl:text>  </xsl:text>
        <xsl:value-of select="PDBx:gene_src_genus"/>
        <xsl:text>  </xsl:text>
        <xsl:value-of select="PDBx:pdbx_gene_src_scientific_name"/>
        <xsl:text>  </xsl:text>
        <xsl:value-of select="PDBx:pdbx_gene_src_ncbi_taxonomy_id"/>
        <xsl:text>
</xsl:text>
    </xsl:for-each>
</xsl:template>

</xsl:stylesheet>

For example, with 3DCG:

$ xsltproc stylesheet.xsl "http://www.pdb.org/pdb/download/downloadFile.do?fileFormat=xml&compression=NO&structureId=3DCG"
#name   genus   scientific_name taxonid
3DCG        Homo sapiens    9606
3DCG        Homo sapiens    9606
3DCG        Human immunodeficiency virus type 1 (NEW YORK-5 ISOLATE)    11698
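
If you would rather stay in a scripting language, roughly the same extraction can be sketched in Python with the standard library. This assumes the same namespace URI and element names as the stylesheet above (other schema versions may use a different URI). The XML snippet is a made-up, stripped-down record, and "XXXX" is a placeholder ID.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the stylesheet above; other schema versions may differ.
NS = {"PDBx": "http://pdbml.pdb.org/schema/pdbx-v40.xsd"}

# Made-up, minimal record for illustration only ("XXXX" is a placeholder ID).
SAMPLE = """\
<PDBx:datablock xmlns:PDBx="http://pdbml.pdb.org/schema/pdbx-v40.xsd"
                datablockName="XXXX">
  <PDBx:entity_src_genCategory>
    <PDBx:entity_src_gen>
      <PDBx:pdbx_gene_src_scientific_name>Homo sapiens</PDBx:pdbx_gene_src_scientific_name>
      <PDBx:pdbx_gene_src_ncbi_taxonomy_id>9606</PDBx:pdbx_gene_src_ncbi_taxonomy_id>
    </PDBx:entity_src_gen>
    <PDBx:entity_src_gen>
      <PDBx:pdbx_gene_src_scientific_name>Human immunodeficiency virus 1</PDBx:pdbx_gene_src_scientific_name>
      <PDBx:pdbx_gene_src_ncbi_taxonomy_id>11698</PDBx:pdbx_gene_src_ncbi_taxonomy_id>
    </PDBx:entity_src_gen>
  </PDBx:entity_src_genCategory>
</PDBx:datablock>
"""


def source_organisms(xml_text):
    """Return (entry name, scientific name, taxonomy ID) for each source entity."""
    root = ET.fromstring(xml_text)  # the root element is PDBx:datablock
    name = root.get("datablockName")
    rows = []
    for src in root.findall("PDBx:entity_src_genCategory/PDBx:entity_src_gen", NS):
        sci = src.findtext("PDBx:pdbx_gene_src_scientific_name", default="", namespaces=NS)
        tax = src.findtext("PDBx:pdbx_gene_src_ncbi_taxonomy_id", default="", namespaces=NS)
        rows.append((name, sci, tax))
    return rows


rows = source_organisms(SAMPLE)
for row in rows:
    print("\t".join(row))

# An entry is a cross-species candidate if it lists more than one taxonomy ID.
is_candidate = len({tax for _, _, tax in rows if tax}) > 1
```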