Fetching Domain Summary Information From Pfam In Bulk
13.2 years ago
Ananth ▴ 70

Hi,

I'm interested in fetching the domain summary information from Pfam for a large number of protein families (around 3000). I am particularly looking for the organism-related information present in the summary. For example, DUF1013 is present in Proteobacteria:

http://pfam.sanger.ac.uk/family?acc=PF06242

As I need this information for a large number of families, is there a way to download it from Pfam in bulk?

domain • 4.7k views
13.2 years ago

Did you try looking at the documentation of their REST API? A simple query like this can retrieve the information you want for one domain in a simple XML format:

http://pfam.sanger.ac.uk/family/PF06242?output=xml

All you have to do is construct a list of such URLs for all your domains of interest and use curl or wget to fetch them.
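As a minimal sketch of that approach, the functions below build one URL per accession and fetch each with curl. The URL pattern is the one quoted above and is assumed to still be served; the function names, the one-accession-per-line input file, and the one-second pause are illustrative assumptions, not part of the Pfam documentation.

```shell
#!/bin/sh
# Base URL copied from the answer above; assumed to still be valid.
BASE_URL="http://pfam.sanger.ac.uk/family"

# Print the XML URL for a single Pfam accession (hypothetical helper).
pfam_url() {
    printf '%s/%s?output=xml\n' "$BASE_URL" "$1"
}

# Read accessions from stdin, one per line, and print one URL per line.
# Blank lines are skipped.
pfam_urls() {
    while IFS= read -r acc; do
        if [ -n "$acc" ]; then
            pfam_url "$acc"
        fi
    done
}

# Fetch every accession listed in file "$1" into directory "$2",
# saving each response as <accession>.xml.
fetch_all() {
    mkdir -p "$2"
    pfam_urls < "$1" | while IFS= read -r url; do
        acc=${url##*/}      # strip everything up to the last slash
        acc=${acc%%\?*}     # strip the query string
        curl -sS -o "$2/$acc.xml" "$url"
        sleep 1             # assumed pause, to avoid hammering the server
    done
}
```

With the accessions in a file such as accessions.txt (a hypothetical name), `fetch_all accessions.txt pfam_xml` would leave one XML file per family in pfam_xml/.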


Thank you very much.

13.2 years ago

To complete Lars' answer, you can use the following XSLT stylesheet to extract the information you need directly:


<xsl:stylesheet xmlns:xsl='http://www.w3.org/1999/XSL/Transform'
   xmlns:p="http://pfam.sanger.ac.uk/"
   version='1.0'
   >
<xsl:output method="text" encoding="UTF-8" indent="yes"/>

<xsl:template match="/">
<xsl:for-each select="p:pfam/p:entry">
<xsl:value-of select="@accession"/>
<xsl:text>  </xsl:text>
<xsl:value-of select="normalize-space(p:comment)"/>
<xsl:text>
</xsl:text>
</xsl:for-each>
</xsl:template>

</xsl:stylesheet>

For example:

for F in PF06141 PF03141 PF06242; do xsltproc stylesheet.xsl "http://pfam.sanger.ac.uk/family/${F}?output=xml"; done
PF06141  Tail fibre component U of bacteriophage.
PF03141  Members of this family of hypothetical plant proteins are probably methyltransferases: several of the aligned sequences either match the methyltransferase profile Profile:PS50124, or contain a SAM-binding motif Profile:PS50193. Swiss:Q9ZQ84 contains both. Several family members are described as ankyrin like.
PF06242  Family of uncharacterised proteins found in Proteobacteria.
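
For a few thousand families it may be more convenient to run the stylesheet over XML files that have already been saved locally rather than over live URLs. The sketch below assumes one file per family in a single directory; the function name, the `XSLTPROC` override variable, and the directory layout are illustrative assumptions.

```shell
#!/bin/sh
# Command used to run the stylesheet; set XSLTPROC to use another
# XSLT 1.0 processor or a non-standard install path.
XSLTPROC=${XSLTPROC:-xsltproc}

# Apply stylesheet "$1" to every *.xml file in directory "$2" and print
# the combined text output.
summarize_dir() {
    for xml in "$2"/*.xml; do
        # Skip the unexpanded glob when the directory has no XML files.
        [ -e "$xml" ] || continue
        "$XSLTPROC" "$1" "$xml"
    done
}
```

For instance, `summarize_dir stylesheet.xsl pfam_xml > summary.txt` would collect one line per family, in the same format as the output above.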