8.2 years ago
wilmijntje • 0

I downloaded UniProt files for a group of proteins (n>1000, so checking these proteins manually is not an option). The complete data files come as either a flat text file or an XML file. These files contain a lot of information (for an example, see here: http://www.uniprot.org/uniprot/?query=organism%3A%22homo+sapiens%22 then go to download, where you can look at the complete data for the first 10 entries as either a txt or an XML file).

Since there is a lot of information in there that I do not need, I have to find a way to select the information I'm interested in (preferably as a data matrix). For every entry this is:

Wanted information:                 Text file entry:    XML file entry:
Uniprot ID                          ID                  <entry><name>
Gene Name                           GN                  <gene><name type="primary">
Full protein name                   RecName:            <protein><recommendedName><fullName>
Transmembrane domains (may be more) TRANSMEM            <feature type="transmembrane region"><location> This consists of <begin position="xxx"/> and <end position="yyy"/>
Full protein sequence               SQ                  <sequence>


Some entries will not contain all of this information (transmembrane domains, for example), in which case an NA could be filled in. Some entries will contain the same kind of information more than once (again, transmembrane domains are an example); for these, all values should be listed (if possible in the same cell, separated by "," or ";" or "|").
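
For readers who prefer a script, the extraction described above can be sketched with Python's standard `xml.etree.ElementTree` module. This is only a minimal sketch, not a definitive implementation: the element paths and the `http://uniprot.org/uniprot` namespace are taken from this thread, while the sample entries (`DEMO1_HUMAN`, `DEMO2_HUMAN`) and the helper names `join_or_na`, `entry_to_row` and `uniprot_xml_to_rows` are made up for illustration. Missing fields become "NA" and repeated fields are joined with ";".

```python
import xml.etree.ElementTree as ET

# Namespace used by the XSLT answer in this thread (assumed to match the download).
NS = {"u": "http://uniprot.org/uniprot"}

# Hypothetical two-entry sample; real files contain many more elements per entry.
SAMPLE = """<uniprot xmlns="http://uniprot.org/uniprot">
  <entry>
    <name>DEMO1_HUMAN</name>
    <protein><recommendedName><fullName>Demo protein 1</fullName></recommendedName></protein>
    <gene><name type="primary">DEMO1</name><name type="synonym">DM1</name></gene>
    <feature type="transmembrane region">
      <location><begin position="5"/><end position="25"/></location>
    </feature>
    <feature type="transmembrane region">
      <location><begin position="40"/><end position="60"/></location>
    </feature>
    <sequence length="6">MKT
LLV</sequence>
  </entry>
  <entry>
    <name>DEMO2_HUMAN</name>
    <sequence length="3">MAA</sequence>
  </entry>
</uniprot>"""


def join_or_na(values, sep=";"):
    """Join all non-empty values in one cell, or return NA if there are none."""
    values = [v for v in values if v]
    return sep.join(values) if values else "NA"


def entry_to_row(entry):
    """Turn one <entry> element into [ID, gene, full name, TM regions, sequence]."""
    ids = [e.text for e in entry.findall("u:name", NS)]
    genes = [e.text for e in entry.findall("u:gene/u:name[@type='primary']", NS)]
    names = [e.text for e in
             entry.findall("u:protein/u:recommendedName/u:fullName", NS)]
    tms = []
    for loc in entry.findall(
            "u:feature[@type='transmembrane region']/u:location", NS):
        begin, end = loc.find("u:begin", NS), loc.find("u:end", NS)
        if begin is not None and end is not None:
            tms.append(f"{begin.get('position')}-{end.get('position')}")
    # Only the direct <sequence> child of the entry; strip line breaks inside it.
    seqs = ["".join(e.text.split())
            for e in entry.findall("u:sequence", NS) if e.text]
    return [join_or_na(ids), join_or_na(genes), join_or_na(names),
            join_or_na(tms), join_or_na(seqs)]


def uniprot_xml_to_rows(xml_text):
    """Return one row per <entry> in the document."""
    root = ET.fromstring(xml_text)
    return [entry_to_row(entry) for entry in root.findall("u:entry", NS)]


rows = uniprot_xml_to_rows(SAMPLE)
for row in rows:
    print("\t".join(row))
```

For a real download, `ET.parse("file.xml").getroot()` could replace `ET.fromstring`, and the printed tab-separated lines can be read into R with `read.delim`.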

I am a bit familiar with R, but I wasn't able to get to this point with it (this might be down to a lack of programming skills). I looked into XML editors (since this seemed to be the easiest solution), but I wasn't able to get any of them to work; I simply couldn't find anything that helped me on my way and explained the different steps. I also know that there should be a way to process XML files in R, but the help files there didn't get me where I need to be either. In XMLQuire, the only editor I could download so far, I'm able to see the file, but it keeps crashing when I want to do anything (even when I'm just trying to figure out where I can edit the file), so my file might be too long or there's another problem.

Help with this matter would be highly appreciated. I'm hoping to find someone who has done something similar, but all solutions are welcome, no matter how small and no matter which program I need to use, as long as it's freeware.

Also let me know if anything is unclear; I have tried to be as clear as possible. And sorry for being such a blondie on the subject.

8.2 years ago

Use XSLT to transform the XML into what you need, e.g.:


<xsl:stylesheet xmlns:xsl='http://www.w3.org/1999/XSL/Transform'
xmlns:u="http://uniprot.org/uniprot"
version='1.0'
>

<xsl:output method="text" encoding="UTF-8"/>
<xsl:param name="temporary">temporary</xsl:param>

<xsl:template match="/">
<xsl:text>Wanted information    Text file entry    XML file entry:
</xsl:text>
<xsl:apply-templates select="u:uniprot"/>
</xsl:template>

<xsl:template match="u:uniprot">
<xsl:apply-templates select="u:entry"/>
</xsl:template>

<xsl:template match="u:entry">
<xsl:apply-templates select="u:name"/>
<xsl:apply-templates select="u:gene"/>
<xsl:apply-templates select="u:protein"/>
<xsl:text>//
</xsl:text>
</xsl:template>

<xsl:template match="u:gene">
<xsl:for-each select="u:name">
<xsl:text>Gene Name    GN    </xsl:text>
<xsl:value-of select="."/>
<xsl:text>
</xsl:text>
</xsl:for-each>
</xsl:template>

<xsl:template match="u:name">
<xsl:text>Uniprot ID    ID    </xsl:text>
<xsl:value-of select="."/>
<xsl:text>
</xsl:text>
</xsl:template>

<xsl:template match="u:protein">
<xsl:for-each select="u:alternativeName/u:fullName">
<xsl:text>Full protein name    RecName    </xsl:text>
<xsl:value-of select="."/>
<xsl:text>
</xsl:text>
</xsl:for-each>
</xsl:template>

</xsl:stylesheet>


e.g.:

$ xsltproc stylesheet.xsl entry.xml

Wanted information    Text file entry    XML file entry:
Uniprot ID    ID    AKT1_HUMAN
Gene Name    GN    AKT1
Gene Name    GN    PKB
Gene Name    GN    RAC
Full protein name    RecName    Protein kinase B
Full protein name    RecName    Protein kinase B alpha
Full protein name    RecName    Proto-oncogene c-Akt
Full protein name    RecName    RAC-PK-alpha
//
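
To get from this line-oriented output to one row per entry, with repeated values sharing a cell, something like the following Python sketch might work. It assumes the three fields on each line are separated by tabs or by runs of two or more spaces (as displayed above) and that every record ends with `//`; the function name `records_to_rows` and the choice of ";" as the separator are illustrative only.

```python
import re

# The output shown above, copied verbatim as sample input.
SAMPLE_OUTPUT = """Wanted information    Text file entry    XML file entry:
Uniprot ID    ID    AKT1_HUMAN
Gene Name    GN    AKT1
Gene Name    GN    PKB
Gene Name    GN    RAC
Full protein name    RecName    Protein kinase B
Full protein name    RecName    Protein kinase B alpha
Full protein name    RecName    Proto-oncogene c-Akt
Full protein name    RecName    RAC-PK-alpha
//
"""

# Line codes emitted by the stylesheet, in the desired column order.
COLUMNS = ["ID", "GN", "RecName"]


def records_to_rows(text, sep=";"):
    """Collapse each '//'-terminated record into one row, NA for missing codes."""
    rows = []
    current = {code: [] for code in COLUMNS}
    for line in text.splitlines():
        line = line.strip()
        if line == "//":
            # End of record: join repeated values, fall back to NA when empty.
            rows.append([sep.join(current[code]) or "NA" for code in COLUMNS])
            current = {code: [] for code in COLUMNS}
            continue
        # Assumed field separator: a tab, or two or more spaces.
        fields = re.split(r"\t+| {2,}", line)
        if len(fields) == 3 and fields[1] in current:
            current[fields[1]].append(fields[2])
    return rows


table = records_to_rows(SAMPLE_OUTPUT)
for row in table:
    print("\t".join(row))
```

The header line is skipped because its middle field is not one of the known codes. Single spaces inside protein names are preserved, since only tabs and runs of two or more spaces are treated as separators.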


There is a problem with the XSLT formatting on #biostar.


There is another issue: when I post a comment or any new post, it is duplicated and I need to remove one of them.


Thank you for the response. But where do I put this code?

8.2 years ago

If tab-delimited format is an option, you can use the UniProt web site to obtain this data (programmatically or interactively).

Run your query (text search or batch retrieve), then click on "Customize" to add columns for the sequence and for the transmembrane domains. To do the latter, click on "Sequence annotation (features)", then "Show", then select "Transmembrane" and click again on "Show". Remove ("hide") all columns you are not interested in.

The only detail that cannot be obtained exactly as you like is the fact that you want the "Recommended Name" only. Please note that UniProtKB/TrEMBL entries usually do not have "Recommended Names", but "Submitted Names".

For a preview of the first 10 entries from the complete human proteome: http://www.uniprot.org/uniprot/?query=organism%3a9606+keyword%3a181&sort=score&limit=10&format=tab&columns=id,entry%20name,genes,protein%20names,sequence,feature%28TRANSMEMBRANE%29

To see only the primary gene name and omit all synonyms, ordered locus names and ORF names, you can replace "genes" by "genes(PREFERRED)" in the above URL.
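
For the programmatic route, the query URL can be assembled rather than edited by hand. The Python sketch below only builds the URL string and does not fetch anything. The base address and parameter names are those of the URL in this answer and may have changed on the UniProt site since; the comma-separated column list is an assumption about how that URL is meant to read, and `build_query_url` is a made-up helper name.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Endpoint as given in this answer (assumption: it may no longer be served).
BASE_URL = "http://www.uniprot.org/uniprot/"


def build_query_url(query, columns, limit=10):
    """Build a tab-format query URL with the requested columns."""
    params = {
        "query": query,
        "sort": "score",
        "limit": limit,
        "format": "tab",
        "columns": ",".join(columns),  # assumed: comma-separated column names
    }
    return BASE_URL + "?" + urlencode(params)


# Column names as they appear in the URL of this answer.
columns = ["id", "entry name", "genes", "protein names",
           "sequence", "feature(TRANSMEMBRANE)"]

# Swap "genes" for "genes(PREFERRED)" to keep only the primary gene name.
preferred = ["genes(PREFERRED)" if c == "genes" else c for c in columns]

url = build_query_url("organism:9606 keyword:181", preferred)
print(url)

# Round trip: decode the query string again to check what was encoded.
decoded = parse_qs(urlsplit(url).query)
print(decoded["columns"][0])
```

The resulting address could then be opened in a browser or passed to a download tool; raising `limit` (or omitting it, if the service allows) would be needed to get all entries.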