Question: How To Edit Information Of Uniprot Downloads (Either Txt Or Xml)
7.4 years ago by
wilmijntje0 wrote:

I downloaded UniProt files for a group of proteins (n > 1000, so checking these proteins manually is not an option). The complete data files come as either a flat text file or an XML file. These files contain a lot of information (for an example, see here: then go to download and you can look at the first 10 for the complete data, either txt or xml file).

Since there is a lot of information in there that I do not need, I have to find a way to select the information I'm interested in (preferably as a data matrix). For every entry this is:

Wanted information:                 Text file entry:    XML file entry:
Uniprot ID                          ID                  <entry><name>
Gene Name                           GN                  <gene><name type="primary">
Full protein name                   RecName:            <protein><recommendedName><fullName>
Transmembrane domains (may be more) TRANSMEM            <feature type="transmembrane region"><location> This consists of <begin position="xxx"/> and <end position="yyy"/>
Full protein sequence               SQ                  <sequence>

Some entries will not contain all of this information (transmembrane domains, for example), and in that case NA can be filled in. Some entries will contain several items of the same kind (again, transmembrane domains), and for these all of them should be listed (if possible in the same cell, separated by "," or ";" or "|").
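For the flat text file, the selection described above could be sketched in Python roughly as follows. This is only a minimal sketch, not a complete UniProt parser: the function name `parse_flat_file`, the dictionary keys and the `SAMPLE` entries are made up for illustration, and it assumes that each entry starts with an `ID` line, ends with `//`, and that TRANSMEM positions are written either as `10     30` or as `10..30`.

```python
import re

NA = "NA"  # filler for missing values, as asked for in the question

def parse_flat_file(text):
    """Return one dict per entry found in UniProt flat-text data."""
    rows, entry, in_seq = [], None, False
    for line in text.splitlines():
        if line.startswith("ID   "):
            # New entry; the entry name is the second token on the ID line.
            entry = {"id": line.split()[1], "gene": NA, "name": NA,
                     "tm": [], "seq": []}
            in_seq = False
        elif entry is None:
            continue  # skip anything before the first ID line
        elif line.startswith("//"):
            # End of entry: collapse multi-valued fields into single cells.
            entry["tm"] = ";".join(entry["tm"]) or NA
            entry["seq"] = "".join(entry["seq"]) or NA
            rows.append(entry)
            entry = None
        elif line.startswith("GN   ") and entry["gene"] == NA:
            m = re.search(r"\bName=([^;{]+)", line)  # primary gene name only
            if m:
                entry["gene"] = m.group(1).strip()
        elif line.startswith("DE   ") and entry["name"] == NA:
            m = re.search(r"RecName: Full=([^;{]+)", line)  # first RecName only
            if m:
                entry["name"] = m.group(1).strip()
        elif line.startswith("FT   TRANSMEM"):
            # Accept both the "10     30" and the "10..30" position layouts.
            m = re.match(r"FT   TRANSMEM\s+(\d+)(?:\.\.|\s+)(\d+)", line)
            if m:
                entry["tm"].append(f"{m.group(1)}-{m.group(2)}")
        elif line.startswith("SQ   "):
            in_seq = True  # residues follow until the "//" terminator
        elif in_seq:
            entry["seq"].append(line.replace(" ", ""))
    return rows

# Two invented entries, only to exercise the function (not real UniProt data).
SAMPLE = """\
ID   TEST1_HUMAN             Reviewed;          20 AA.
DE   RecName: Full=Test protein one;
DE   AltName: Full=Other name;
GN   Name=TST1; Synonyms=T1;
FT   TRANSMEM        2..8
FT   TRANSMEM     10     15       Helical.
SQ   SEQUENCE   20 AA;  2000 MW;  0000000000000000 CRC64;
     MSDVAIVKEG WLHKRGEYIK
//
ID   TEST2_HUMAN             Unreviewed;        10 AA.
DE   SubName: Full=Something;
SQ   SEQUENCE   10 AA;  1000 MW;  0000000000000000 CRC64;
     MSDVAIVKEG
//
"""

rows = parse_flat_file(SAMPLE)
for row in rows:
    print("\t".join(row[k] for k in ("id", "gene", "name", "tm", "seq")))
```

For a real download, the text would be read from the file first (for example `parse_flat_file(open("proteins.txt").read())`, where the file name is a placeholder) and the rows written out with a tab between the columns.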

I am a bit familiar with R, but I wasn't able to get this far with it (possibly for lack of programming skills). I looked into XML editors (since that seemed to be the easiest solution), but I couldn't get any of them to work; I simply couldn't find anything that helped me on my way and explained the different steps. I also know that there should be a way to process XML files in R, but the help files didn't get me where I need to be either. In XMLQuire, the only editor I could download so far, I can view the file, but it keeps crashing when I try to do anything (even when I'm just trying to figure out where I can edit the file), so my file might be too large or there's another problem.

Help with this would be highly appreciated. I'm hoping to find someone who has done something similar, but all solutions are welcome, no matter how small and no matter which program I need to use, as long as it's freeware.

Also let me know if things are unclear, I really try to be as clear as possible. And sorry for being such a blondie on the subject.

R uniprot xml • 3.0k views
7.4 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum128k wrote:

Use XSLT to transform the XML into what you need, e.g.:

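<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:u="http://uniprot.org/uniprot"
                version="1.0">

<xsl:output method="text" encoding="UTF-8"/>
<xsl:param name="temporary">temporary</xsl:param>

<!-- Print the header line, then process the document -->
<xsl:template match="/">
<xsl:text>Wanted information    Text file entry    XML file entry:
</xsl:text>
<xsl:apply-templates select="u:uniprot"/>
</xsl:template>

<xsl:template match="u:uniprot">
<xsl:apply-templates select="u:entry"/>
</xsl:template>

<xsl:template match="u:entry">
<xsl:apply-templates select="u:name"/>
<xsl:apply-templates select="u:gene"/>
<xsl:apply-templates select="u:protein"/>
</xsl:template>

<!-- One line per gene name (primary name and synonyms) -->
<xsl:template match="u:gene">
<xsl:for-each select="u:name">
<xsl:text>Gene Name    GN    </xsl:text>
<xsl:value-of select="."/>
<xsl:text>&#10;</xsl:text>
</xsl:for-each>
</xsl:template>

<xsl:template match="u:name">
<xsl:text>Uniprot ID    ID    </xsl:text>
<xsl:value-of select="."/>
<xsl:text>&#10;</xsl:text>
</xsl:template>

<!-- This selects the alternative names; use u:recommendedName/u:fullName for the recommended name only -->
<xsl:template match="u:protein">
<xsl:for-each select="u:alternativeName/u:fullName">
<xsl:text>Full protein name    RecName    </xsl:text>
<xsl:value-of select="."/>
<xsl:text>&#10;</xsl:text>
</xsl:for-each>
</xsl:template>

</xsl:stylesheet>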



$ xsltproc stylesheet.xsl entry.xml

Wanted information    Text file entry    XML file entry:
Uniprot ID    ID    AKT1_HUMAN
Gene Name    GN    AKT1
Gene Name    GN    PKB
Gene Name    GN    RAC
Full protein name    RecName    Protein kinase B
Full protein name    RecName    Protein kinase B alpha
Full protein name    RecName    Proto-oncogene c-Akt
Full protein name    RecName    RAC-PK-alpha
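
If a scripting language is easier to run than XSLT, the same XML can also be flattened into one row per entry, including the transmembrane regions and the sequence. The following is a rough Python sketch using only the standard library. The names `entry_to_row`, `SAMPLE_XML` and the dictionary keys are invented for this example, the sample entries are not real UniProt data, and the element paths are the ones listed in the question.

```python
import xml.etree.ElementTree as ET

NS = {"u": "http://uniprot.org/uniprot"}  # namespace used by UniProt XML
NA = "NA"

def entry_to_row(entry):
    """Flatten one <entry> element into a dict of strings."""
    def text_of(path):
        node = entry.find(path, NS)
        text = (node.text or "").strip() if node is not None else ""
        return text or NA

    # Collect every transmembrane region as "begin-end".
    spans = []
    for feat in entry.findall("u:feature[@type='transmembrane region']", NS):
        begin = feat.find("u:location/u:begin", NS)
        end = feat.find("u:location/u:end", NS)
        if begin is not None and end is not None:
            spans.append(f"{begin.get('position')}-{end.get('position')}")

    return {
        "id": text_of("u:name"),
        "gene": text_of("u:gene/u:name[@type='primary']"),
        "name": text_of("u:protein/u:recommendedName/u:fullName"),
        "tm": ";".join(spans) or NA,
        # The sequence text may contain spaces or line breaks; remove them.
        "seq": "".join(text_of("u:sequence").split()),
    }

# Two invented entries, only to exercise the function.
SAMPLE_XML = """<uniprot xmlns="http://uniprot.org/uniprot">
  <entry>
    <name>TEST1_HUMAN</name>
    <protein><recommendedName><fullName>Test protein one</fullName></recommendedName></protein>
    <gene><name type="primary">TST1</name><name type="synonym">T1</name></gene>
    <feature type="transmembrane region"><location><begin position="2"/><end position="8"/></location></feature>
    <feature type="transmembrane region"><location><begin position="10"/><end position="15"/></location></feature>
    <sequence length="20">MSDVAIVKEG WLHKRGEYIK</sequence>
  </entry>
  <entry>
    <name>TEST2_HUMAN</name>
    <protein><submittedName><fullName>Something</fullName></submittedName></protein>
    <sequence length="10">MSDVAIVKEG</sequence>
  </entry>
</uniprot>"""

root = ET.fromstring(SAMPLE_XML)
xml_rows = [entry_to_row(e) for e in root.findall("u:entry", NS)]
for row in xml_rows:
    print("\t".join(row[k] for k in ("id", "gene", "name", "tm", "seq")))
```

For a downloaded file, `ET.parse("entries.xml").getroot()` (the file name is a placeholder) would take the place of `ET.fromstring(SAMPLE_XML)`.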

there is a problem with the XSLT formatting on #biostar.

written 7.4 years ago by Pierre Lindenbaum

There is another issue: when I post a comment or any new post, it gets duplicated and I need to remove one of the copies.

modified 7.4 years ago • written 7.4 years ago by Medhat

Thank you for the response. But where do I put this code?

written 7.4 years ago by wilmijntje
7.4 years ago by
Elisabeth Gasteiger1.7k wrote:

If tab-delimited format is an option, you can use the UniProt web site to obtain this data (programmatically or interactively).

Run your query (text search or batch retrieve), then click on "Customize" to add columns for the sequence and for the transmembrane domains. To do the latter, click on "Sequence annotation (features)", then "Show", then select "Transmembrane" and click again on "Show". Remove ("hide") all columns you are not interested in.

Once you are happy with your format, click on "Download" and select the tab-delimited format.

The only detail that cannot be obtained exactly as you like is the fact that you want the "Recommended Name" only. Please note that UniProtKB/TrEMBL entries usually do not have "Recommended Names", but "Submitted Names".

For a preview of the first 10 entries from the complete human proteome:

To see only the primary gene name and omit all synonyms, ordered locus names and ORF names, you can replace "genes" by "genes(PREFERRED)" in the above URL.
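
Once the tab-delimited file is downloaded, it can be loaded as a table with any scripting language. Below is a small Python sketch that reads it and fills empty cells with NA. The function name `read_uniprot_tsv`, the column headers and the sample rows are invented for illustration; the real headers depend on the columns chosen under "Customize".

```python
import csv
import io

def read_uniprot_tsv(handle, na="NA"):
    """Read a tab-delimited table and replace empty cells with `na`."""
    # QUOTE_NONE keeps quote characters inside protein names untouched.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    table = []
    for row in reader:
        table.append({col: (val or "").strip() or na
                      for col, val in row.items() if col is not None})
    return table

# Invented sample with hypothetical column headers.
SAMPLE_TSV = (
    "Entry name\tGene names\tProtein names\tTransmembrane\tSequence\n"
    "TEST1_HUMAN\tTST1\tTest protein one\t2-8; 10-15\tMSDVAIVKEGWLHKRGEYIK\n"
    "TEST2_HUMAN\t\tSomething\t\tMSDVAIVKEG\n"
)

tsv_rows = read_uniprot_tsv(io.StringIO(SAMPLE_TSV))
for row in tsv_rows:
    print(row["Entry name"], row["Gene names"], row["Transmembrane"], sep="\t")
```

With a real download, the handle would come from something like `open("uniprot.tab", newline="")`, where the file name is a placeholder.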
