Question: Problematic Representative Members Found in UniRef 100/90/50 XML Files
Pablo Pareja (Granada, Spain) wrote, 7.5 years ago:

Hi,

I'm currently working on preparing a new release of Bio4j and I just ran into a problem with Uniref 100/90/50 XML files.

I found that a lot of representative members don't follow the expected XML structure: either things are specified in a different way or some of the information is missing.

Supposedly, this is what an entry should look like:

<entry id="UniRef100_P99999" updated="2005-02-01">
 <name>Cytochrome c</name>
 <representativeMember>
   <dbReference type="UniProtKB ID" id="CYC_HUMAN" >
     <property type="UniProtKB accession" value="P99999" />
     <property type="UniProtKB accession" value="P00001" />
     <property type="UniProtKB accession" value="Q6NUR2" />
     <property type="UniProtKB accession" value="Q6NX69" />
     <property type="UniProtKB accession" value="Q96BV4" />
     <property type="UniParc ID" value="UPI0000128BBF" />
     <property type="UniRef90 ID" value="UniRef90_P99999"/>
     <property type="UniRef50 ID" value="UniRef50_P99999"/>
     <property type="protein name" value="Cytochrome c" />
     <property type="NCBI taxonomy" value="9606" />
     <property type="source organism" value="Homo sapiens" />
     <property type="length" value="104" />
     <property type="overlap region" value="2-105" />
   </dbReference>
   <sequence length="104" checksum="D47C9B513DF1C5C2">
     GDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWG
     EDTLMEYLENPKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE
   </sequence>
 </representativeMember>
</entry>

And this is a sample of the weird ones I've found:

<entry id="UniRef100_UPI000194DDB6" updated="2011-10-19">
  <name>Cluster: UPI000194DDB6 UniRef100 entry</name>
  <property type="member count" value="1"/>
  <property type="common taxon" value="root"/>
  <property type="common taxon ID" value="1"/>
  <representativeMember>
    <dbReference type="UniParc ID" id="UPI000194DDB6">
     <property type="UniRef90 ID" value="UniRef90_UPI0000E816FE"/>
     <property type="UniRef50 ID" value="UniRef50_Q92625"/>
     <property type="length" value="1128"/>
     <property type="isSeed" value="true"/>
    </dbReference>
    <sequence length="1128" checksum="30DD0A7E86C8660E">
    ....

As you can see, there is no UniProtKB accession for the protein, and no protein name either.
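A parser that expects every representative to carry UniProtKB properties will trip over entries like the one above. The following is a minimal sketch, not Bio4j's actual code, of reading a representative member while treating everything except the `dbReference` type and id as optional. The function names `local` and `describe_representative` are hypothetical, and the sample is a trimmed copy of the entry above with the `<sequence>` element left out. Real UniRef files may declare an XML namespace, so the sketch strips any namespace prefix from tag names.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip a namespace prefix, e.g. '{uri}entry' -> 'entry'."""
    return tag.rsplit("}", 1)[-1]


def describe_representative(entry):
    """Summarise the representative member of one <entry> element.

    Returns None if the entry has no <representativeMember>/<dbReference>.
    Properties that are absent come back as an empty list or None.
    """
    for member in entry:
        if local(member.tag) != "representativeMember":
            continue
        for ref in member:
            if local(ref.tag) != "dbReference":
                continue
            # Collect properties by type; a type may occur several times
            # (e.g. multiple "UniProtKB accession" values).
            props = {}
            for prop in ref:
                if local(prop.tag) == "property":
                    props.setdefault(prop.get("type"), []).append(prop.get("value"))
            return {
                "ref_type": ref.get("type"),
                "ref_id": ref.get("id"),
                "accessions": props.get("UniProtKB accession", []),
                "protein_name": (props.get("protein name") or [None])[0],
                "length": (props.get("length") or [None])[0],
            }
    return None


# Trimmed copy of the problematic entry from the question (assumption:
# the omitted parts do not matter for this illustration).
sample = """<entry id="UniRef100_UPI000194DDB6" updated="2011-10-19">
  <representativeMember>
    <dbReference type="UniParc ID" id="UPI000194DDB6">
      <property type="UniRef90 ID" value="UniRef90_UPI0000E816FE"/>
      <property type="UniRef50 ID" value="UniRef50_Q92625"/>
      <property type="length" value="1128"/>
      <property type="isSeed" value="true"/>
    </dbReference>
  </representativeMember>
</entry>"""

info = describe_representative(ET.fromstring(sample))
print(info)
```

With this approach an empty `accessions` list signals a representative without UniProtKB accessions, rather than a parse error.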

In total I have found 965,244 entries with some sort of problem or missing information in UniRef100 alone (you can find them in this txt file).

Do you have any idea why this may be happening? Are all these IDs related somehow?

I'd really appreciate any feedback.

Cheers,

Pablo Pareja

Raquel Tobes (Spain) wrote, 7.5 years ago:

On this UniProt page (http://www.uniprot.org/help/uniref) you can find the following information:

"In addition to UniProtKB records, UniRef100 also includes the UniParc entries that are not covered by UniProtKB and contain cross-references to the following databases:
• Ensembl Chicken
• Ensembl Cow
• Ensembl Dog
• Ensembl Fly
• Ensembl Fugu
• Ensembl Human
• Ensembl Mouse
• Ensembl Tetraodon
• Ensembl Rat
• Ensembl Xenopus
• Ensembl Zebrafish
• Refseq
• PDB"

The "weird" example entry is a UniParc record that originates from RefSeq. The reason these sequences remain as representatives could be related to the way UniRef100 is constructed. I think that deleting a representative could be complicated and have many side effects.


You are correct, these are all UniParc-only clusters. The UniParc sequence is the representative because it is the only sequence in the cluster, i.e. there is always a representative sequence in a cluster.

(reply by Jerven, 7.5 years ago)
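One way to check this against a local copy of the file is to stream through it and count representatives by their `dbReference` type. The sketch below is only an illustration under stated assumptions: the function name `count_representative_types` is hypothetical, the `<UniRef100>` root element and the two-entry demo document are made up for the example, and any namespace the real file declares is stripped from tag names.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def local(tag):
    """Strip a namespace prefix, e.g. '{uri}entry' -> 'entry'."""
    return tag.rsplit("}", 1)[-1]


def count_representative_types(source):
    """Stream a UniRef-style XML source and count representative
    members by the 'type' attribute of their <dbReference>."""
    counts = Counter()
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local(elem.tag) != "entry":
            continue
        for member in elem:
            if local(member.tag) != "representativeMember":
                continue
            for ref in member:
                if local(ref.tag) == "dbReference":
                    counts[ref.get("type")] += 1
        # Drop the processed entry's children to keep memory use down.
        elem.clear()
    return counts


# Made-up two-entry document standing in for a real UniRef100 file.
demo = io.BytesIO(b"""<UniRef100>
  <entry id="UniRef100_P99999">
    <representativeMember>
      <dbReference type="UniProtKB ID" id="CYC_HUMAN"/>
    </representativeMember>
  </entry>
  <entry id="UniRef100_UPI000194DDB6">
    <representativeMember>
      <dbReference type="UniParc ID" id="UPI000194DDB6"/>
    </representativeMember>
  </entry>
</UniRef100>""")

counts = count_representative_types(demo)
print(counts)
```

On a real file, passing the file path instead of the in-memory demo should work the same way, and the count for "UniParc ID" could then be compared with the number of problematic entries reported in the question.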

@Raquel Tobes & @jerven thanks for the info ;)

(reply by Pablo Pareja, 7.5 years ago)
Pmcgarvey wrote, 7.5 years ago:

Hi Pablo,

The UniRef representative in question is from UniParc.

http://www.uniprot.org/uniparc/UPI000194DDB6

UniRef includes more than UniProtKB: it also includes selected sequences from RefSeq, like this one, and from Ensembl. If there were a better member of the cluster (such as one from UniProtKB), it would be selected as the representative, but in many cases there is no such member. The missing information does not exist in UniParc, so it cannot be provided.

For a complete list of sources and a description of how we select a representative sequence, see this page or our manuscript: http://www.uniprot.org/help/uniref

Our goal is to move more of the RefSeq and Ensembl entries into UniProtKB, in which case the number of UniParc entries in UniRef will decline.

Hope this helps. Thanks for using UniRef and UniProt.

Peter
UniProt Consortium
