Question

Problematic Representative Members Found In UniRef 100/90/50 XML Files

5

12.5 years ago

Pablo Pareja ★ 1.6k

Hi,

I'm currently preparing a new release of Bio4j and I just ran into a problem with the UniRef 100/90/50 XML files.

I found that a lot of representative members don't match the XML structure I expected: either things are specified in a different way, or some of the information is missing.

Supposedly, this is what an entry should look like:

<entry id="UniRef100_P99999" updated="2005-02-01">
 <name>Cytochrome c</name>
 <representativeMember>
   <dbReference type="UniProtKB ID" id="CYC_HUMAN" >
     <property type="UniProtKB accession" value="P99999" />
     <property type="UniProtKB accession" value="P00001" />
     <property type="UniProtKB accession" value="Q6NUR2" />
     <property type="UniProtKB accession" value="Q6NX69" />
     <property type="UniProtKB accession" value="Q96BV4" />
     <property type="UniParc ID" value="UPI0000128BBF" />
     <property type="UniRef90 ID" value="UniRef90_P99999"/>
     <property type="UniRef50 ID" value="UniRef50_P99999"/>
     <property type="protein name" value="Cytochrome c" />
     <property type="NCBI taxonomy" value="9606" />
     <property type="source organism" value="Homo sapiens" />
     <property type="length" value="104" />
     <property type="overlap region" value="2-105" />
   </dbReference>
   <sequence length="104" checksum="D47C9B513DF1C5C2">
     GDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWG
     EDTLMEYLENPKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE
   </sequence>
 </representativeMember>
 </entry>

And this is a sample of the weird ones I've found:

<entry id="UniRef100_UPI000194DDB6" updated="2011-10-19">
  <name>Cluster: UPI000194DDB6 UniRef100 entry</name>
  <property type="member count" value="1"/>
  <property type="common taxon" value="root"/>
  <property type="common taxon ID" value="1"/>
  <representativeMember>
    <dbReference type="UniParc ID" id="UPI000194DDB6">
     <property type="UniRef90 ID" value="UniRef90_UPI0000E816FE"/>
     <property type="UniRef50 ID" value="UniRef50_Q92625"/>
     <property type="length" value="1128"/>
     <property type="isSeed" value="true"/>
    </dbReference>
    <sequence length="1128" checksum="30DD0A7E86C8660E">
    ....

As you can see, there's no UniProt accession for the protein, and no protein name either.
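A check along these lines could be used to flag such entries. It is only a minimal sketch: the list of "expected" property types is taken from the well-formed example above and is an assumption, not an official schema requirement, and the helper names `local` and `missing_properties` are made up for illustration. The sample string reproduces the truncated entry without its sequence element, since that part is elided in the post.

```python
import xml.etree.ElementTree as ET

# Property types present on the well-formed example; assumed to be the
# "expected" set for this check (not an official schema requirement).
EXPECTED = ("UniProtKB accession", "protein name",
            "NCBI taxonomy", "source organism")

def local(tag):
    """Strip an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]

def missing_properties(entry):
    """Return the expected property types absent from the representative member."""
    for child in entry:
        if local(child.tag) != "representativeMember":
            continue
        for ref in child:
            if local(ref.tag) == "dbReference":
                present = {p.get("type") for p in ref
                           if local(p.tag) == "property"}
                return [t for t in EXPECTED if t not in present]
    return list(EXPECTED)  # no representative member found at all

sample = """<entry id="UniRef100_UPI000194DDB6" updated="2011-10-19">
  <representativeMember>
    <dbReference type="UniParc ID" id="UPI000194DDB6">
      <property type="UniRef90 ID" value="UniRef90_UPI0000E816FE"/>
      <property type="UniRef50 ID" value="UniRef50_Q92625"/>
      <property type="length" value="1128"/>
      <property type="isSeed" value="true"/>
    </dbReference>
  </representativeMember>
</entry>"""

print(missing_properties(ET.fromstring(sample)))
```

For the entry above, all four expected property types are reported as missing.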

In total I have found 965,244 entries with some kind of problem or missing information in UniRef100 alone (you can find them in this txt file).

Do you have any idea why this may be happening? Are all these IDs related somehow?

I'd really appreciate any feedback.

Cheers,

Pablo Pareja

uniprot xml • 4.2k views

ADD COMMENT • link updated 12.5 years ago by Pmcgarvey ▴ 20 • written 12.5 years ago by Pablo Pareja ★ 1.6k

score 3 · Answer 1 · 2011-11-08

3

12.5 years ago

Raquel Tobes ▴ 160

On this UniProt page (http://www.uniprot.org/help/uniref) you can find the following information:

"In addition to UniProtKB records, UniRef100 also includes the UniParc entries that are not covered by UniProtKB and contain cross-references to the following databases:
• Ensembl Chicken
• Ensembl Cow
• Ensembl Dog
• Ensembl Fly
• Ensembl Fugu
• Ensembl Human
• Ensembl Mouse
• Ensembl Tetraodon
• Ensembl Rat
• Ensembl Xenopus
• Ensembl Zebrafish
• Refseq
• PDB"

The "weird" example entry is a UniParc record whose source is RefSeq. The reason these sequences remain as representatives could be related to the algorithm used to build UniRef100. I think that deleting a representative could be complicated and have many side effects.

ADD COMMENT • link 12.5 years ago by Raquel Tobes ▴ 160

0

You are correct, these are all UniParc-only clusters. The UniParc sequence is the representative because it's the only sequence in the cluster, i.e. there is always a representative sequence in a cluster.

ADD REPLY • link 12.5 years ago by Jerven ▴ 660
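To see how many clusters have a UniParc-only representative, the representative members can be tallied by their dbReference type. The sketch below streams the input with `iterparse` so a large file is not loaded at once. The function name `representative_types` and the `<entries>` wrapper element in the demo data are assumptions made for illustration; the real files may use a different root element and an XML namespace, which is why tags are compared by local name.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

def local(tag):
    """Strip an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]

def representative_types(source):
    """Count representative members by dbReference type while streaming."""
    counts = Counter()
    for _, elem in ET.iterparse(source, events=("end",)):
        if local(elem.tag) != "entry":
            continue
        for child in elem:
            if local(child.tag) == "representativeMember":
                for ref in child:
                    if local(ref.tag) == "dbReference":
                        counts[ref.get("type")] += 1
        elem.clear()  # discard the entry's children once it has been counted
    return counts

# Demo data: two cut-down entries inside a hypothetical <entries> wrapper.
demo = b"""<entries>
  <entry id="UniRef100_P99999">
    <representativeMember>
      <dbReference type="UniProtKB ID" id="CYC_HUMAN"/>
    </representativeMember>
  </entry>
  <entry id="UniRef100_UPI000194DDB6">
    <representativeMember>
      <dbReference type="UniParc ID" id="UPI000194DDB6"/>
    </representativeMember>
  </entry>
</entries>"""

counts = representative_types(io.BytesIO(demo))
print(counts)
```

Entries counted under "UniParc ID" are the UniParc-only representatives discussed here; a file path can be passed instead of the in-memory buffer.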

0

@Raquel Tobes & @jerven thanks for the info ;)

ADD REPLY • link 12.5 years ago by Pablo Pareja ★ 1.6k

score 2 · Answer 2 · 2011-11-09

Pablo

Hi, the UniRef representative in question is from UniParc:

http://www.uniprot.org/uniparc/UPI000194DDB6

UniRef includes more than UniProtKB: it also includes selected sequences from RefSeq, like this one, and from Ensembl. If there were a better member in the cluster (such as one from UniProtKB), it would be selected as the representative, but in many cases there is no such member. The missing information does not exist in UniParc, so it cannot be provided.

For a complete list of sources and a description of how we select a representative sequence, see this page or our manuscript: http://www.uniprot.org/help/uniref

Our goal is to move more of the RefSeq and Ensembl entries into UniProtKB, in which case the number of UniParc entries in UniRef will decline.

Hope this helps. Thanks for using UniRef and UniProt.

Peter UniProt Consortium