Question: Work-arounds for lack of Uniref for proteins in Uniprot proteomes now deemed redundant?
Richard Llewellyn (United States) wrote, 3.6 years ago:

My work involves comparing similar prokaryotic organisms. Since UniProt reduced its coverage by designating many proteomes 'redundant', I can no longer rely on UniRef90 or UniRef50 to help cluster proteins by sequence similarity. Apparently UniRef draws its sequences from UniProt, not UniParc.

It's important to note that UniProt makes the redundancy determination proteome by proteome, not protein by protein, so each 'redundant' proteome typically contains a handful of apparently novel proteins that cannot be found in UniProt. They are, however, in UniParc.

My current workaround is to separately cluster all the proteins that are in UniParc but not in UniProt. The majority cluster with existing UniRef sequences, but many do not. I'm using USEARCH from drive5. It works, but it is time-consuming and requires creating my own protein clusters.
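
The first step of that workaround could be sketched as follows, assuming both sets are available as FASTA files: collect the sequences that occur in the UniParc download but not in the UniProt one, so that only those need to be clustered separately. The function names and the exact-sequence-match criterion are illustrative assumptions, not part of any UniProt or USEARCH tooling.

```python
from typing import Dict, Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def uniparc_only(
    uniparc_lines: Iterable[str], uniprot_lines: Iterable[str]
) -> Dict[str, str]:
    """Return {header: sequence} for UniParc records absent from UniProt.

    Assumption: two records count as the same protein only when their
    sequences are identical (case-insensitive).
    """
    known = {seq.upper() for _, seq in read_fasta(uniprot_lines)}
    return {
        header: seq
        for header, seq in read_fasta(uniparc_lines)
        if seq.upper() not in known
    }
```

Both arguments can be open file handles, since a text file iterates line by line. The resulting dictionary can then be written back out as FASTA and passed to whatever clustering tool is in use.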

I'm curious whether others are dealing with a similar problem, and whether they have found any community solutions.

dankwc2000 (United Kingdom) wrote, 3.6 years ago:

I'm also having issues with UniProt's proteome redundancy reduction program. I first built a custom E. coli database from UniProtKB before they rolled out the redundancy program earlier in the year. For my MS work I ran my initial search against that database, and when I later mapped the hits back to UniProt I realised that the redundancy removal had made a lot of them redundant. I then built a custom non-redundant E. coli database from UniRef100, just before they rolled out their most recent round of redundancy removal on 24 July 2015. Yesterday I was mapping some of my most recent searches against the UniRef100-based database back to UniProt, and again some of the entries are redundant or obsolete.

Right now I am manually going through the redundant/obsolete entries and matching each one with another protein in the same cluster in UniParc.
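
That manual matching could perhaps be automated along these lines. This is only a sketch: it assumes you already have a table of (cluster ID, accession) pairs and a set of accessions that are still active in the current release. Both inputs, and the function name `build_replacements`, are hypothetical, and how to obtain them is not covered here.

```python
from typing import Dict, Iterable, List, Optional, Set, Tuple


def build_replacements(
    memberships: Iterable[Tuple[str, str]],
    active: Set[str],
) -> Dict[str, Optional[str]]:
    """Map each inactive accession to an active one from the same cluster.

    memberships: (cluster_id, accession) pairs (hypothetical input table).
    active: accessions still present in the current release.
    An inactive accession maps to None when its cluster has no active member.
    """
    clusters: Dict[str, List[str]] = {}
    for cluster_id, accession in memberships:
        clusters.setdefault(cluster_id, []).append(accession)

    replacements: Dict[str, Optional[str]] = {}
    for members in clusters.values():
        # Sort so the chosen substitute is deterministic across runs.
        survivors = sorted(a for a in members if a in active)
        substitute = survivors[0] if survivors else None
        for accession in members:
            if accession not in active:
                replacements[accession] = substitute
    return replacements
```

Entries that map to `None` are the ones that would still need manual inspection, because no member of their cluster survives.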


Thanks for your perspective.

For those of us interested in as complete a picture as possible of prokaryotic protein space, I think we now have to move to UniParc as the primary reference. It might not be so in your case, but a significant number of the proteins made 'redundant' actually have no clear homolog in UniProt, which underlines that redundancy is judged genome by genome.

Unfortunately, UniParc is not as well supported. Most of the reference mapping between UniParc and other databases goes through UniProt, so once a protein has been made redundant, its UniParc sequence cannot easily be mapped. There is a 60 GB XML (!!!) file, uniparc_match, that is of some use.
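
A file of that size cannot sensibly be loaded whole, but it can be streamed. Below is a minimal sketch using the Python standard library's `iterparse`. The element name you pass in (for example "entry") is a placeholder, since nothing here assumes the actual schema of uniparc_match, and `stream_elements` is an illustrative helper name.

```python
import io
import xml.etree.ElementTree as ET
from typing import BinaryIO, Iterator


def stream_elements(source: BinaryIO, tag: str) -> Iterator[ET.Element]:
    """Yield each element whose local name equals `tag`, one at a time.

    The caller must finish using an element before requesting the next
    one, because the parsed tree is discarded after each yield.
    """
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root
    for event, elem in context:
        if event != "end":
            continue
        # Strip a namespace prefix such as "{http://...}entry" if present.
        local_name = elem.tag.rsplit("}", 1)[-1]
        if local_name == tag:
            yield elem
            root.clear()  # drop processed elements so memory stays bounded
```

Inside the loop, each yielded element can be inspected with the usual `get`, `find` and `iter` methods to pull out whatever cross-references the file provides.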

Reply written 3.6 years ago by Richard Llewellyn