Question

Work-arounds for the lack of UniRef coverage for proteins in UniProt proteomes now deemed redundant?

3

8.4 years ago

Richard Llewellyn ▴ 180

My work involves comparing similar prokaryotic organisms, and since UniProt reduced its coverage by flagging many proteomes as 'redundant,' I can no longer rely on UniRef90 or UniRef50 to help cluster proteins by sequence similarity. Apparently UniRef draws its sequences from UniProt, not UniParc.

It's important to note that UniProt makes the redundancy determination on a proteome-by-proteome basis, not a protein-by-protein basis, so each 'redundant' proteome typically contains a handful of apparently novel proteins that cannot be found in UniProt. They are, however, in UniParc.

Currently my work-around involves separately clustering all the proteins that are in UniParc but not in UniProt. The majority cluster with existing UniRef sequences, but many do not. I'm using USEARCH from drive5. It works, but it is time-consuming and requires creating my own protein clusters.
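A minimal sketch of the first step of such a work-around, assuming both sequence sets are available as FASTA text: it separates candidate proteins whose exact sequence already appears in a reference set from those that do not, so only the remainder needs to go to the clustering tool. The function names and the exact-match criterion are illustrative assumptions, not part of the original workflow.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_known_and_novel(candidates, reference_seqs):
    """Split (header, seq) pairs by exact sequence membership.

    `reference_seqs` is a set of upper-case sequences (hypothetical input,
    e.g. built from a UniProt FASTA download). Exact matching only finds
    identical sequences; near-identical ones still need clustering.
    """
    known, novel = [], []
    for header, seq in candidates:
        target = known if seq.upper() in reference_seqs else novel
        target.append((header, seq))
    return known, novel
```

The `novel` list can then be written back out as FASTA and passed to whichever clustering tool is in use.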

I'm curious if others are dealing with a similar problem, and if they have found any community solutions.

uniref proteome uniprot genome uniparc • 2.4k views

updated 20 months ago by Ram 43k • written 8.4 years ago by Richard Llewellyn ▴ 180

Ram · Answer 1 · 2015-11-18

1

8.4 years ago

dankwc2000 ▴ 20

I'm also having issues with UniProt's proteome redundancy reduction program. I first built a custom E. coli database from UniProtKB before they rolled out the redundancy program earlier in the year. For my MS work I ran my initial searches against that database, and only later, when mapping back to UniProt, did I realise that the redundancy removal had made a lot of my hits redundant. I then built a custom non-redundant E. coli database from UniRef100, just before they rolled out their most recent round of redundancy removal on 24 July 2015. Yesterday I was mapping some of my most recent searches against the UniRef100 database back to UniProt, and again some entries are redundant/obsolete.

Right now I am manually going through the redundant/obsolete entries and matching each one with another protein within the same cluster in UniParc.
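The manual matching described above could be partly automated along these lines. This is only a sketch: it assumes you have already prepared a table of cluster memberships and a set of accessions that are still active, and both inputs, as well as the function names, are hypothetical.

```python
def build_member_index(clusters):
    """Map each accession to the ID of the cluster that contains it."""
    index = {}
    for cluster_id, members in clusters.items():
        for accession in members:
            index[accession] = cluster_id
    return index


def suggest_replacements(obsolete, clusters, active):
    """For each obsolete accession, suggest an active member of its cluster.

    Returns a dict mapping accession -> replacement accession, or None when
    the accession is in no cluster or the cluster has no other active member.
    The first active member in list order is chosen; a real workflow might
    prefer a reviewed entry or the cluster representative instead.
    """
    index = build_member_index(clusters)
    suggestions = {}
    for accession in obsolete:
        members = clusters.get(index.get(accession), [])
        candidates = [m for m in members if m != accession and m in active]
        suggestions[accession] = candidates[0] if candidates else None
    return suggestions
```

Entries that come back as `None` are the ones that would still need checking by hand.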

updated 4.4 years ago by Ram 43k • written 8.4 years ago by dankwc2000 ▴ 20

0

Thanks for your perspective.

For those of us interested in as complete a picture as possible of prokaryotic protein space, I think we now have to move to UniParc as the primary reference. It might not be your case, but a significant number of the proteins made 'redundant' actually have no clear homolog in UniProt, which emphasizes that the redundancy judgement is made on a genome-by-genome basis.

Unfortunately, UniParc is not as well supported. Most of the reference mapping between UniParc and other databases is done through UniProt, so if a protein has been made redundant, its UniParc sequence cannot be easily mapped. There is a 60 GB XML (!!!) file, uniparc_match, that is of some use.
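A file of that size cannot be loaded into memory whole, but it can be streamed. Below is a minimal sketch using Python's standard `xml.etree.ElementTree.iterparse`. The element names in the demo (`protein`, `match`) are placeholders and not a statement about the real file's schema, so check the actual tags before using it.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace prefix such as '{http://...}' from a tag."""
    return tag.rsplit("}", 1)[-1]


def iter_records(source, record_tag):
    """Stream records from a large XML file without holding it in memory.

    `source` is a filename or binary file object; `record_tag` is the local
    name of the repeating element (an assumption the caller must verify).
    Yields (attributes, children), where children is a list of
    (child_tag, child_attributes) tuples.
    """
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # first event is the start of the root element
    for event, elem in context:
        if event == "end" and local_name(elem.tag) == record_tag:
            children = [(local_name(c.tag), dict(c.attrib)) for c in elem]
            yield dict(elem.attrib), children
            root.clear()  # drop processed records to keep memory flat


if __name__ == "__main__":
    # Tiny made-up sample; the tag and attribute names are hypothetical.
    sample = b'<db><protein id="UPI0001"><match id="m1"/></protein></db>'
    for attrs, children in iter_records(io.BytesIO(sample), "protein"):
        print(attrs, children)
```

Because each record is discarded after it is yielded, memory use depends on the size of one record and not on the size of the file.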

updated 4.4 years ago by Ram 43k • written 8.4 years ago by Richard Llewellyn ▴ 180