My work involves comparing similar prokaryotic organisms. Since UniProt reduced its coverage by designating many proteomes 'redundant,' I can no longer rely on UniRef90 or UniRef50 to help cluster proteins by sequence similarity. Apparently UniRef is built from UniProt, not UniParc.
Note that UniProt makes the redundancy determination proteome by proteome, not protein by protein, so each 'redundant' proteome typically contains a handful of apparently novel proteins that cannot be found in UniProt. They are, however, in UniParc.
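To make the gap concrete, here is a minimal sketch of how the proteins missing from UniProt could be picked out of one 'redundant' proteome. It assumes both sets are available as FASTA text and that "missing" means no exact sequence match; the function names read_fasta and missing_from_uniprot are made up for illustration and are not part of any UniProt tool.

```python
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA text lines into {identifier: sequence}."""
    records: Dict[str, str] = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the ID.
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            # Sequences may span several lines; normalise case while joining.
            records[name] += line.upper()
    return records


def missing_from_uniprot(proteome: Dict[str, str],
                         uniprot: Dict[str, str]) -> Dict[str, str]:
    """Return proteome entries whose exact sequence is absent from the UniProt set."""
    known = set(uniprot.values())
    return {pid: seq for pid, seq in proteome.items() if seq not in known}
```

With real data, the two dictionaries would come from something like read_fasta(open(path)) for each file; the sequences returned are the ones that then need to be clustered separately.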
My current workaround is to separately cluster all the proteins that are in UniParc but not in UniProt. The majority cluster with existing UniRef sequences, but many do not. I'm using USEARCH from drive5. It works, but it is time-consuming and requires creating my own protein clusters.
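For anyone wanting to try the same thing, this is roughly how the search results can be split into "joins an existing UniRef cluster" and "needs a new cluster". It is only a sketch: it assumes the extra proteins were searched against UniRef representative sequences and the results written in USEARCH's tab-separated .uc format, where (as I understand it) the first field is the record type (H for a hit, N for no hit), the ninth is the query label and the tenth is the target label. Check that against the USEARCH version you are running. The function name split_by_hit is my own.

```python
from typing import Dict, Iterable, List, Tuple


def split_by_hit(uc_lines: Iterable[str]) -> Tuple[Dict[str, str], List[str]]:
    """Split .uc records into queries with a hit and queries without one.

    Returns (assigned, unassigned): assigned maps each query label to the
    target label it matched; unassigned lists queries with no hit.
    """
    assigned: Dict[str, str] = {}
    unassigned: List[str] = []
    for line in uc_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10:
            continue  # skip blank or malformed lines
        record_type, query, target = fields[0], fields[8], fields[9]
        if record_type == "H":
            assigned[query] = target
        elif record_type == "N":
            unassigned.append(query)
        # Other record types (e.g. cluster summaries) are ignored here.
    return assigned, unassigned
```

The unassigned list is the set that still has to be clustered from scratch, which is the time-consuming part.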
I'm curious whether others are dealing with a similar problem, and whether they have found any community solutions.