How to "enrich" a large set of proteins (>11M) for protein hits in UniRef90?
17 months ago
O.rka ▴ 720

I have 11,341,241 proteins that I want to align to UniRef90 (most likely with DIAMOND, but I am open to MMseqs2 or another tool if it is faster). I suspect many of these will not have a hit, and it will be quite expensive to run all of the proteins against UniRef.

Is there a way to "enrich" my protein set for hits in UniRef90? Possibly by running DIAMOND at low sensitivity first and then again at higher sensitivity on the hits? What about running against UniRef50 first and then UniRef90 with the hits? My end goal is a mapping between [id_protein] and [id_uniref90]. A smaller "enriched" subset of my protein set would decrease the computational cost.

Any suggestions?
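To make the two-pass idea concrete: after a fast, low-sensitivity pass, you only need the query IDs that got any hit, and then a subset of the original FASTA to feed into the sensitive pass. A minimal sketch of that subsetting step, assuming DIAMOND's default tabular output (`--outfmt 6`, query ID in column 1); the file names are hypothetical:

```python
def hit_queries(diamond_tsv):
    """Collect query IDs (column 1) that had at least one hit
    in a DIAMOND --outfmt 6 tabular file."""
    hits = set()
    with open(diamond_tsv) as fh:
        for line in fh:
            if line.strip():
                hits.add(line.split("\t")[0])
    return hits

def subset_fasta(fasta_in, fasta_out, keep):
    """Write only the FASTA records whose ID is in `keep`."""
    write = False
    with open(fasta_in) as fin, open(fasta_out, "w") as fout:
        for line in fin:
            if line.startswith(">"):
                seq_id = line[1:].split()[0]
                write = seq_id in keep
            if write:
                fout.write(line)
```

The enriched FASTA written by `subset_fasta` is what you would then align at higher sensitivity; note this strategy can only lose hits that the low-sensitivity pass misses entirely, which is the trade-off you are buying the speedup with.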

uniref proteins alignment

If UniRef90 is smaller than your set, then why not do the search the other way around?

You can always try to process your proteins with MMseqs2 to remove redundancy before doing the search.


Are you thinking of running easy-cluster on my set against itself and using one representative per cluster in the DIAMOND call?
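If you do go the easy-cluster route, MMseqs2 writes a two-column TSV of representative/member pairs (the representative is also listed as a member of its own cluster), so hits found for the representatives can be propagated back to every member afterwards. A minimal sketch of that expansion step; the variable and file names are hypothetical:

```python
def expand_cluster_hits(cluster_tsv, rep_to_uniref):
    """Map every cluster member to the UniRef90 hit of its cluster
    representative (representatives without a hit are skipped).

    cluster_tsv: MMseqs2-style TSV, one "rep<TAB>member" pair per line.
    rep_to_uniref: dict of representative ID -> UniRef90 ID.
    """
    mapping = {}
    with open(cluster_tsv) as fh:
        for line in fh:
            rep, member = line.rstrip("\n").split("\t")
            if rep in rep_to_uniref:
                mapping[member] = rep_to_uniref[rep]
    return mapping
```

One caveat to keep in mind: a member inherits its representative's UniRef90 assignment, which is an approximation whenever the member is less similar to the UniRef90 sequence than the representative is.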


You could cluster first and then do your normal processing. You could also cluster and then use UniRef90 as the query against your data.
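If you run the search in the reversed direction (UniRef90 as the query, your proteins as the database), the tabular output has the UniRef ID in column 1 and your protein ID in column 2, so the [id_protein] → [id_uniref90] mapping has to be built by inverting the columns and keeping one hit per protein. A sketch assuming `--outfmt 6` with the bit score in column 12, keeping the best-scoring hit (the file name is hypothetical):

```python
def protein_to_uniref(reversed_tsv):
    """Build protein -> (uniref_id, bitscore) from a reversed search,
    keeping the highest bit score seen for each protein."""
    best = {}
    with open(reversed_tsv) as fh:
        for line in fh:
            cols = line.rstrip("\n").split("\t")
            uniref, protein, bits = cols[0], cols[1], float(cols[11])
            if protein not in best or bits > best[protein][1]:
                best[protein] = (uniref, bits)
    return best
```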


My plan was to split my job into 100 subsets of the data. Would it make sense to run DIAMOND against UniRef50 first and then UniRef90, since the UniRef50 sequence space is about a third of the size? Or am I missing a step in the logic of how UniRef is constructed?
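For the splitting step itself, dealing the records round-robin into N files keeps the chunks roughly balanced without reading the file twice. A minimal sketch (the prefix and chunk count are placeholders):

```python
def split_fasta(fasta_in, out_prefix, n_chunks=100):
    """Deal FASTA records round-robin into n_chunks files named
    <out_prefix>.<i>.faa; returns the list of file paths."""
    paths = [f"{out_prefix}.{i}.faa" for i in range(n_chunks)]
    outs = [open(p, "w") for p in paths]
    try:
        idx = -1
        with open(fasta_in) as fin:
            for line in fin:
                if line.startswith(">"):
                    idx = (idx + 1) % n_chunks  # next record, next chunk
                outs[idx].write(line)
    finally:
        for fh in outs:
            fh.close()
    return paths
```

Each chunk can then be submitted as an independent DIAMOND job; round-robin dealing balances record counts but not sequence lengths, which is usually close enough for protein sets this large.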

