Question

All human pairwise sequence identities or similarities

3.5 years ago

Sergio Martínez Cuesta ▴ 230

Hi everyone,

The human proteome according to UniProtKB contains 20,370 reviewed proteins. I would like to create a matrix of size 20,370 x 20,370 containing all pairwise protein sequence identities or similarities (ranging from 0 to 1). I would very much appreciate any hints regarding the following:

(a) Have protein sequence identities or similarities already been pre-computed and made available for download? I am familiar with the UniRef clusters at 100%, 90% and 50% sequence identity; however, I am interested in the pairwise sequence identities themselves rather than in the sequence clusters.

(b) A number of robust tools have already been developed to calculate sequence similarities / identities and to cluster proteins, e.g. MMseqs2, Clustal Omega or blastall. Are there any other good tools that you are familiar with for an all-against-all pairwise sequence similarity calculation? It would be great if you could share them on this thread.
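
To illustrate what the result of such an all-against-all run might look like once collected, here is a minimal Python sketch. It assumes the search tool writes BLAST-style tab-separated hits (for example BLAST+ with `-outfmt 6`), where the first three columns are query ID, target ID and percent identity over the local alignment. The function names `n_unique_pairs` and `hits_to_identity` and the sample rows are made up for illustration. Pairs for which no hit is reported are simply absent from the result and could be treated as 0.

```python
import io


def n_unique_pairs(n):
    """Number of unordered pairs among n sequences, excluding self-pairs."""
    return n * (n - 1) // 2


def hits_to_identity(handle):
    """Read BLAST-style tabular hits into {(id_a, id_b): identity in 0..1}.

    Assumption: tab-separated lines whose first three fields are
    query ID, target ID and percent identity (0..100).
    Only the best identity per unordered pair is kept.
    """
    identities = {}
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, target = fields[0], fields[1]
        pident = float(fields[2])
        if query == target:
            continue  # self-hits carry no pairwise information
        key = tuple(sorted((query, target)))  # A-vs-B and B-vs-A share a key
        identities[key] = max(identities.get(key, 0.0), pident / 100.0)
    return identities


# Hypothetical rows, not real search output.
sample = io.StringIO(
    "protA\tprotA\t100.0\t110\n"
    "protA\tprotB\t47.5\t108\n"
    "protB\tprotA\t45.0\t100\n"
)
print(n_unique_pairs(20370))      # unordered pairs for 20,370 proteins
print(hits_to_identity(sample))
```

Because the matrix is symmetric, only the 20,370 x 20,369 / 2 unordered pairs need to be stored, and a sparse mapping like the one above avoids allocating entries for the many pairs with no detectable similarity.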

Any hints would be greatly appreciated.

Thanks, Sergio

protein sequence-comparison • 853 views

updated 13 months ago by Ram 43k • written 3.5 years ago by Sergio Martínez Cuesta ▴ 230

"I would like to create a matrix of size 20,370 x 20,370 containing all protein sequence identities or similarities (ranging from 0 to 1)."

Not sure how you would come up with a score between 0 and 1. Proteins can be of very different sizes, e.g. insulin vs. titin. You could force them all to start at amino acid 1, but any identity matrix you generate that way would be a theoretical exercise.
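
To make the length problem concrete, here is a small Python sketch that takes one already-aligned pair (gaps written as `-`) and divides the number of identical positions by three different denominators. The function name `identity_fractions` and the toy sequences are hypothetical, and which denominator is appropriate is exactly the open question.

```python
def identity_fractions(aln_a, aln_b):
    """Identity of two gapped, equal-length aligned strings, three ways.

    Returns fractions in 0..1 using as denominator: the number of columns
    where both sequences have a residue, the shorter ungapped length,
    and the longer ungapped length.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have the same length")

    matches = 0
    aligned_cols = 0
    for a, b in zip(aln_a, aln_b):
        if a != "-" and b != "-":
            aligned_cols += 1
            if a == b:
                matches += 1

    len_a = sum(1 for c in aln_a if c != "-")  # ungapped length of A
    len_b = sum(1 for c in aln_b if c != "-")  # ungapped length of B

    def frac(denominator):
        return matches / denominator if denominator else 0.0

    return {
        "over_aligned_columns": frac(aligned_cols),
        "over_shorter_sequence": frac(min(len_a, len_b)),
        "over_longer_sequence": frac(max(len_a, len_b)),
    }


# Toy example: a 10-residue sequence against a 4-residue one.
print(identity_fractions("ACDEFGHIKL", "--DEYG----"))
```

For this toy pair, three of the four aligned positions match, so the score is 0.75 relative to the aligned columns or the shorter sequence but only 0.3 relative to the longer one. For a pair like insulin vs. titin the gap between these definitions would be far larger.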

3.4 years ago by GenoMax 141k