Hi, I have several protein datasets which I want to compare. I want to use sequence identity (not IDs) to count the number of shared proteins between each pair of datasets.
I want to identify three kinds of match: proteins which match perfectly, proteins which are nearly identical (isoforms), and different proteins which are highly similar over a large region of both proteins.
So far, I have created a local BLAST database for each dataset and blasted each dataset against all the others. I then parsed the XML output and have been able to find the highest-scoring hit for each protein.
I'm not sure which score is the best measure of similarity for this task. If I look for high-scoring proteins (above a cutoff), I often miss some near-perfect matches, and I have similar problems with e-values. I'm writing my own filter (based on length of match vs. query length, e-value and score), which is working reasonably well. Is this a suitable solution, or am I missing something obvious? Thanks
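
To make the filter idea concrete, here is a minimal sketch of the kind of thing I mean. It assumes each parsed hit has already been reduced to a few numbers; the `Hit` fields, the `classify` function, the category names and every threshold are illustrative assumptions, not values from my actual script or recommended cutoffs. It uses the fraction of each protein covered by the alignment and the percent identity rather than the raw score, since score grows with alignment length.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One query/subject alignment, reduced to the numbers the filter needs.

    All field names are hypothetical; they would be filled from the parsed
    BLAST XML output.
    """
    query_len: int     # full length of the query protein
    subject_len: int   # full length of the subject protein
    query_span: int    # number of query residues covered by the alignment
    subject_span: int  # number of subject residues covered by the alignment
    align_len: int     # alignment length, including gap columns
    identities: int    # number of identical positions in the alignment
    evalue: float


def classify(hit, max_evalue=1e-10, min_cov=0.9, near_ident=0.95, min_ident=0.5):
    """Return 'identical', 'near_identical', 'similar' or 'no_match'.

    The thresholds are placeholders and would need tuning on real data.
    """
    if hit.evalue > max_evalue:
        return "no_match"

    # Perfect match: every position identical and both proteins fully covered.
    if hit.identities == hit.align_len == hit.query_len == hit.subject_len:
        return "identical"

    # Require the alignment to cover a large part of BOTH proteins,
    # so a short domain shared by two long proteins is not counted.
    coverage = min(hit.query_span / hit.query_len,
                   hit.subject_span / hit.subject_len)
    identity = hit.identities / hit.align_len

    if coverage >= min_cov and identity >= near_ident:
        return "near_identical"
    if coverage >= min_cov and identity >= min_ident:
        return "similar"
    return "no_match"


if __name__ == "__main__":
    example = Hit(query_len=100, subject_len=110, query_span=100,
                  subject_span=100, align_len=100, identities=98,
                  evalue=1e-40)
    print(classify(example))
```

Taking the minimum of the two coverage values is what enforces the "large region of both proteins" requirement; using query coverage alone would let a short query match a small part of a much longer subject.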