Question: Comparing Protein Datasets By Sequence
8.7 years ago by
Kevin100 wrote:

Hi, I have several protein datasets which I want to compare. I want to use sequence identity (not IDs) to count the number of shared proteins between each pair of datasets.

I want to identify proteins which match perfectly, proteins which are nearly identical (isoforms), and different proteins which are highly similar over a large region of both proteins.

So far, I have created a local blast database for each dataset and blasted each dataset against all others. I then parsed the XML output and have been able to find the highest scoring proteins.

I'm not sure which score is the best measure of similarity for this task. If I look for high-scoring proteins (above a cutoff) I often miss some near-perfect matches, and I have similar problems with e-values. I'm writing my own filter (based on length of match vs. query length, e-value and score), which is working reasonably well. Is this a suitable solution or am I missing something obvious? Thanks

comparison blast • 2.5k views
8.7 years ago by
Leonor Palmeira3.7k
Liège, Belgium
Leonor Palmeira3.7k wrote:

This seems to me a perfectly suitable solution. When you want to determine the similarity between two proteins (which is what your problem boils down to), I would indeed recommend a filter based on %similarity (or %identity, which is fairly linearly correlated), the %length of the matching fragments (compared to the query length), and of course a threshold on the e-value. I rarely use thresholds on the score, as it is correlated with sequence length and is not easily comparable between hits.
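
A filter along these lines could be sketched roughly as follows. The `Hit` fields, the `classify_hit` name and all three thresholds are illustrative assumptions, not values anyone in this thread recommended. Note that the alignment length reported by BLAST includes gaps, so the coverage computed here is only an approximation.

```python
from typing import NamedTuple, Optional


class Hit(NamedTuple):
    """One BLAST hit plus the lengths of both full sequences (hypothetical layout)."""
    query_id: str
    subject_id: str
    pct_identity: float   # percent identity over the aligned region
    align_length: int     # alignment length in columns (includes gaps)
    evalue: float
    query_length: int
    subject_length: int


def classify_hit(hit: Hit,
                 min_identity: float = 90.0,   # assumed cutoff, tune for your data
                 min_coverage: float = 0.8,    # assumed cutoff, tune for your data
                 max_evalue: float = 1e-10) -> Optional[str]:
    """Return 'identical', 'similar' or None for a single hit."""
    if hit.evalue > max_evalue:
        return None
    # Coverage relative to both sequences, capped at 1.0 because gaps can
    # make the alignment longer than either sequence.
    q_cov = min(1.0, hit.align_length / hit.query_length)
    s_cov = min(1.0, hit.align_length / hit.subject_length)
    # A perfect match: 100% identity spanning both sequences end to end.
    if (hit.pct_identity == 100.0
            and hit.align_length == hit.query_length == hit.subject_length):
        return "identical"
    # Requiring coverage of BOTH sequences rejects short local matches.
    if hit.pct_identity >= min_identity and min(q_cov, s_cov) >= min_coverage:
        return "similar"
    return None
```

Taking the smaller of the two coverages is one way to express "similar over a large region of both proteins"; filtering on query coverage alone would also accept a short query that sits inside a much longer subject.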

8.7 years ago by
Eric Fournier1.4k
Quebec, Canada
Eric Fournier1.4k wrote:

If you are using NCBI's command line blast suite, the blast2 program's -m 8 option (Alignment view options -> tabular) will output your results in a tabular format which contains the percent identity of each match as well as the alignment length. It will also be a lot more straightforward to parse than the XML output.

You're still going to have to grab the original sequences' lengths from elsewhere, though.
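
As a rough sketch of what that parsing might look like, assuming the standard 12-column tabular layout (which is also what the newer BLAST+ programs write with `-outfmt 6`) and plain FASTA input: the function names below are made up for illustration, and the sequence ID is assumed to be the first word of each FASTA header.

```python
import csv

# The 12 default columns of BLAST tabular output.
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def fasta_lengths(lines):
    """Map each sequence ID (first word of the header line) to its length."""
    lengths = {}
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            words = line[1:].split()
            current = words[0] if words else ""
            lengths[current] = 0
        elif current is not None:
            lengths[current] += len(line)
    return lengths


def parse_tabular(lines):
    """Yield one dict per hit, converting the numeric fields used for filtering."""
    # Skip blank lines and the '#' comment lines some tabular variants add.
    rows = (l for l in lines if l.strip() and not l.startswith("#"))
    for rec in csv.DictReader(rows, fieldnames=COLUMNS, delimiter="\t"):
        rec["pident"] = float(rec["pident"])
        rec["length"] = int(rec["length"])
        rec["evalue"] = float(rec["evalue"])
        rec["bitscore"] = float(rec["bitscore"])
        yield rec
```

Both functions take any iterable of lines, so an open file handle works. Query coverage for a hit is then `rec["length"] / lengths[rec["qseqid"]]`. With BLAST+ you can avoid the FASTA lookup altogether by requesting the `qlen` and `slen` columns in a custom `-outfmt` string, in which case `COLUMNS` would need to be extended to match.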

8.7 years ago by
Iain260 wrote:

The CD-HIT programme would be very useful for this task.


I tried to use CD-HIT for such a task some years ago, and it didn't work as well as BLAST. The clustering would miss some hits in an m:n orthology situation.

Reply written 8.7 years ago by Michael Kuhn (5.0k)

Interesting. I don't think I would expect it to be particularly great at orthology detection.

That being said, it is a useful tool for identifying a) identical sequences and b) very similar sequences very quickly, without having to apply extra criteria as one would with a BLAST-like approach.
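
For case a) alone, strictly identical sequences can also be counted without any alignment or clustering tool, simply by comparing the sequence strings. This is a minimal sketch of that idea, not of how CD-HIT works. The function names are hypothetical, and it treats sequences as identical only if they match character for character after upper-casing (so a trailing `*` stop symbol would count as a difference).

```python
from itertools import combinations


def read_sequences(lines):
    """Return the set of distinct sequences found in a FASTA stream."""
    seqs, chunks = set(), []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if chunks:
                seqs.add("".join(chunks).upper())
            chunks = []
        elif line:
            chunks.append(line)
    if chunks:
        seqs.add("".join(chunks).upper())
    return seqs


def shared_counts(datasets):
    """Count identical sequences shared by each pair of named datasets.

    `datasets` maps a dataset name to its set of sequences.
    """
    return {(a, b): len(datasets[a] & datasets[b])
            for a, b in combinations(sorted(datasets), 2)}
```

Because each dataset is reduced to a set, duplicate sequences within one dataset are counted once. The nearly identical and highly similar cases still need BLAST or a clustering tool.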

Reply written 8.7 years ago by Iain (260)