Question

How To Compare Two Proteome Sets At 30% Similarity?

2

12.1 years ago

Sydney ▴ 20

I wish to compare the proteomes of two bacterial species and identify homologs at 30% similarity. I tried to use CD-HIT-2D, but the lowest similarity is 0.4, whereas what I want is 0.3. Does anyone know how to do it? Thanks.

proteomics • 4.7k views

0

Unclear question. Are you saying the lowest similarity found by CD-HIT-2D was 0.4, or that the lowest available similarity threshold for CD-HIT-2D is 0.4?

ADD REPLY • link 12.1 years ago by Neilfws 49k

0

Thanks everyone for the comments. Actually, I want to identify the proteins (probably the virulence factors) that exist only in the pathogens and not in the non-pathogens. I wonder whether I can compare the proteomes of the pathogens and purge the orthologs at 25% or 30%, so that in the end I get those proteins which exist in all the pathogens that I analyzed. Can anyone give me a better suggestion on how to do this? Thanks in advance.

ADD REPLY • link 12.1 years ago by Sydney ▴ 20

0

Have you considered MG-RAST or JGI's IMG/M? You can upload your datasets to these servers and get annotations from a variety of databases such as COG, KEGG, etc. You can then filter for the functions of interest. You could also do an all-vs-all BLAST and then annotate only the shared proteins.

ADD REPLY • link 12.1 years ago by Schrodinger'S Cat ▴ 210

score 4 · Answer 1 · 2012-03-29

To start with, I would completely forget about using CD-HIT for this purpose. The point of CD-HIT is that it is a very fast algorithm for finding highly similar sequences, and 30% similarity is not high similarity. Moreover, bacterial genomes are not so big that speed is your primary concern here.

The exact way to do it depends a bit on whether you are looking for global or local similarity scores. Assuming that you are interested in local alignments, I would use BLAST and filter the results by whatever combination of identity, similarity, alignment length, bit score, or E-value cutoff you desire.
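As a rough illustration of the filtering step, here is a minimal Python sketch. It assumes the BLAST results were written in the standard 12-column tabular format (`-outfmt 6`); the function name `filter_hits`, the cutoff values, and the example rows are all made up for illustration. Note that the default columns report percent identity (`pident`); filtering on percent similarity would need the `ppos` column added through a custom column list.

```python
import csv
import io

# Standard 12-column layout of BLAST tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def filter_hits(handle, min_identity=30.0, min_length=50, max_evalue=1e-3):
    """Yield hits passing all three cutoffs (cutoff values are illustrative)."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(COLUMNS, row))
        if (float(hit["pident"]) >= min_identity
                and int(hit["length"]) >= min_length
                and float(hit["evalue"]) <= max_evalue):
            yield hit


# Made-up example rows, for illustration only.
example = io.StringIO(
    "protA\tprotX\t34.5\t210\t130\t4\t1\t208\t3\t212\t2e-25\t98.2\n"
    "protB\tprotY\t27.0\t180\t125\t5\t10\t185\t7\t186\t1e-08\t51.6\n"
    "protC\tprotZ\t45.0\t30\t16\t0\t5\t34\t8\t37\t0.5\t22.3\n"
)
kept = [(h["qseqid"], h["sseqid"]) for h in filter_hits(example)]
print(kept)
```

With these example rows only the protA/protX pair survives: protB falls below the identity cutoff, and protC is too short and has too high an E-value. In practice you would pass an open file handle for your own results file instead of the `io.StringIO` object.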

score 4 · Answer 2 · 2012-03-29

4

12.1 years ago

Ahdf-Lell-Kocks ★ 1.6k

I would try jackhmmer from the HMMER 3.0 package. Something like:

~/hmmer3.0/jackhmmer proteome1.fasta proteome2.fasta

Ram · Answer 3 · 2012-03-29

The CD-HIT package contains a Perl script called PSI-CD-HIT for low identity cutoffs. This script uses BLAST as part of the clustering process to calculate similarities. It is not as fast as regular CD-HIT, mind. http://weizhong-lab.ucsd.edu/cd-hit/wiki/doku.php?id=cd-hit_user_guide#psi-cd-hit_clustering

However, I'm not sure this is the right tool for you. An all-vs-all BLAST of the two species can easily be done on most desktops and would probably be better.
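For what it's worth, a BLAST run of one proteome against the other with BLAST+ could look roughly like the sketch below. It only builds and prints the two command lines; the helper name `blast_commands`, the file names, the database name, and the E-value are placeholders I made up, and it assumes `makeblastdb` and `blastp` from BLAST+ are installed if you actually run the commands.

```python
import shlex


def blast_commands(query_fasta, subject_fasta, db_name, out_tsv, evalue=1e-3):
    """Return two argument lists: build a protein database, then search it."""
    make_db = ["makeblastdb", "-in", subject_fasta, "-dbtype", "prot",
               "-out", db_name]
    search = ["blastp", "-query", query_fasta, "-db", db_name,
              "-evalue", str(evalue), "-outfmt", "6", "-out", out_tsv]
    return make_db, search


# Hypothetical file names; substitute your own proteome FASTA files.
for cmd in blast_commands("proteome1.fasta", "proteome2.fasta",
                          "proteome2_db", "p1_vs_p2.tsv"):
    print(shlex.join(cmd))
    # To actually run it: subprocess.run(cmd, check=True)
```

This covers one direction only (proteome 1 as query, proteome 2 as database). For a full all-vs-all comparison, the same pair of commands could be run in the other direction as well.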

score 0 · Answer 4 · 2012-03-28

0

12.1 years ago

Chris ▴ 190

I would BLAST the first proteome against the second at some low E-value cutoff (< 1e-3). Then you could either take the similarity values from BLAST (local alignment), or run a global alignment on the significant hits to get global alignment similarities.
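To make the global option concrete, here is a small self-contained Python sketch of a Needleman-Wunsch global alignment that reports percent identity. It is a toy under stated assumptions: the scoring (match +1, mismatch -1, linear gap -2) is arbitrary rather than a proper substitution matrix such as BLOSUM62, identity is taken over all alignment columns including gaps, and the function name `global_identity` and the test sequences are made up. For real proteomes an established global aligner would be the safer choice.

```python
def global_identity(a, b, match=1, mismatch=-1, gap=-2):
    """Percent identity over a Needleman-Wunsch global alignment of a and b."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back, counting identical columns and total alignment columns.
    i, j, identical, columns = n, m, 0, 0
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                identical += int(a[i - 1] == b[j - 1])
                i, j = i - 1, j - 1
                columns += 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1  # gap in b
        else:
            j -= 1  # gap in a
        columns += 1
    return 100.0 * identical / columns if columns else 0.0


print(global_identity("MKTAYIAK", "MKTAYIAK"))  # identical sequences
print(global_identity("MKTAYIAK", "MKTAIAK"))   # one residue deleted
```

The second call aligns seven identical residues plus one gap column, so the identity is 7 out of 8 columns. Whether gaps belong in the denominator is a choice; other definitions (for example, dividing by the shorter sequence length) give different percentages.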
