How To Identify Proteins Present Only In Pathogens But Not In Non-Pathogens (Virulence Factors)?
12.8 years ago
nicole ▴ 20

I used PSI-CD-HIT-2D to compare the proteome of pathogen A with that of pathogen B, from the same genus, at 30% identity. The matched protein sequences (homologs above 30% identity) were then compared with pathogen C from the same genus to identify proteins that are present in all three pathogens. The results were then compared with non-pathogens (D, E, F) from the same genus at 30% identity to identify proteins present in all three pathogens but absent from the non-pathogens (candidate virulence factors). Proteins that are consistently present in pathogens but not in non-pathogens are likely to play an important role in the pathogenic lifestyle.

I then searched the proteins obtained above (the potential virulence factors) against the nr protein database using BLAST. However, I found hits to the same proteins in the non-pathogens (D, E, F) that I had included in the analysis, with E-values below 1e-05 and identity above 30%. So the CD-HIT and BLAST results conflict. I have no idea how this can happen and I urgently need a solution. Can anyone help? Thanks in advance.

blast • 3.7k views
12.8 years ago
Bill Pearson ★ 1.0k

You have encountered a common problem that occurs when moving from a consensus-based search strategy (CD-HIT) to a pairwise search strategy (BLASTP). In general, consensus-based strategies are designed to capture deep evolutionary relationships with a single model. But sometimes two sequences will be closely related to each other (> 50% identity, E() < 1e-40), yet one of them can be detected by the consensus model (though it is perhaps distant from its "center") while the other cannot. (Think of two leaves on nearby branches of a tree, one of which is close enough to the root to be found with CD-HIT, while the other is just beyond detection.) The same problem occurs with PFAM.

One solution would be to use pairwise searches, rather than CD-HIT. Use BLASTP to find the proteins that are shared by the pathogenic organisms but not by non-pathogens (or use ggsearch, which I think will be better suited to this problem).

And forget about 30% identity. There will be many proteins with E()-values < 1e-10 that are clearly homologous but less than 30% identical. E-values are much more reliable indicators of homology than percent identity.
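
As a rough illustration of the pairwise approach, the sketch below reads BLASTP tabular output (`-outfmt 6`) for pathogen A searched against the other proteomes, and keeps the queries that hit every other pathogen and no non-pathogen below an E-value cutoff. Percent identity is not used at all. The genome labels, the `genome|protein` subject ID convention, the 1e-10 cutoff, and the `pathogen_specific` helper are all assumptions made for the example and are not part of BLAST.

```python
from collections import defaultdict


def pathogen_specific(lines, pathogens, non_pathogens, max_evalue=1e-10):
    """Return queries with hits in every pathogen and in no non-pathogen.

    `lines` are rows of BLAST tabular output (-outfmt 6), whose default
    columns are: qseqid sseqid pident length mismatch gapopen qstart qend
    sstart send evalue bitscore.
    Assumption: each subject ID is written as "<genome>|<protein>".
    """
    genomes_hit = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        query, subject, evalue = fields[0], fields[1], float(fields[10])
        if evalue <= max_evalue:
            # Record which genome this significant hit came from.
            genomes_hit[query].add(subject.split("|", 1)[0])

    required, excluded = set(pathogens), set(non_pathogens)
    return sorted(
        query
        for query, hit in genomes_hit.items()
        if required <= hit and not (hit & excluded)
    )


# Toy rows (hypothetical IDs and values); A_p1 is below 30% identity.
rows = [
    "A_p1\tB|x1\t28.0\t300\t0\t0\t1\t300\t1\t300\t1e-40\t150",
    "A_p1\tC|y1\t27.5\t300\t0\t0\t1\t300\t1\t300\t1e-35\t140",
    "A_p2\tB|x2\t45.0\t200\t0\t0\t1\t200\t1\t200\t1e-50\t180",
    "A_p2\tC|y2\t44.0\t200\t0\t0\t1\t200\t1\t200\t1e-48\t175",
    "A_p2\tD|z2\t40.0\t200\t0\t0\t1\t200\t1\t200\t1e-30\t120",
]
result = pathogen_specific(rows, pathogens=["B", "C"], non_pathogens=["D", "E", "F"])
print(result)
```

With these toy rows only `A_p1` is reported: `A_p2` also has a significant hit in non-pathogen D. In practice it may be safer to exclude on a looser E-value than the one used to require presence, so that weak non-pathogen homologs are not overlooked.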


Thanks @Bill Pearson. I've tried standalone BLAST. I merged the pathogens (B, C, D) and non-pathogens (E, F, G, H) into a single file and made it into a database (I named it mydb) with makeblastdb. I then searched pathogen A against mydb with an E-value cutoff of 1e-20, so that the hits below that cutoff group into something like clusters of protein families. Can this method work? How can I pick out the families that contain only proteins from the pathogens (A, B, C, D)? Thanks for your help.
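
For reference, the workflow described in this comment could be scripted roughly as follows. This is a minimal sketch that assumes BLAST+ is installed and on the PATH. The file names (`BCD_EFGH.faa`, `A.faa`, `A_vs_mydb.tsv`) and the `build_commands` and `run` helpers are hypothetical. The `makeblastdb` and `blastp` options shown are standard BLAST+ options.

```python
import subprocess


def build_commands(db_fasta, query_fasta, db_name="mydb",
                   out_table="A_vs_mydb.tsv", evalue="1e-20"):
    """Build argument lists for makeblastdb and blastp (BLAST+)."""
    # Format the merged proteomes as a protein database.
    make_db = ["makeblastdb", "-in", db_fasta, "-dbtype", "prot", "-out", db_name]
    # Search the query proteome and write tabular output (-outfmt 6).
    search = ["blastp", "-query", query_fasta, "-db", db_name,
              "-evalue", evalue, "-outfmt", "6", "-out", out_table]
    return make_db, search


def run(commands):
    """Run each command in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


make_db, search = build_commands("BCD_EFGH.faa", "A.faa")
print(" ".join(make_db))
print(" ".join(search))
# run([make_db, search])  # uncomment on a machine where BLAST+ is installed
```

The tabular output has one row per hit, so if each sequence ID carries its genome of origin, the hits for each query can be grouped by genome afterwards.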

12.8 years ago

CD-HIT uses heuristics to find clusters of proteins with high similarity. (The name stands for "Cluster Database at High Identity with Tolerance".) So a threshold of 30% is well outside the intended parameter range. At such a low identity threshold the heuristic will miss many pairs that have >30% identity.

Thus, you'll have to rely on the BLAST results. Note that E-values depend on the size of the database, so instead of an E-value cutoff you may want to use a bit score cutoff. Bit scores have a meaning that is independent of the number of sequences in the database.
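
To make the database-size dependence concrete, here is a small sketch based on the standard relation E = m · n · 2^(-S'), where S' is the bit score, m the query length and n the total database length in residues. BLAST itself uses effective (edge-corrected) lengths, so reported E-values will differ somewhat. The lengths and the `expected_hits` helper below are made up for illustration.

```python
def expected_hits(bit_score, query_len, db_len):
    """Approximate E-value from a bit score: E = m * n * 2**(-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)


# The same alignment (bit score 50, query of 300 residues) searched
# against a small and a 100x larger database (hypothetical sizes).
small = expected_hits(50, 300, 1_000_000)
large = expected_hits(50, 300, 100_000_000)
print(f"small db: E = {small:.2e}")
print(f"large db: E = {large:.2e}")
```

The bit score is identical in both searches, but the E-value is 100 times larger against the larger database, so the same hit passes a 1e-05 cutoff in the small search and fails it in the large one.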


Thanks Michael. I've tried standalone BLAST. I merged the pathogens (B, C, D) and non-pathogens (E, F, G, H) into a single file and made it into a database (I named it mydb) with makeblastdb. I then searched pathogen A against mydb with an E-value cutoff of 1e-20, so that the hits below that cutoff group into something like clusters of protein families. Can this method work? How can I pick out the families that contain only proteins from the pathogens (A, B, C, D)? Thanks for your help.

8.0 years ago

Hello Nicole,

I would like to know of any databases or lists of bacteria that are non-pathogenic to humans, because all the databases I have found list only bacteria that are pathogenic to humans.

