Question: How To Identify Proteins Present Only In Pathogens But Not In Non-Pathogens (Virulence Factors)?
8.1 years ago by
nicole20
nicole20 wrote:

I used PSI-CD-HIT-2D to compare the proteome of pathogen A with that of pathogen B, from the same genus, at 30% identity. The matched protein sequences (homologs above 30% identity) were then compared with pathogen C, also from the same genus, to identify proteins present in all three pathogens. The results were then compared with non-pathogens (D, E, F) from the same genus at 30% identity to identify proteins present in all three pathogens but absent from the non-pathogens (candidate virulence factors). Proteins that are consistently present in pathogens but not in non-pathogens are likely to play an important role in the pathogenic lifestyle.

I then searched the proteins obtained above (potential virulence factors) against the nr protein database using BLAST. However, I found hits to the same proteins from the non-pathogens (D, E, F) included in my analysis, with E-values lower than 1e-05 and identity above 30%. The CD-HIT and BLAST results conflict. I have no idea how this can happen and I urgently need a solution. Can anyone help? Thanks in advance.

blast • 2.6k views
modified 3.3 years ago by priyankashrivastava410 • written 8.1 years ago by nicole20
8.1 years ago by
Bill Pearson860
Bill Pearson860 wrote:

You have encountered a common problem that occurs when moving from a consensus-based search strategy (CD-HIT) to a pairwise search strategy (BLASTP). In general, consensus-based strategies are designed to capture deep evolutionary relationships with a single model. But sometimes two sequences are closely related to each other (> 50% identity, E() < 1e-40), yet one can be detected by the consensus model (though it is perhaps distant from its "center") while the other cannot. (Think of two leaves on nearby branches of a tree, one of which is close enough to the root to be found with CD-HIT, while the other is just beyond detection.) The same problem occurs with PFAM.

One solution would be to use pairwise searches, rather than CD-HIT. Use BLASTP to find the proteins that are shared by the pathogenic organisms but not by non-pathogens (or use ggsearch, which I think will be better suited to this problem).
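As a rough illustration of the pairwise approach, the sketch below takes BLASTP hits of pathogen A's proteins against the other genomes and keeps the queries that hit every other pathogen and no non-pathogen. The protein IDs, the `organism_of` lookup, and the single 1e-10 cutoff are all made up for the example; real data would need its own ID-to-organism mapping and a cutoff chosen for the data set.

```python
from collections import defaultdict


def pathogen_specific(hits, organism_of, pathogens, non_pathogens, max_evalue=1e-10):
    """Return query proteins with a hit in every pathogen and in no non-pathogen.

    hits:          iterable of (query_id, subject_id, evalue) tuples
    organism_of:   dict mapping subject_id -> organism label (hypothetical)
    pathogens:     set of organism labels that must all be hit
    non_pathogens: set of organism labels that must not be hit
    """
    found_in = defaultdict(set)
    queries = set()
    for query, subject, evalue in hits:
        queries.add(query)
        if evalue <= max_evalue:
            found_in[query].add(organism_of[subject])
    return sorted(
        q for q in queries
        if pathogens <= found_in[q] and not (found_in[q] & non_pathogens)
    )


# Invented hits for three proteins of pathogen A.
hits = [
    ("A_p1", "B_x1", 1e-50), ("A_p1", "C_x9", 1e-30),
    ("A_p2", "B_x2", 1e-40), ("A_p2", "C_x3", 1e-45), ("A_p2", "E_y1", 1e-25),
    ("A_p3", "B_x7", 1e-60),
]
organism_of = {"B_x1": "B", "C_x9": "C", "B_x2": "B",
               "C_x3": "C", "E_y1": "E", "B_x7": "B"}

candidates = pathogen_specific(hits, organism_of, {"B", "C"}, {"D", "E", "F"})
print(candidates)
```

Here `A_p2` is dropped because it also hits non-pathogen E, and `A_p3` is dropped because it has no hit in pathogen C.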

And forget about 30% identity. There will be many proteins with E()-values < 1e-10 that are clearly homologous but less than 30% identical. E-values are much more reliable indicators of homology than percent identity.
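To make the difference concrete, here is a small sketch that filters BLAST tabular output (the default 12-column `-outfmt 6` layout) once by percent identity and once by E-value. The two hit lines are invented for illustration: a long alignment at 27.5% identity with a very low E-value, and a short alignment at 34% identity with a poor E-value.

```python
def parse_outfmt6(lines):
    """Yield (qseqid, sseqid, pident, evalue, bitscore) from BLAST tabular lines.

    Assumes the default -outfmt 6 column order:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
    """
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        yield fields[0], fields[1], float(fields[2]), float(fields[10]), float(fields[11])


# Invented example hits, not real BLAST output.
example = [
    "A_p1\tB_x1\t27.5\t310\t215\t4\t1\t305\t4\t310\t2e-28\t112",
    "A_p1\tD_y4\t34.0\t50\t33\t0\t10\t59\t20\t69\t0.5\t28.1",
]

hits = list(parse_outfmt6(example))
by_identity = [h for h in hits if h[2] >= 30.0]   # percent-identity cutoff
by_evalue = [h for h in hits if h[3] <= 1e-10]    # E-value cutoff

print("kept by identity:", [h[1] for h in by_identity])
print("kept by E-value: ", [h[1] for h in by_evalue])
```

With these made-up numbers the identity cutoff keeps only the short, non-significant match and discards the significant one, while the E-value cutoff does the opposite.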

written 8.1 years ago by Bill Pearson860

Thanks @Bill Pearson. I've tried standalone BLAST. I merged the pathogens (B, C, D) and non-pathogens (E, F, G, H) into a single file and built a database from it (named mydb) using makeblastdb. I then searched pathogen A against mydb at an E-value of 1e-20, so that hits below that E-value can be grouped into something like protein families. Can this method work? How can I pick out the families that contain only proteins from the pathogens (A, B, C, D)? Thanks for your help.
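For reference, the workflow described above could be written down roughly as follows. This is only a sketch: the file names `others.faa`, `pathogen_A.faa` and `A_vs_mydb.tsv` are hypothetical, and the script only builds and prints the `makeblastdb` and `blastp` command lines rather than running them.

```python
import shlex


def makeblastdb_cmd(fasta, db_name):
    """Command to build a protein BLAST database from a merged FASTA file."""
    return ["makeblastdb", "-in", fasta, "-dbtype", "prot", "-out", db_name]


def blastp_cmd(query, db_name, out_path, evalue=1e-20):
    """Command to search a query proteome against the database, tabular output."""
    return ["blastp", "-query", query, "-db", db_name,
            "-evalue", str(evalue), "-outfmt", "6", "-out", out_path]


# Hypothetical file names; "mydb" is the database name used in the post.
build = makeblastdb_cmd("others.faa", "mydb")
search = blastp_cmd("pathogen_A.faa", "mydb", "A_vs_mydb.tsv")

print(shlex.join(build))
print(shlex.join(search))

# To actually run them (requires BLAST+ to be installed):
#   import subprocess
#   subprocess.run(build, check=True)
#   subprocess.run(search, check=True)
```

Because all genomes end up in one database, each hit can only be assigned to an organism if the sequence IDs (or a separate lookup table) record which genome a protein came from.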

written 8.1 years ago by nicole20
8.1 years ago by
Michael Kuhn5.0k
EMBL Heidelberg
Michael Kuhn5.0k wrote:

CD-HIT uses heuristics to find clusters of proteins with high similarity. (The name stands for "Cluster Database at High Identity with Tolerance".) So a threshold of 30% is well outside the intended parameter range. At such a low identity threshold the heuristic will miss many pairs that have >30% identity.

Thus, you'll have to rely on the BLAST results. Note that the e-values are dependent on the size of the database, so perhaps instead of an e-value cutoff you want to use a bitscore cutoff. Bitscores have a meaning independent of the number of genes that are in the database.
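The database-size dependence can be seen from the usual relation between bit score and E-value, E = m * n * 2^(-S'), where m is the query length, n the total database length and S' the bit score. The sketch below uses that simplified formula (it ignores the edge-effect corrections BLAST applies) with made-up lengths to show the same bit score giving different E-values in databases of different size.

```python
def evalue_from_bitscore(bitscore, query_len, db_len):
    """Expected number of chance hits with at least this bit score.

    Simplified form E = m * n * 2**(-S'); BLAST's length corrections are ignored.
    """
    return query_len * db_len * 2.0 ** (-bitscore)


# Made-up sizes: a 300-residue query against a small and a large database.
small_db = evalue_from_bitscore(50.0, query_len=300, db_len=1_000_000)
large_db = evalue_from_bitscore(50.0, query_len=300, db_len=1_000_000_000)

print(f"small database: E = {small_db:.2e}")
print(f"large database: E = {large_db:.2e}")
print(f"ratio: {large_db / small_db:.0f}")
```

The bit score is identical in both cases, but the E-value grows in proportion to the database size, which is why a fixed E-value cutoff means different things for a handful of proteomes and for nr.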

written 8.1 years ago by Michael Kuhn5.0k

Thanks Michael. I've tried standalone BLAST. I merged the pathogens (B, C, D) and non-pathogens (E, F, G, H) into a single file and built a database from it (named mydb) using makeblastdb. I then searched pathogen A against mydb at an E-value of 1e-20, so that hits below that E-value can be grouped into something like protein families. Can this method work? How can I pick out the families that contain only proteins from the pathogens (A, B, C, D)? Thanks for your help.

written 8.1 years ago by nicole20
3.3 years ago by
priyankashrivastava410 wrote:

Hello Nicole,

I would like to know of any databases or lists of bacteria that are non-pathogenic to humans, because all the databases I can find list only human-pathogenic bacteria.

written 3.3 years ago by priyankashrivastava410