I used PSI-CD-HIT-2D to compare the proteome of pathogen A against that of pathogen B, from the same genus, at a 30% identity threshold. The matched protein sequences (homologs above 30% identity) were then compared against pathogen C, also from the same genus, to identify proteins that are present in all three pathogens. These results were then compared against three non-pathogens (D, E, F) from the same genus at 30% identity, to identify proteins that are present in all three pathogens but absent from the non-pathogens (potential virulence factors). Proteins consistently present in pathogens but not in non-pathogens are likely to play an important role in the pathogenic lifestyle.
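The filtering logic described above amounts to a few set operations. Here is a minimal sketch of it in Python, assuming the PSI-CD-HIT-2D results have already been parsed into sets of protein IDs. The function name `pathogen_specific`, the `homologs_in` mapping, and the protein IDs are all made up for illustration and are not part of any tool's output format.

```python
def pathogen_specific(a_proteins, homologs_in, pathogens, non_pathogens):
    """Return proteins of A with a homolog in every other pathogen
    and in none of the non-pathogens.

    homologs_in is a hypothetical mapping: organism label -> set of A's
    protein IDs that matched that organism at the chosen identity cutoff.
    """
    core = set(a_proteins)
    # Keep only proteins that have a homolog in every other pathogen.
    for org in pathogens:
        core &= homologs_in[org]
    # Drop any protein that has a homolog in at least one non-pathogen.
    for org in non_pathogens:
        core -= homologs_in[org]
    return core


# Toy data (invented IDs), standing in for parsed clustering results.
a_proteins = {"A1", "A2", "A3", "A4"}
homologs_in = {
    "B": {"A1", "A2", "A3"},
    "C": {"A1", "A2", "A4"},
    "D": {"A2"},
    "E": set(),
    "F": set(),
}

candidates = pathogen_specific(
    a_proteins, homologs_in, pathogens=["B", "C"], non_pathogens=["D", "E", "F"]
)
print(sorted(candidates))
```

In this toy case only `A1` survives: `A3` and `A4` are each missing from one pathogen, and `A2` has a match in non-pathogen D.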
I then searched the proteins obtained above (the potential virulence factors) against the nr protein database using BLAST. However, some of these proteins also have hits in the same non-pathogens (D, E, F) that I included in the analysis, with E-values below 1e-05 and identity above 30%. So the CD-HIT and BLAST results conflict with each other. I have no idea how this can happen and I urgently need a solution. Can anyone help me? Thanks in advance.