Dear all,
As the title says, could you please give me some suggestions about the BLASTP criteria for identifying paralogous and orthologous genes among a few species? The species I am analyzing do not have much sequence data in NCBI, but our lab recently generated HT-seq data for them.
I found that an E-value cutoff of 1e-5 is not strict enough for identifying paralogs or orthologs, so I think the criteria need to be tightened. I found a paper (Bioinformation. 2011; 6(1): 31) that used >60% sequence identity and >80% alignment length, but I am not certain whether that is a general rule.
Any of your answers will be highly appreciated! THANKS!
Thanks a lot joe.cornish826.
The key difference between what that paper did and what best reciprocal BLAST hits (BRBH) does is that only BRBH can distinguish between paralogs and orthologs. Blasting in just one direction only allows you to identify homologs.
Also, there are many other tools out there that perform ortholog detection/mining using a variety of approaches. Some are still BRBH at the core but use additional metrics or improve the ease of use/analysis of results.
http://www.biomedcentral.com/1471-2105/12/11
The paper for this tool describes the BRBH process in more detail and does a good job of explaining the different relationships that proteins/genes/etc. can have.
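To make the reciprocal step concrete, here is a minimal sketch in Python. It is not the method from either paper; it assumes both searches were saved in BLAST's default tabular layout (`-outfmt 6`, where column 3 is percent identity, column 11 the E-value and column 12 the bit score), and it ranks hits by bit score. The function names, thresholds and the toy hit tables are made up for illustration.

```python
import csv


def best_hits(lines, max_evalue=1e-5, min_identity=0.0):
    """Return {query: (subject, bitscore)}, keeping the top bit-score hit per query.

    `lines` is any iterable of tab-separated BLAST rows (default -outfmt 6 columns).
    """
    best = {}
    for row in csv.reader(lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, subject = row[0], row[1]
        identity, evalue, bitscore = float(row[2]), float(row[10]), float(row[11])
        if evalue > max_evalue or identity < min_identity:
            continue  # hit fails the (assumed) thresholds
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return best


def reciprocal_best_hits(a_vs_b, b_vs_a, **filters):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a."""
    fwd = best_hits(a_vs_b, **filters)
    rev = best_hits(b_vs_a, **filters)
    return sorted(
        (a, b) for a, (b, _) in fwd.items() if rev.get(b, (None, 0.0))[0] == a
    )


# Toy data (hypothetical IDs): A3 hits B1, but B1's best hit is A1,
# so A3/B1 is only a homolog pair, not a reciprocal best hit.
A_VS_B = [
    "A1\tB1\t85.0\t300\t45\t0\t1\t300\t1\t300\t1e-150\t520",
    "A1\tB2\t62.0\t290\t110\t2\t5\t294\t3\t292\t1e-80\t310",
    "A2\tB2\t70.0\t280\t84\t1\t1\t280\t1\t280\t1e-100\t400",
    "A3\tB1\t55.0\t250\t112\t3\t1\t250\t10\t259\t1e-40\t200",
]
B_VS_A = [
    "B1\tA1\t85.0\t300\t45\t0\t1\t300\t1\t300\t1e-150\t520",
    "B1\tA3\t55.0\t250\t112\t3\t10\t259\t1\t250\t1e-40\t200",
    "B2\tA2\t70.0\t280\t84\t1\t1\t280\t1\t280\t1e-100\t400",
    "B2\tA1\t62.0\t290\t110\t2\t3\t292\t5\t294\t1e-80\t310",
]

rbh_pairs = reciprocal_best_hits(A_VS_B, B_VS_A)
```

With real data you would pass open file handles instead of the lists. An alignment-length filter like the >80% one mentioned in the question would need the sequence lengths, which are not in the default columns, so that part is left out here.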
In general, there's no solid rule on what settings to use. The best approach would be to develop a benchmark to characterize the sensitivity/specificity using a set of known items, maybe two well-known species in genera/families close to your organisms of interest. You can (and should anyway) manually spot check individual results and compile some stats to get a feel for what is going on.
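As a rough sketch of what such a benchmark could compute, assuming you have a curated set of known ortholog pairs and a set of pairs known not to be orthologs (both sets and all names below are hypothetical placeholders):

```python
def benchmark(predicted, known_orthologs, known_non_orthologs):
    """Score predicted ortholog pairs against curated positive and negative sets."""
    predicted = set(predicted)
    tp = len(predicted & known_orthologs)       # true orthologs that were called
    fn = len(known_orthologs - predicted)       # true orthologs that were missed
    fp = len(predicted & known_non_orthologs)   # non-orthologs wrongly called
    tn = len(known_non_orthologs - predicted)   # non-orthologs correctly left out
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return {"tp": tp, "fn": fn, "fp": fp, "tn": tn,
            "sensitivity": sensitivity, "specificity": specificity}


# Toy example with made-up gene IDs.
stats = benchmark(
    predicted=[("A1", "B1"), ("A3", "B9")],
    known_orthologs={("A1", "B1"), ("A2", "B2")},
    known_non_orthologs={("A3", "B9"), ("A4", "B4")},
)
```

Running this over a few threshold combinations (E-value, identity, and so on) gives you a table you can use to pick settings, rather than relying on a single published cutoff.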
A simpler approach would be to find some publications that have done this analysis on species close to yours and get a consensus for the settings. However, be sure that they're actually doing a true ortholog analysis!