Hello! I'm trying to define one-to-one ortholog gene sets from the CDS data of various mammals in NCBI RefSeq. My goal is an ortholog matrix that contains only one-to-one orthologs, which I would then align and analyze with evolutionary programs to find interesting clade-specific genetic variants. I don't want to include one-to-many or many-to-many relationships because they can cause problems in downstream steps such as dN/dS analysis.
Unfortunately, as far as I have searched, several of my species are not included in the databases of Inparanoid, OrthoMCL or other ortholog-finding programs. Most of these programs only serve precomputed results for the species already in their databases. So I think I have no option but to run a reciprocal best hit (RBH) BLAST myself. I'm going to keep only the hits with a reciprocally exclusive relationship, which means:
Result of the 1st (forward) BLAST:
Species A (query)    Species B (database)
gene A1              gene B1
gene A2              gene B2, gene B3
gene A3              gene B3
gene A4              gene B4, gene B5
Result of the 2nd (reverse) BLAST:
Species B (query)    Species A (database)
gene B1              gene A1
gene B2              gene A2
gene B3              gene A2, gene A3
gene B4              gene A4, gene A5
In this case, only "A1-B1" would be in my result matrix, because:
- Gene A2 matches both gene B2 and gene B3 (one-to-many).
- Gene A3 matches only gene B3, but gene B3 also matches gene A2 in the reverse BLAST (many-to-one).
- Gene A4 matches genes B4 and B5, and gene B4 in turn matches genes A4 and A5 (many-to-many).
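
To make the filtering rule concrete, here is a minimal Python sketch of what I have in mind. It is only an illustration, not a tested pipeline: the function names `hits_from_tabular` and `exclusive_reciprocal_pairs` and the `max_evalue` cutoff of 1e-5 are my own assumptions, and the parser assumes BLAST tabular output with the default `-outfmt 6` column order (query ID in column 1, subject ID in column 2, e-value in column 11).

```python
from collections import defaultdict


def hits_from_tabular(lines, max_evalue=1e-5):
    """Collect query -> {subjects} from BLAST tabular lines.

    Assumes the default -outfmt 6 column order; max_evalue is an
    arbitrary, hypothetical cutoff chosen for illustration.
    """
    hits = defaultdict(set)
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject, evalue = fields[0], fields[1], float(fields[10])
        if evalue <= max_evalue:
            hits[query].add(subject)
    return dict(hits)


def exclusive_reciprocal_pairs(forward, reverse):
    """Keep (a, b) only if a's sole hit is b AND b's sole hit is a."""
    pairs = []
    for a, b_hits in forward.items():
        if len(b_hits) != 1:
            continue  # one-to-many on the forward side
        (b,) = b_hits
        if reverse.get(b) == {a}:  # b must hit a and nothing else
            pairs.append((a, b))
    return sorted(pairs)


# The toy example from the tables above.
forward = {
    "A1": {"B1"},
    "A2": {"B2", "B3"},
    "A3": {"B3"},
    "A4": {"B4", "B5"},
}
reverse = {
    "B1": {"A1"},
    "B2": {"A2"},
    "B3": {"A2", "A3"},
    "B4": {"A4", "A5"},
}

print(exclusive_reciprocal_pairs(forward, reverse))  # [('A1', 'B1')]
```

Note that this keeps a pair only when each gene has exactly one hit passing the cutoff, which is stricter than the usual RBH definition where only the single best-scoring hit in each direction is compared.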
But after reading some posts here, I learned that RBH is not considered a proper method for defining one-to-one orthologs. I think this is because a BLAST search can return multiple hits per query. Of course, I can keep only the cases like "A1-B1". However, I suspect this criterion may be too strict for building an ortholog matrix, because it excludes every other result even when its statistics are acceptable, such as a sufficiently low e-value and a high bit score.
I cannot find any other suitable method or program to solve this problem, so I would like to ask about better approaches to defining one-to-one orthologs, using RBH or any other method. Or is running RBH as I described good enough to define orthologs "roughly"? If so, can I call the result a one-to-one ortholog matrix?
I'm quite new to bioinformatics, but I find it amazingly interesting. I'll be waiting for any replies, and please let me know if any sentence above is confusing because of my limited English. Thank you for reading such a long post!