Hi. I want to analyze the coding sequence (CDS) data of closely related vertebrates in NCBI database. Most of the species are in the same order. So I've been looking for the proper method to obtain ortholog gene sets from CDS data of these species and now I try RBH method (Reciprocal best hit) by BLAST.
1st BLAST: query-A species, database-B species 2nd BLAST: query-B species, database-A species
Then what I want to ask is that if there are two or more B genes with same highest bit score in the result of the first BLAST using the query of A, should I choose one of these B genes? For example,
species A B
gene1 a1 b1
gene2 a2 b2
gene3 a3 b3
gene4 a4 b4
and if the result of the first BLAST is...
query: a1 =BLAST=> hit: b1 (with highest bit score and identity)
query: a2 =BLAST=> hit: b1 (with highest bit score and identity)
query: a3 =BLAST=> hit: b3 (with highest bit score and identity)
query: a4 =BLAST=> hit: b2, b4 (with same and highest bit score and identity on b2 and b4, mutually)
If I choose one of these B genes, is there any proper index (like identity or e-value...)to compare the genes and choose the best one? Or, if it's ok to use all of them as the next query for the second BLAST, it seems easier for me. But then, can we know the N to N relationships between the gene of A and B without regarding the case like <query: a4="BLAST=>" hit:="" b2,="" b4=""> or <a1, a2="" and="" b1="" are="" on="" "many="" to="" one"="" relationship="">? (see below to see my thought)
- a1, a2 and b1 are on "many to one" relationship (-> not ortholog).
- a3 and b3 are on "one to one" ortholog relationship (-> ortholog).
- a4 and b2, b3 are on "one to many" ortholog relationship (-> not ortholog).
In my regards, it's impossible but then why RBH method is regarded as the good method even though it can't say the real N to N relationship between genes which is very important to find ortholog gene sets?
Sorry for the long question. I'm waiting for your help! Thanks!
Hi lieven.sterck. Thanks for your kind explanation! It helps me a lot, especially for understanding the limitation of RBH method and the definition of one-to-one orthologs. But there are still some questions remained. Could you help me some more, please?
I couldn't provide my reverse blast results because I've not performed the second blast. In order to do the reverse one, I should decide the next query genes. But It's very confusing for me how to treat the genes with same bit score after the first blast. What I confused is that if there are two (or more) B genes with same, highest bit scores (example==> query: a1 gene, hit1: b1 gene with bit score 750, hit2: b2 gene with bit score 750), I'm not sure if I can use both of them as the query for the second blast. Is it OK or should I remove one of the two genes by using any other methods?
I can't use Inparanoid or other ortholog defining programs because some of my species are not included in their databases. So I think I should keep using RBH.
Thanks again for your help!
Well, for the RBH approach I would advise to simply blast the complete set of A proteins against B and also B against A (no subset, just the complete fasta files). In a second step you simply parse the results of both blasts and select the proteins for which a1 - b1 and b1 - a1 . it's not more convenient it's also more correct as by using subsets (or filtering before blasting) you might bias the results.
You can nearly any orthology program also with "new" species, as certainly it is possible for inparanoid (give it a try)