Hi. I want to analyze coding sequence (CDS) data from closely related vertebrates in the NCBI database. Most of the species are in the same order. I have been looking for a proper method to obtain orthologous gene sets from the CDS data of these species, and I am now trying the RBH (reciprocal best hit) method with BLAST.
1st BLAST: query = species A, database = species B
2nd BLAST: query = species B, database = species A
My question is this: if two or more B genes share the same highest bit score in the first BLAST for a given A query, should I choose just one of these B genes? For example,
species A B
gene1 a1 b1
gene2 a2 b2
gene3 a3 b3
gene4 a4 b4
and suppose the result of the first BLAST is:
query: a1 =BLAST=> hit: b1 (with highest bit score and identity)
query: a2 =BLAST=> hit: b1 (with highest bit score and identity)
query: a3 =BLAST=> hit: b3 (with highest bit score and identity)
query: a4 =BLAST=> hit: b2, b4 (b2 and b4 tie for the highest bit score and identity)
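To make the tie case concrete, here is a minimal sketch of how I imagine collecting the top hits per query while keeping ties. It assumes the BLAST results have already been reduced to (query, subject, bit score) rows; the function name `top_hits_with_ties` is my own and not part of BLAST.

```python
from collections import defaultdict

def top_hits_with_ties(rows):
    """rows: iterable of (query_id, subject_id, bitscore) tuples.

    Returns {query_id: set of subject_ids sharing the highest bit score}.
    Hypothetical helper for illustration, not a BLAST feature.
    """
    best_score = {}
    best_subjects = defaultdict(set)
    for query, subject, bitscore in rows:
        if query not in best_score or bitscore > best_score[query]:
            # New highest score for this query: discard earlier, weaker hits.
            best_score[query] = bitscore
            best_subjects[query] = {subject}
        elif bitscore == best_score[query]:
            # Exact tie with the current best: keep both subjects.
            best_subjects[query].add(subject)
    return dict(best_subjects)
```

With this, a4 would map to the set {b2, b4} instead of a single arbitrarily chosen gene.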
If I should choose one of these B genes, is there a proper index (such as identity or e-value) for comparing them and choosing the best one? Alternatively, using all of them as queries for the second BLAST seems easier to me. But in that case, can we still determine the N-to-N relationships between the genes of A and B without considering cases like <query: a4 =BLAST=> hit: b2, b4> or <a1, a2 and b1 are in a "many to one" relationship>? (My thinking is below.)
- a1, a2 and b1 are in a "many to one" relationship (-> not ortholog).
- a3 and b3 are in a "one to one" relationship (-> ortholog).
- a4 and b2, b4 are in a "one to many" relationship (-> not ortholog).
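The classification above can be sketched in code as follows. This is only my rough understanding, assuming the top hits (with ties) are stored as a dict of sets; the names `classify_forward` and `reciprocal_pairs` are hypothetical, and the `reverse` argument stands for the second BLAST (B as query, A as database), which I have not run yet.

```python
from collections import defaultdict

# Top hits from the first BLAST (A as query, B as database), as in the example.
forward = {"a1": {"b1"}, "a2": {"b1"}, "a3": {"b3"}, "a4": {"b2", "b4"}}

def classify_forward(forward):
    """Label each A gene using only the first BLAST direction."""
    # Collect which A genes point at each B gene.
    incoming = defaultdict(set)
    for a, bs in forward.items():
        for b in bs:
            incoming[b].add(a)
    labels = {}
    for a, bs in forward.items():
        if len(bs) > 1:
            labels[a] = "one to many"
        elif len(incoming[next(iter(bs))]) > 1:
            labels[a] = "many to one"
        else:
            labels[a] = "one to one"
    return labels

def reciprocal_pairs(forward, reverse):
    """Pairs (a, b) where b is a top hit of a and a is a top hit of b."""
    return {(a, b)
            for a, bs in forward.items()
            for b in bs
            if a in reverse.get(b, set())}

labels = classify_forward(forward)
# labels == {"a1": "many to one", "a2": "many to one",
#            "a3": "one to one", "a4": "one to many"}
```

Keeping ties as sets means a4 is not forced onto a single B gene before the second BLAST, but this simple labelling does not distinguish a true "many to many" case.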
As I see it, that is impossible. But then why is RBH regarded as a good method, even though it cannot reveal the real N-to-N relationships between genes, which are very important for finding orthologous gene sets?
Sorry for the long question. I'm waiting for your help! Thanks!