USEARCH for orthologous gene identification
6.9 years ago
biolab ★ 1.3k

Hi everyone,

I am using USEARCH to identify orthologous genes between two species. I set an E-value cutoff of 1e-5 and kept only the top hit. However, I am suspicious of this in silico method; one example is shown below. Is something wrong with my method? Thanks a lot for any suggestions!

Query >LOC_Os07g04960_1
Score     Evalue   %Id    QueryLo-Hi(Un)   TargetLo-Hi(Un)  Target
228      3e-19   42%         39-149(2)       268-383(18)  AT5G15780_1

Qry  39 PAAAIPAVPAMPKPTIPTIVPAVTLPPIPAVPKVTLPPMPAIPTVPAVTMPPMPAVPAVPAVTLPPMPAVPTVPPNTVV 117
        | . ||     | | ||.| |  |||| | :|   .|||.| ||| |  |:| .| .|  |  ||||.| :||.||  |.
Tgt 268 PPSIIP-----PNPLIPSI-PTPTLPPNPLIPSPPSLPPIPLIPTPP--TLPTIPLLPTPPTPTLPPIPTIPTLPPLPVL 339

Qry 118 VPAAVV--PALP------KVALPPMAAVPNVP----MPFLAPPP 149
         |  :|  |.||       | |||.  .| :|    .| : | |
Tgt 340 PPVPIVNPPSLPPPPPSFPVPLPPVPGLPGIPPVPLIPGIPPAP 383

124 cols, 52 ids (41.9%), 21 gaps (16.9%), score 228.0 (92.4 bits), Evalue 2.5e-19

usearch orthologous blast • 1.8k views

In the example you posted, most of the alignment seems to fall in a low-complexity region. Also, USEARCH might not be a good choice if you want to identify significantly diverged sequences. From the manual:

Recommended identity ranges
USEARCH is effective at identities of ~50% and above for proteins and ~75% and above for nucleotides.

See if you can avoid the problem in the example you posted by using "seg" to mask repetitive and low-complexity regions instead of USEARCH's default masking method. Is there any other specific reason to doubt the accuracy of USEARCH?
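
To make the low-complexity point concrete, here is a rough Python sketch that measures how skewed the residue composition of the aligned query segment is. The sequence is copied from the alignment in the question with gaps removed. The function names are made up for illustration, and this is only a quick composition check, not a reimplementation of seg.

```python
import math
from collections import Counter

# Query residues 39-149 of LOC_Os07g04960_1, copied from the alignment in the
# question with the gap characters removed.
QUERY_SEGMENT = (
    "PAAAIPAVPAMPKPTIPTIVPAVTLPPIPAVPKVTLPPMPAIPTVPAVTMPPMPAVPAVPAVTLPPMPAVPTVPPNTVV"
    "VPAAVVPALPKVALPPMAAVPNVPMPFLAPPP"
)


def shannon_entropy(seq):
    """Shannon entropy (bits per residue) of the residue composition of seq."""
    if not seq:
        raise ValueError("empty sequence")
    total = len(seq)
    counts = Counter(seq)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def top_residue(seq):
    """Return the most common residue and the fraction of seq it accounts for."""
    if not seq:
        raise ValueError("empty sequence")
    residue, n = Counter(seq).most_common(1)[0]
    return residue, n / len(seq)


residue, fraction = top_residue(QUERY_SEGMENT)
print(f"most common residue: {residue} ({fraction:.0%} of the segment)")
print(f"entropy: {shannon_entropy(QUERY_SEGMENT):.2f} bits "
      f"(maximum for 20 amino acids: {math.log2(20):.2f})")
```

A segment dominated by one or two residues will score well below the maximum, which is the kind of region a masking step is meant to hide from the search.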


Hi Siva, thank you very much for your reply. Your comment is very helpful. I need to set an identity cutoff.
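
One way to apply such a cutoff after the search is to filter the hit table. The sketch below assumes the hits were written in the BLAST-style 12-column tabular format (the layout USEARCH writes with its -blast6out option). The 50% default comes from the manual excerpt quoted above, and the function names and the second example row are hypothetical.

```python
def parse_blast6(lines):
    """Yield one dict per hit from 12-column, tab-separated BLAST6-style rows."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        yield {
            "query": fields[0],
            "target": fields[1],
            "identity": float(fields[2]),  # percent identity
            "evalue": float(fields[10]),
            "bits": float(fields[11]),
        }


def best_hits(lines, min_identity=50.0, max_evalue=1e-5):
    """Keep, per query, the highest-scoring hit that passes both cutoffs."""
    best = {}
    for hit in parse_blast6(lines):
        if hit["identity"] < min_identity or hit["evalue"] > max_evalue:
            continue
        current = best.get(hit["query"])
        if current is None or hit["bits"] > current["bits"]:
            best[hit["query"]] = hit
    return best


EXAMPLE_ROWS = [
    # The hit from the question; mismatch and gap-open counts are derived
    # from the alignment shown there.
    "LOC_Os07g04960_1\tAT5G15780_1\t41.9\t124\t51\t6\t39\t149\t268\t383\t2.5e-19\t92.4",
    # Hypothetical second hit, made up purely for illustration.
    "hypothetical_query_2\thypothetical_target_2\t78.0\t200\t44\t0\t1\t200\t1\t200\t1e-80\t300.0",
]

for query, hit in best_hits(EXAMPLE_ROWS).items():
    print(query, hit["target"], hit["identity"], hit["evalue"])
```

With the 50% default, the hit from the question is discarded even though its E-value passes the 1e-5 cutoff.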


Depending on your species of interest, you may want to have a look at the orthologues from the Comparative Genomics analyses in Ensembl.
