Question

How to make a simple pipeline to identify the Reciprocal Best Hits for the subject genes

0

Entering edit mode

3.7 years ago

Teo ▴ 10

Hello every one, I'm trying to write a pipeline that identifies the RBH for the “subject” genes by means of Blast+. I have to download any two Nostoc bacterial genomes from NCBI and use one complete genome (the “reference”) and of the other only a small fraction e.g. one hundred genes (the “subject”).

Can you help me to suggest a workflow/code for this pipeline?

Reciprocal Hits Best • 3.9k views

ADD COMMENT • link updated 3.7 years ago by Dunois ★ 2.9k • written 3.7 years ago by Teo ▴ 10

5

Entering edit mode

Forget BLAST. Use MMseqs2's easy-rbh module. The command is a simple one-liner, and you get the results in a BLAST-style output table. Here's a link to the relevant documentation: https://github.com/soedinglab/mmseqs2/wiki#reciprocal-best-hit-using-mmseqs-rbh.

ADD REPLY • link 3.7 years ago by Dunois ★ 2.9k

0

Entering edit mode

Thanks you for your comment. However I have to start from two FASTA file: In a FASTA file there is the reference genome of a species, and in the second FASTA (subject) there is a small fraction of the genome of another species (this second FASTA would be a multi FASTA) . I would like to identifies the RBH for the subject genes. How can I start the pipeline from these data?

ADD REPLY • link 3.7 years ago by Teo ▴ 10

1

Entering edit mode

Assuming your genomic sequences from the two species are in FASTA files named species1.fq and species2.fq, all you have to do is run:

mmseqs easy-rbh species1.fq species2.fq results.tab tmpdir

from the command line. (I am assuming here that you have access to a UNIX-like operating system; like Ubuntu, for example.)

All your pairwise RBH results will be in the file results.tab.

Note: you can install MMseqs2 via conda into an environment named mmseqs2 like so:

conda create -n mmseqs2 -c bioconda mmseqs2

And then you can activate the environment with conda activate mmseqs2.

ADD REPLY • link 3.7 years ago by Dunois ★ 2.9k

1

Entering edit mode

Ok, I've another two questions:

i) I start from these two file FASTA (genome of a species and genes of another species). To obtain the right RBH is better to aline amino acidic sequences or nucleotide sequences? or to do both?

ii) i have obtain this output by means of MMseqs2's easy-rbh module:

enter image description here

Which is the RBH value?

ADD REPLY • link 3.7 years ago by Teo ▴ 10

1

Entering edit mode

i) So amino acid sequences are better conserved, so in theory, yes. But this really depends on how sensitive you want the search to be and what you're looking for. I think for the generic use-case of trying to identify orthologs between two genomes, protein sequences work really well.

ii) There is no such thing as an RBH value, at least not that I know of. It's just two searches, with the queries and targets swapping places for the second search. You just get an e-value like you get with a regular MMseqs2/BLAST search. I assume you ran the search without specifying an output format (via --format-output), so it should have defaulted to the BLAST tabular format. The output columns in this case correspond to query, target, fident, alnlen, mismatch, gapopen, qstart, qend, tstart, tend, evalue, bits. In this example here, the e-value appears to be 0.000E+00.

Edit: I see you added an image of the output. So yeah, all of those are reciprocal best hits to one another, and the e-values are in the second-to-last column.

ADD REPLY • link 3.7 years ago by Dunois ★ 2.9k

0

Entering edit mode

ahhh ok!!! I thought that RBH is a value. Instead it's a list of values of best hits. Ok perfect. But If I would like to use Blast+... is more complicate?

ADD REPLY • link 3.7 years ago by Teo ▴ 10

1

Entering edit mode

It's not just a list of best hits. It's a specific subset of the best hits you'd get from a normal one-way search.

When you do a one-way search you get something like this:

query    target    evalue
q1         t1           <>
q1         t2           <>
q2         t3           <>

And so forth.

If you swap the datasets around and search, you'd get something like this:

query    target    evalue
t1         q1           <>
t1         q4           <>
t3         q5           <>

Not how only q1 and t1 have found each other in both tables. Now if in addition they both also have the lowest e-values for one another, they would be each other's reciprocal best hits.

If you're interested in understanding this better, I think this here is a fairly good step-by-step explanation of RBHs and their relationship with orthology.

And well, with BLAST you'd have to manually do everything easy-rbh has automated for you, so yes it would be more complicated. And it'll also be significantly slower. Tools like MMseqs2 have been developed specifically with the objective of superseding BLAST in mind.

ADD REPLY • link 3.7 years ago by Dunois ★ 2.9k