Entering edit mode
3.7 years ago
Teo
▴
10
Hello every one, I'm trying to write a pipeline that identifies the RBH for the “subject” genes by means of Blast+. I have to download any two Nostoc bacterial genomes from NCBI and use one complete genome (the “reference”) and of the other only a small fraction e.g. one hundred genes (the “subject”).
Can you help me to suggest a workflow/code for this pipeline?
Forget
BLAST. UseMMseqs2'seasy-rbhmodule. The command is a simple one-liner, and you get the results in aBLAST-style output table. Here's a link to the relevant documentation: https://github.com/soedinglab/mmseqs2/wiki#reciprocal-best-hit-using-mmseqs-rbh.Thanks you for your comment. However I have to start from two FASTA file: In a FASTA file there is the reference genome of a species, and in the second FASTA (subject) there is a small fraction of the genome of another species (this second FASTA would be a multi FASTA) . I would like to identifies the RBH for the subject genes. How can I start the pipeline from these data?
Assuming your genomic sequences from the two species are in
FASTAfiles namedspecies1.fqandspecies2.fq, all you have to do is run:mmseqs easy-rbh species1.fq species2.fq results.tab tmpdirfrom the command line. (I am assuming here that you have access to a
UNIX-like operating system; likeUbuntu, for example.)All your pairwise RBH results will be in the file
results.tab.Note: you can install
MMseqs2viacondainto an environment namedmmseqs2like so:conda create -n mmseqs2 -c bioconda mmseqs2And then you can activate the environment with
conda activate mmseqs2.Ok, I've another two questions:
i) I start from these two file FASTA (genome of a species and genes of another species). To obtain the right RBH is better to aline amino acidic sequences or nucleotide sequences? or to do both?
ii) i have obtain this output by means of MMseqs2's easy-rbh module:
Which is the RBH value?
i) So amino acid sequences are better conserved, so in theory, yes. But this really depends on how sensitive you want the search to be and what you're looking for. I think for the generic use-case of trying to identify orthologs between two genomes, protein sequences work really well.
ii) There is no such thing as an
RBHvalue, at least not that I know of. It's just two searches, with the queries and targets swapping places for the second search. You just get ane-valuelike you get with a regularMMseqs2/BLASTsearch. I assume you ran the search without specifying an output format (via--format-output), so it should have defaulted to theBLASTtabular format. The output columns in this case correspond toquery, target, fident, alnlen, mismatch, gapopen, qstart, qend, tstart, tend, evalue, bits. In this example here, thee-valueappears to be0.000E+00.Edit: I see you added an image of the output. So yeah, all of those are reciprocal best hits to one another, and the
e-valuesare in the second-to-last column.ahhh ok!!! I thought that RBH is a value. Instead it's a list of values of best hits. Ok perfect. But If I would like to use Blast+... is more complicate?
It's not just a list of best hits. It's a specific subset of the best hits you'd get from a normal one-way search.
When you do a one-way search you get something like this:
And so forth.
If you swap the datasets around and search, you'd get something like this:
Not how only
q1andt1have found each other in both tables. Now if in addition they both also have the loweste-valuesfor one another, they would be each other's reciprocal best hits.If you're interested in understanding this better, I think this here is a fairly good step-by-step explanation of
RBHs and their relationship with orthology.And well, with
BLASTyou'd have to manually do everythingeasy-rbhhas automated for you, so yes it would be more complicated. And it'll also be significantly slower. Tools likeMMseqs2have been developed specifically with the objective of supersedingBLASTin mind.