Question: BLAST multiple databases against each other
8 weeks ago by
archana.bioinfo87100 wrote:

Hi,

I am trying to run a BLAST analysis. I have 9 different databases (each about 4 GB in total size), and I want to run BLAST against all 9 databases to find the best hit from each. I tried standalone BLAST, but I have not gotten any output because it is still running (~20 days).

Can anyone suggest another tool or software to solve this problem?

Any help is much appreciated.

Thanks

alignment • 298 views
ADD COMMENTlink modified 8 weeks ago by genomax58k • written 8 weeks ago by archana.bioinfo87100

What kind of BLAST search is being performed? If it is BLASTP or BLASTX, then you can use DIAMOND, as it is much faster than BLAST.

ADD REPLYlink written 8 weeks ago by Sej Modha3.8k

Thanks. I am trying to do blastn.

ADD REPLYlink written 8 weeks ago by archana.bioinfo87100

Give details. What sort of databases are you talking about? What is the query?

ADD REPLYlink written 8 weeks ago by Antonio R. Franco3.9k

They are all different miRNA databases. I am trying to find the best hits among them with respect to each other.

ADD REPLYlink written 8 weeks ago by archana.bioinfo87100

If you're running this on a single core, I'm not surprised.

But do provide details, as both Sej Modha and Antonio R. Franco point out.

ADD REPLYlink written 8 weeks ago by lieven.sterck3.1k

Thanks, but I am not using a single core. It is 100 cores on the server.

ADD REPLYlink written 8 weeks ago by archana.bioinfo87100

From a post below, we already determined that this is not the case: you are effectively running your job on a single core. To run multi-core BLAST you need to specify -num_threads on the command line (if you don't, you get the default, which is 1).
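
To make the point concrete: Slurm only reserves the cores, and BLAST uses them only when -num_threads is passed. Below is a minimal sketch, assuming the job is submitted with --cpus-per-task (Slurm then exports SLURM_CPUS_PER_TASK). The helper name blastn_cmd and the file names are made up for illustration, and the function only prints the command instead of running it.

```shell
# Hypothetical helper: print (not run) a blastn command whose thread count
# follows the CPUs Slurm allocated to the task, falling back to 1.
blastn_cmd() {
  local query="$1" db="$2"
  local threads="${SLURM_CPUS_PER_TASK:-1}"
  printf 'blastn -query %s -db %s -num_threads %s -outfmt 6\n' \
    "$query" "$db" "$threads"
}

# Inside a job submitted with "#SBATCH --cpus-per-task=10" this prints a
# command with -num_threads 10; outside Slurm it falls back to 1.
blastn_cmd 6_serial.fa 5_db
```

Keep in mind that a threaded blastn process runs on a single node, so requesting many nodes does not speed up one blastn call.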

ADD REPLYlink modified 8 weeks ago • written 8 weeks ago by lieven.sterck3.1k
8 weeks ago by
genomax58k
United States
genomax58k wrote:

You can use blastdb_aliastool to create a single blast database alias from all 9 databases, which you can then use for blast.

"I tried standalone BLAST, but I have not gotten any output because it is still running (~20 days)"

If you don't see any output from the search, it is likely that the process is hung. Is there an output file that is growing in size, or is it empty?
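
One way to check whether a file is growing is to compare its size at two points in time. This is a minimal sketch; the helper names file_size and is_growing are hypothetical, and the file name in the usage comment is only an example.

```shell
# Hypothetical helper: print the size of a file in bytes.
file_size() { wc -c < "$1" | tr -d ' '; }

# Hypothetical helper: succeed if FILE grew while we waited SECONDS seconds
# (default 60).
is_growing() {
  local file="$1" seconds="${2:-60}" before after
  before="$(file_size "$file")"
  sleep "$seconds"
  after="$(file_size "$file")"
  [ "$after" -gt "$before" ]
}

# Usage (example file name):
# is_growing 6_5db_hits.blastn 60 && echo "still writing" || echo "no new output"
```

Note that if the BLAST output is piped through sort before it is written, the file stays empty until blastn has finished, because sort has to read all of its input before it can print anything. In that setup an empty file alone does not prove that the process is hung.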

ADD COMMENTlink written 8 weeks ago by genomax58k

Thanks, but the output file is empty and I think it is still running.

ADD REPLYlink written 8 weeks ago by archana.bioinfo87100

If it has not produced a single line of output after 20 days (with 100 cores) it is unlikely that blast is running (or at least productively). You should stop and restart the search. How long are your query sequences and what exactly is in your target databases?
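
Before restarting the full search, a quick smoke test on a handful of query sequences can show whether BLAST produces any output at all. Here is a minimal sketch for cutting such a test file; the helper name first_records and the file names in the usage comment are assumptions for illustration.

```shell
# Hypothetical helper: print the first N records of a FASTA file.
first_records() {
  local n="$1" fasta="$2"
  # Count header lines; stop at header N+1, print everything before it.
  awk -v n="$n" '/^>/ { if (++count > n) exit } count >= 1' "$fasta"
}

# Usage (example names): write a 10-sequence test query, then run blastn on
# it interactively before submitting the full job.
# first_records 10 6_serial.fa > test_query.fa
# blastn -query test_query.fa -db 5_db -outfmt 6 -num_threads 4
```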

ADD REPLYlink modified 8 weeks ago • written 8 weeks ago by genomax58k

As others have pointed out, you should in any case get at least a few lines of output (the BLAST output header and some other info) within the first minutes of running BLAST. If you don't, then something is indeed wrong.

Can you post the blast cmdline you are trying to execute?

ADD REPLYlink modified 8 weeks ago • written 8 weeks ago by lieven.sterck3.1k

I am trying to run this command on server

#!/bin/bash
# Sample Slurm Script for use with OpenMPI on Plato
# Begin Slurm directives with #SBATCH
#SBATCH --job-name=multidb_test2_$1
#SBATCH --nodes=10
#SBATCH --tasks-per-node=10
#SBATCH --cpus-per-task=1
#SBATCH --time=480:00:00
#SBATCH --mem=16G
#SBATCH --output=multidb_test2.%J.out
#SBATCH --error=multidb_test2.%J.err

#for i in $(seq 1 8); do ../makeblastdb -in $i\_*.fa -input_type fasta -dbtype nucl -title $i\_db -out $i\_db; done

for i in $(seq 6 8); do ../blastn -query $i\_*serial.fa -db 5_db -qcov_hsp_perc 90 -outfmt 6 | sort -k1,1 -k12,12nr -k11,11n | sort -u -k1,1 --merge > $i\_5db_hits.blastn; done

Please let me know how to get it done perfectly.

Thanks

ADD REPLYlink modified 8 weeks ago by genomax58k • written 8 weeks ago by archana.bioinfo87100

I will have to look in detail, but at first sight the use of the * wildcard will not work in this command line. BLAST assumes a single FASTA file as input (both for makeblastdb and for blastn). Why are you using it? What do you hope to achieve with it?

Moreover, I think you are making this overcomplicated.

ADD REPLYlink written 8 weeks ago by lieven.sterck3.1k

Thanks for your reply. I used "*" in makeblastdb to make multiple databases at a time, and it worked. Regarding the complicated analysis: yes, it is, which is why I was trying to put it in a loop to get the result. As that did not seem to work, I split the FASTA files of all databases and added them to the loop, and it is working.

I can't understand why it was not working with the big files.

ADD REPLYlink modified 8 weeks ago • written 8 weeks ago by archana.bioinfo87100

Have you tested to make sure the databases have actually been properly made? Just because the command completed (did you check the log files) there is no guarantee that all is ok. Before you jump into a large job like this it is always best to test with a file or two to see if things are working ok. It is also a bad idea to put a for loop inside a SLURM job. Here is one example of how you may run these jobs.

for i in $(ls *serial.fa | sed 's/\.fa$//'); do echo sbatch -n 10 -N 1 --time=480:00:00 --mem=16G --output=multidb_test2.%J.out --error=multidb_test2.%J.err --wrap="../blastn -num_threads 10 -query $i.fa -db 5_db -qcov_hsp_perc 90 -outfmt 6 | sort -k1,1 -k12,12nr -k11,11n | sort -u -k1,1 --merge > ${i}_5db_hits.blastn"; done

If the commands look sane then remove the word echo to actually submit the jobs.
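
One shell detail matters for output names built from a loop variable followed by an underscore: the variable name has to be delimited with braces, otherwise the shell reads the underscore and the following word as part of the name. A minimal sketch (the helper name out_name is hypothetical):

```shell
# Hypothetical helper: build the output file name for one query prefix.
out_name() {
  local i="$1"
  # "$i_5db_hits.blastn" would be parsed as the (unset) variable i_5db_hits
  # followed by ".blastn"; the braces end the variable name at "i".
  printf '%s\n' "${i}_5db_hits.blastn"
}

out_name 6_serial   # prints 6_serial_5db_hits.blastn
```

BLAST+ option names are also case-sensitive, so the thread option has to be written as -num_threads.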

ADD REPLYlink modified 8 weeks ago • written 8 weeks ago by genomax58k

Thanks for the insights on the Slurm part, genomax. I'm not familiar with Slurm myself.

And well spotted that the OP forgot to add -num_threads 10; BLAST was thus running on a single core, even though the Slurm job requested 100. That still does not explain, however, why no output was produced.

ADD REPLYlink written 8 weeks ago by lieven.sterck3.1k

My hunch is that the databases have not been made properly hence there is no output (besides other problems with the command line you noted).
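
A quick sanity check is to confirm that the database files exist and then ask BLAST+ to describe the database. This is a minimal sketch: the helper name db_files_present is hypothetical, and it assumes the usual nucleotide database layout (.nhr, .nin and .nsq files for a single volume, or a .nal file for an alias or multi-volume database).

```shell
# Hypothetical helper: succeed if the files of a nucleotide BLAST database
# with the given prefix are present.
db_files_present() {
  local prefix="$1"
  # An alias or multi-volume database is described by a .nal file.
  if [ -f "$prefix.nal" ]; then
    return 0
  fi
  # A single-volume database has header, index and sequence files.
  [ -f "$prefix.nhr" ] && [ -f "$prefix.nin" ] && [ -f "$prefix.nsq" ]
}

# If the files are there, blastdbcmd (part of BLAST+) reports the number of
# sequences and the total length, which should match the input FASTA:
# db_files_present 5_db && blastdbcmd -db 5_db -info
```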

ADD REPLYlink modified 8 weeks ago • written 8 weeks ago by genomax58k

Thanks, but I already tested with a small sequence before putting them in the loop.

ADD REPLYlink written 8 weeks ago by archana.bioinfo87100

Thanks. Yes, I tested all databases before putting them in this big for loop. So far the only thing I can think of is a memory issue, because for small sequences I am getting results.

Thanks all for your suggestions

ADD REPLYlink written 8 weeks ago by archana.bioinfo87100

For the use of the " * ": I tested it myself and it will only work if there is only a single file that will match, if that is the case then that's fine. Along the same line: will there be only a single file matching $i\_*serial.fa ?

I'm still struggling with the loop for doing the blast itself. Why are you doing all those sorts of the output? And the given cmdline will only report the result of 3 input files against a db called 5_db, is that the complete cmdline you execute?
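
On the wildcard point, a small guard can make the script fail early when a pattern matches zero or several files instead of exactly one. A minimal sketch, with single_match as a hypothetical helper name (it assumes paths without spaces):

```shell
# Hypothetical helper: print the single file matching a glob pattern,
# or fail if the pattern matches zero or several files.
single_match() {
  # The pattern is left unquoted on purpose so that the shell expands it.
  set -- $1
  if [ "$#" -ne 1 ] || [ ! -e "$1" ]; then
    echo "expected exactly one match, got: $*" >&2
    return 1
  fi
  printf '%s\n' "$1"
}

# Usage (example pattern from this thread):
# query="$(single_match '6_*serial.fa')" || exit 1
```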

ADD REPLYlink written 8 weeks ago by lieven.sterck3.1k

Yes, only one database is matched against the other 8, one by one.

ADD REPLYlink modified 8 weeks ago • written 8 weeks ago by archana.bioinfo87100

I am still not convinced that what you are doing is all correct.

As has been pointed out before, you can only gain by using the database alias approach. A command line as follows would achieve that already:

blastdb_aliastool -dblist "1_db 2_db 3_db 4_db 5_db 6_db 7_db 8_db" -dbtype nucl -out all_db -title "all subDB"

This will give you a single DB to use in your BLAST command line, so there is no need anymore to loop over all your DBs. Moreover, it is generally not a good idea to split up your DB: the way you are running it makes it nearly impossible to compare the scores of hits between the different DBs, as they are specific to each query-DB search.

You can split up the input query file of your BLAST runs; that's totally fine.

It also seems to me that there is no point in sorting the BLAST output in your case, as the tabular output BLAST provides is already more or less sorted this way.
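
Put together, the alias approach could look like the sketch below. It only prints the commands unless DRY_RUN=0 is set; the run wrapper, the chunk file names and the output names are assumptions for illustration, and actually executing it requires BLAST+ on the PATH and the listed databases.

```shell
# Print the commands by default; set DRY_RUN=0 to execute them.
# Note: the printed form drops the quotes around multi-word arguments.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# 1. Combine the existing databases under one alias (no sequences are copied).
run blastdb_aliastool -dblist "1_db 2_db 3_db 4_db 5_db 6_db 7_db 8_db" \
  -dbtype nucl -out all_db -title "all subDB"

# 2. Search each query chunk once against the combined database.
for query in chunk_1.fa chunk_2.fa; do   # hypothetical chunk names
  run blastn -query "$query" -db all_db -num_threads 10 \
    -qcov_hsp_perc 90 -outfmt 6 -out "${query%.fa}_all_db.blastn"
done
```

Because every query is searched against the same combined database, the scores and e-values of hits from different sub-databases become directly comparable.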

ADD REPLYlink modified 8 weeks ago • written 8 weeks ago by lieven.sterck3.1k

Sorry, maybe I did not explain it well, but I split the query file, not the DB. I have already tried blastdb_aliastool but am still waiting for the result.

I am sorting the BLAST result to get only the exactly matching sequences.
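
Sorting by itself does not remove inexact hits, so it can help to separate the two steps: picking the best hit per query and filtering for exact matches. Below is a minimal sketch that works on the standard 12-column tabular output (-outfmt 6: column 3 is percent identity, 5 mismatches, 6 gap openings, 11 e-value, 12 bit score). The helper names are hypothetical, and "exact" here means 100% identity with no mismatches or gaps over the aligned region.

```shell
# Hypothetical helper: keep one line per query, the one with the highest
# bit score (ties broken by the lowest e-value), in first-seen query order.
best_hit_per_query() {
  awk -F'\t' '
    {
      q = $1; bits = $12 + 0; ev = $11 + 0
      if (!(q in best_bits)) { order[++n] = q }
      if (!(q in best_bits) || bits > best_bits[q] ||
          (bits == best_bits[q] && ev < best_ev[q])) {
        best_bits[q] = bits; best_ev[q] = ev; line[q] = $0
      }
    }
    END { for (i = 1; i <= n; i++) print line[order[i]] }
  ' "$@"
}

# Hypothetical helper: keep only hits with 100% identity and no mismatches
# or gap openings.
exact_matches() {
  awk -F'\t' '$3 == 100 && $5 == 0 && $6 == 0' "$@"
}

# Usage (example file name):
# exact_matches hits.blastn | best_hit_per_query > best_exact_hits.blastn
```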

Thanks for your valuable suggestions.

ADD REPLYlink written 8 weeks ago by archana.bioinfo87100

You totally lost me now...

Didn't you mention you created (or had) different DBs? What part of the blastdb_aliastool are you waiting for (that part itself should run instantaneously) or is it for the blast itself?

ADD REPLYlink written 8 weeks ago by lieven.sterck3.1k

Dear, I think I already mentioned in my question that I have 9 different databases. Things I have already tried:

1. I created 9 different databases using makeblastdb and used a loop to BLAST the sequences of one database against the others (it ran ~20 days with no result).
2. I tried blastdb_aliastool as well and have been waiting for some output for the last few days.
3. I split the query file and ran BLAST against a single database; I am getting results.

Please suggest something suitable for this query.

Thanks

ADD REPLYlink written 7 weeks ago by archana.bioinfo87100