Question: How can I BLAST 4,54,871 sequences?
1
3.3 years ago by
tcf.hcdg60
European Union
tcf.hcdg60 wrote:

Hello

I have 4,54,871 sequences and I want to BLAST them against a protein database. I need the top 5 results for each query and want to store all of them in a single file.

I have downloaded BLAST+ and am trying to do this with standalone BLAST.

I wonder whether this is the right way, or whether there are other methods?

Does anyone have experience with this? Any suggestions?

It would be nice if someone could share an R script for this. I am completely new to this field and have to complete this assignment.

Thanks in advance

blast • 1.9k views
modified 3.3 years ago by 5heikki7.8k • written 3.3 years ago by tcf.hcdg60

Check whether you can cluster the sequences. If many of them are similar, that may help.

written 3.3 years ago by cyril-cros820
5
3.3 years ago by
Sweden
Tanvir Ahamed 270 wrote:

Unix is a good environment for this. I doubt NCBI will allow a query set this large. I am doing something similar, though not exactly the same. My initial idea for your problem is:

1. Install BLAST on Unix.

2. Download the NCBI BLAST database to your local PC. Before using it, read the README file on the FTP site and follow its instructions. [At this point, you can build different BLAST databases according to your requirements.]

3. Once the BLAST database is created, test the system with a single query sequence (out of the 4,54,871). To get the top five results, use the "blastp" command with "-outfmt 6" (tabular format) and extract the sequences at the top of the list.

To extract sequences from a large file you can use faSomeRecords. Detailed instructions are here. Once you have extracted them, save the file.

4. Repeat STEP 3 for the next query sequence. For this repetition (loop), you can use Perl.

[Note: You have to develop an analysis pipeline. Also, as your list of query sequences is very large, you can reduce it by clustering the queries by sequence similarity, with a cutoff that defines the clusters. The cutoff should be selected so that all query sequences inside one cluster share the same top hits. Then, in STEP 3, use only one member from every cluster.]

** This is only rough thinking about your problem. You may face practical problems while implementing the whole analysis pipeline.
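As a rough illustration of the "top five" step, here is a minimal Python sketch that reads "-outfmt 6" tabular lines and keeps the best-scoring subjects for each query. It assumes the default 12-column layout (query ID in column 1, subject ID in column 2, bit score in column 12); the function name top_hits is made up for this example and is not part of BLAST.

```python
from collections import defaultdict


def top_hits(tabular_lines, n=5):
    """Return {query_id: [(subject_id, bit_score), ...]} with at most n hits each.

    Assumes default -outfmt 6 columns, where fields[0] is the query ID,
    fields[1] the subject ID and fields[11] the bit score.
    """
    best = defaultdict(dict)  # query ID -> {subject ID: best bit score seen}
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject, bitscore = fields[0], fields[1], float(fields[11])
        # A subject can appear on several rows (one per HSP); keep its best score.
        if bitscore > best[query].get(subject, float("-inf")):
            best[query][subject] = bitscore

    result = {}
    for query, subjects in best.items():
        ranked = sorted(subjects.items(), key=lambda item: item[1], reverse=True)
        result[query] = ranked[:n]
    return result
```

Because the function takes any iterable of lines, an open file handle on the BLAST output can be passed in directly, so the whole result file does not need to be loaded into memory first.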

modified 3.3 years ago • written 3.3 years ago by Tanvir Ahamed 270

Hello

I would like to blast my sequences using NCBI standalone BLAST. I downloaded the latest version of BLAST (2.2.31+).

After downloading, I extracted the files and installed them according to the instructions on the website. It was installed under

C:\Program Files\NCBI\blast-2.2.31+

This folder contains the uninstaller, bin and doc. Then I downloaded the database "refseq_rna.00" and created a folder "db" within the installed BLAST directory.

C:\Program Files\NCBI\blast-2.2.31+\db

Then I configured my PC by creating the environment variable "path" and setting it to the location of

C:\Program Files\NCBI\blast-2.2.31+\bin

Afterwards I opened cmd on my system and, to check whether the installation was OK, used the following commands.

Microsoft Windows [Version 6.1.7601]
Copyright (c) 2009 Microsoft Corporation.  All rights reserved.

C:\Users\Ashar Hussain>cd /

C:\>cd Program Files/NCBI/blast-2.2.31+

C:\Program Files\NCBI\blast-2.2.31+>dir
 Volume in drive C has no label.
 Volume Serial Number is 40E5-5989

 Directory of C:\Program Files\NCBI\blast-2.2.31+

06/19/2015  10:06 AM    <DIR>          .
06/19/2015  10:06 AM    <DIR>          ..
06/18/2015  12:51 PM    <DIR>          bin
06/22/2015  12:27 PM    <DIR>          db
06/18/2015  12:51 PM    <DIR>          doc
06/19/2015  09:59 AM            62,465 Uninstall-ncbi-blast-2.2.31+.exe
               1 File(s)         62,465 bytes
               5 Dir(s)  228,627,599,360 bytes free

C:\Program Files\NCBI\blast-2.2.31+>cd bin

C:\Program Files\NCBI\blast-2.2.31+\bin>blastn -version
blastn: 2.2.31+
Package: blast 2.2.31, build Jun  2 2015 10:18:08

C:\Program Files\NCBI\blast-2.2.31+\bin>

Up to this point everything works fine, but when I try to check the database I downloaded, the console gives a message like this:

C:\Program Files\NCBI\blast-2.2.31+\bin>blastdbcmd -db refseq_rna.00 -info

BLAST Database error: No alias or index file found for nucleotide database [refseq_rna.00] in search path [C:\Program Files\NCBI\blast-2.2.31+\bin;;]

I tried to resolve the error by downloading the database again, but that did not resolve it.

Does anybody have an idea? Any suggestions?

Thanks in advance

written 3.3 years ago by tcf.hcdg60

Yeah, the practical problem is that this doesn't scale.

I strongly advise against step 4. There is no need to invoke BLAST once per sequence: each invocation has its own overhead, and no Perl wrapper is needed. Given this, it would be more efficient to give the full sequence file to BLAST+ and set the number of threads according to the number of CPUs.
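To make the single-invocation idea concrete, here is a small Python sketch that only assembles the command line; it does not run anything by itself. The options used (-query, -db, -out, -outfmt, -max_target_seqs, -num_threads) are standard BLAST+ options, while the function name and the file and database names in the usage note are placeholders, not anything from this thread.

```python
import os


def build_blast_command(query_fasta, database, output_path,
                        program="blastp", threads=None, max_hits=5):
    """Assemble one BLAST+ call for a whole multi-FASTA query file.

    threads defaults to the number of CPUs reported by the operating system.
    """
    if threads is None:
        threads = os.cpu_count() or 1
    return [
        program,
        "-query", query_fasta,               # all query sequences in one file
        "-db", database,                     # name of a formatted BLAST database
        "-out", output_path,                 # one combined result file
        "-outfmt", "6",                      # tabular output
        "-max_target_seqs", str(max_hits),   # limit the hits kept per query
        "-num_threads", str(threads),
    ]
```

The returned list can be handed to subprocess.run(command, check=True), for example with build_blast_command("queries.fasta", "my_protein_db", "hits.tsv"), where all three names are hypothetical.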

There are better options for this case; see for example "GNU Parallel - Parallelize Serial Command Line Programs Without Changing Them". parallel can also deal with clusters and distribute your jobs across multiple cores. To get a realistic estimate of the running time, I would run a sufficient number of sequences, say 1,000 or 10,000, under the Unix time command and then extrapolate. You can sometimes reduce the effect of database loading by first running BLAST on a single sequence to cache the database, and then timing the larger run.
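The extrapolation itself is simple arithmetic. A minimal sketch, assuming the run time grows linearly with the number of query sequences (which is only an approximation, since sequence lengths and hit counts vary); the function and parameter names are made up for illustration:

```python
def extrapolate_runtime(sample_seconds, sample_size, total_sequences,
                        startup_seconds=0.0):
    """Linearly scale a timed sample run up to the full query set.

    startup_seconds is an optional estimate of fixed overhead, such as
    database loading, that is paid once rather than once per sequence.
    """
    per_sequence = (sample_seconds - startup_seconds) / sample_size
    return startup_seconds + per_sequence * total_sequences


# Made-up timing: suppose 1,000 sequences took 600 seconds in the test run.
estimated_seconds = extrapolate_runtime(600.0, 1000, 454871)
estimated_days = estimated_seconds / 86400
```

With those made-up numbers the estimate comes out at roughly 3.2 days; the real figure depends entirely on the database, the hardware and the sequences.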

written 3.3 years ago by Michael Dondrup44k
2
3.3 years ago by
genomax57k
United States
genomax57k wrote:

That is way too many sequences (if it is NOT an NGS dataset) to do on a standalone server. If you have access to a cluster then splitting the source file into multiple parallel jobs would be the way to go. You can merge the output files later.
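The splitting step could look something like the following Python sketch, which deals FASTA records out round-robin into a fixed number of chunks. It is only an illustration: the function name is made up, and writing each chunk to its own file and submitting the jobs is left to whatever scheduler the cluster uses.

```python
def split_fasta(lines, n_chunks):
    """Distribute FASTA records round-robin over n_chunks lists of lines.

    A record is a header line starting with '>' plus the sequence lines
    that follow it; records are never split across chunks.
    """
    chunks = [[] for _ in range(n_chunks)]
    record_index = -1
    for line in lines:
        if line.startswith(">"):
            record_index += 1  # a new record starts here
        if record_index < 0:
            continue  # ignore anything before the first header
        chunks[record_index % n_chunks].append(line)
    return chunks
```

Since tabular BLAST output is line-based, merging afterwards can be as simple as concatenating the per-chunk result files.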

If this is an NGS dataset then you may want to take a look at DIAMOND as an alternative to regular BLAST: http://ab.inf.uni-tuebingen.de/software/diamond/

written 3.3 years ago by genomax57k
1
3.3 years ago by
Spain. Universidad de Córdoba
Antonio R. Franco3.9k wrote:

Blast2GO is one of the answers, and my favorite one.

There is a free version, which is useful, and a PRO version with more features, cloud computing and much faster comparisons.

You can choose the number of top results to keep from your BLAST searches, get extra annotations such as EC, GO, KEGG and InterPro, run Fisher tests and enrichment analyses, and produce graphics.

modified 3.3 years ago • written 3.3 years ago by Antonio R. Franco3.9k
1

It is very commercialized, and for free users it is super slow. He might be able to complete his BLAST job by next year, probably!

modified 3.3 years ago • written 3.3 years ago by arnstrm1.7k
1
3.3 years ago by
Bergen, Norway
Michael Dondrup44k wrote:

You will need a big multicore server or a multi-CPU cluster to finish a BLAST job of this size, and it will still take weeks or months. Another option might be to use other tools such as RAPsearch2. It is key to keep the reference database as small as possible, e.g. by not using the whole NR if you don't need to.

written 3.3 years ago by Michael Dondrup44k

What about mpiBLAST? I have heard nothing but favorable reviews of it.

written 3.3 years ago by arnstrm1.7k

As I remember it, mpiBLAST uses the older BLAST (not BLAST+) implementation. The new version (called Abokia BLAST) is now a commercial product.

modified 3.3 years ago • written 3.3 years ago by genomax57k
0
3.3 years ago by
5heikki7.8k
Finland
5heikki7.8k wrote:

Have you predicted proteins from your contigs, or are you planning to run blastx? I would recommend the former strategy; it is much faster. Also, what do you hope to achieve with your results? If possible, opt for a database smaller than NCBI's nr, e.g. UniRef90. You can further reduce computational requirements by clustering your query proteins (e.g. at 95% identity) and then BLASTing only the representative sequences, but again, it all depends on what you want to achieve. Finally, use as many cores as possible in your BLAST run.

written 3.3 years ago by 5heikki7.8k