Question

How can I BLAST 4,54,871 sequences?

2

8.8 years ago

tcf.hcdg ▴ 70

Hello

I have 4,54,871 sequences and I want to BLAST them against a protein database. I need the top 5 results of each search and want to store those results in a single file.

I have downloaded BLAST+ and am trying to do this with standalone BLAST.

I wonder whether this is the right way, or whether there are other methods.

Does anyone have experience with this? Any suggestions?

It would be nice if someone could share an R script for this. I am absolutely new to this field and have to complete this assignment.

Thanks in advance

BLAST • 8.1k views

updated 16 months ago by Ram 43k • written 8.8 years ago by tcf.hcdg ▴ 70

0

Check whether you can cluster the sequences; if many of them are similar, that may help.

8.8 years ago by cyril-cros ▴ 950

Ram · Answer 1 · 2015-06-22

6

8.8 years ago

Tanvir Ahamed ▴ 350

Unix would be a good environment for this. I doubt NCBI will accept a query set this large. I am doing something similar, though not exactly the same. My initial idea for your problem is:

1. Install BLAST on Unix.
2. Download the NCBI BLAST database to your local PC. Before using it, read the README file on the FTP site and follow its instructions. [At this point, you can build different BLAST databases according to your requirements.]
3. Once the BLAST database is created, test the system with a single query sequence (out of the 4,54,871). To get the top five results, use the blastp command with -outfmt 6 (tabular format) and extract the sequences at the top of the list. To extract sequences from a large file you can use faSomeRecords. Detailed instructions here. Once extracted, save the file.
4. Repeat STEP 3 for the next query sequence. For this repetition (loop), you can use Perl.
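
As a rough illustration of the "top five results" part of STEP 3, here is a minimal Python sketch that keeps the best rows per query from tabular output. It assumes the default -outfmt 6 column order, where the bit score is the 12th column, and it counts every alignment row as a hit, so a subject with several alignments may appear more than once. The function name `top_hits` is made up for this example.

```python
from collections import defaultdict


def top_hits(lines, n=5):
    """Keep the n best-scoring rows per query from BLAST tabular (-outfmt 6) lines."""
    hits = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        # Assumed default columns: qseqid sseqid pident length mismatch gapopen
        # qstart qend sstart send evalue bitscore (bitscore is column 12).
        query, bitscore = fields[0], float(fields[11])
        hits[query].append((bitscore, line))
    result = {}
    for query, rows in hits.items():
        # Sort by bit score, highest first; ties keep their input order.
        rows.sort(key=lambda row: row[0], reverse=True)
        result[query] = [text for _, text in rows[:n]]
    return result
```

Writing the values of the returned dictionary to one file, query after query, would give the single results file asked for in the question.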

[Note: You have to develop an analysis pipeline. Also, as your list of query sequences is very large, you can reduce it by clustering on sequence similarity and selecting a cutoff that defines the clusters. The cutoff should be chosen so that, within one cluster, all query sequences share the same top hits. Then, in STEP 3, use only one member from each cluster.]

** This is only a rough idea for your problem. You may face some practical problems while implementing the whole analysis pipeline.

updated 16 months ago by Ram 43k • written 8.8 years ago by Tanvir Ahamed ▴ 350

0

Hello

I would like to BLAST my sequences using NCBI standalone BLAST. I downloaded the latest version of BLAST (2.2.31+).

After downloading, I extracted the files and installed them according to the instructions on the website. It was installed under

C:\Program Files\NCBI\blast-2.2.31+

This folder contains the uninstaller, bin and doc. Then I downloaded the database refseq_rna.00 and created a folder named "db" inside the BLAST installation.

C:\Program Files\NCBI\blast-2.2.31+\db

Then I configured my PC by creating the environment variable "path" and giving it the location of

C:\Program Files\NCBI\blast-2.2.31+\bin

Afterwards I opened cmd on my system and, to check whether the installation was OK, I used the following commands.

Microsoft Windows [Version 6.1.7601]
Copyright (c) 2009 Microsoft Corporation. All rights reserved.

C:\Users\Ashar Hussain>cd /
C:\>cd Program Files/NCBI/blast-2.2.31+
C:\Program Files\NCBI\blast-2.2.31+>dir
 Volume in drive C has no label.
 Volume Serial Number is 40E5-5989

 Directory of C:\Program Files\NCBI\blast-2.2.31+

06/19/2015  10:06 AM    <DIR>          .
06/19/2015  10:06 AM    <DIR>          ..
06/18/2015  12:51 PM    <DIR>          bin
06/22/2015  12:27 PM    <DIR>          db
06/18/2015  12:51 PM    <DIR>          doc
06/19/2015  09:59 AM            62,465 Uninstall-ncbi-blast-2.2.31+.exe
               1 File(s)         62,465 bytes
               5 Dir(s)  228,627,599,360 bytes free

C:\Program Files\NCBI\blast-2.2.31+>cd bin

C:\Program Files\NCBI\blast-2.2.31+\bin>blastn -version
blastn: 2.2.31+
Package: blast 2.2.31, build Jun  2 2015 10:18:08

C:\Program Files\NCBI\blast-2.2.31+\bin>

Up to this point everything works fine, but when I try to check the database I downloaded, the console gives a message like this

C:\Program Files\NCBI\blast-2.2.31+\bin>blastdbcmd -db refseq_rna.00 -info

BLAST Database error: No alias or index file found for nucleotide database [refseq_rna.00] in search path [C:\Program Files\NCBI\blast-2.2.31+\bin;;]

I tried to resolve the error by downloading the databases again, but that didn't fix it.

Does anybody have an idea? Any suggestions?

Thanks in advance

updated 16 months ago by Ram 43k • written 8.8 years ago by tcf.hcdg ▴ 70
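
The error message above shows that BLAST searched only the bin folder. A small check like the following Python sketch might help confirm whether a folder actually contains an alias file (.nal) or index file (.nin) for a nucleotide database name. It assumes the downloaded archive has already been unpacked; the function name `find_blast_db` and the folder list are hypothetical and not part of BLAST+.

```python
from pathlib import Path


def find_blast_db(db_name, search_dirs, nucleotide=True):
    """Return the first directory holding an alias or index file for db_name, else None."""
    # Nucleotide databases use .nal (alias) and .nin (index) files;
    # protein databases use .pal and .pin instead.
    suffixes = (".nal", ".nin") if nucleotide else (".pal", ".pin")
    for directory in search_dirs:
        for suffix in suffixes:
            if (Path(directory) / (db_name + suffix)).is_file():
                return Path(directory)
    return None
```

If the files turn up in the db folder rather than in bin, one option is to point the BLASTDB environment variable at that folder before running blastdbcmd again.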

0

Yeah, the practical problem is that this doesn't scale.

I strongly advise against step 4: there is no need to invoke BLAST once per sequence, since each invocation has its own overhead, and there is no need for a Perl wrapper. It would be more efficient to give the full sequence file to BLAST+ and specify a number of threads according to the number of CPUs.
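
A single invocation along those lines could be assembled as in this Python sketch. The options used (-query, -db, -out, -outfmt, -max_target_seqs, -num_threads) are BLAST+ command-line options; the file names, the database name and the helper `build_blast_command` are placeholders invented for the example.

```python
import os
import shlex


def build_blast_command(query_fasta, db, out_path, program="blastp",
                        threads=None, max_hits=5):
    """Build one BLAST+ command line that searches every query in query_fasta."""
    if threads is None:
        threads = os.cpu_count() or 1  # fall back to 1 if the count is unknown
    return [
        program,
        "-query", query_fasta,               # the full multi-FASTA query file
        "-db", db,                           # name of the local BLAST database
        "-out", out_path,                    # one output file for all queries
        "-outfmt", "6",                      # tabular output
        "-max_target_seqs", str(max_hits),   # limit on reported subjects per query
        "-num_threads", str(threads),
    ]


# Placeholder names; replace them with your own files and database.
cmd = build_blast_command("queries.fasta", "my_protein_db", "all_hits.tsv", threads=8)
print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` once the query file and database exist.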

There are better options for this case; see for example GNU Parallel - Parallelize Serial Command Line Programs Without Changing Them. GNU Parallel can also work with clusters and distribute your jobs across multiple cores. To get a realistic estimate of the running time, I would run a sufficient number of sequences, say 1000 or 10000, under the Unix time command and then extrapolate. You can sometimes reduce the effect of database loading by first running BLAST on a single sequence to cache the database, and then timing the larger run.
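
The timing estimate can be sketched in Python as follows. It assumes running time grows roughly linearly with the number of query sequences, which is only an approximation; both function names are made up for the example.

```python
def first_n_records(fasta_lines, n):
    """Yield the lines belonging to the first n FASTA records."""
    seen = 0
    for line in fasta_lines:
        if line.startswith(">"):
            seen += 1
            if seen > n:
                break  # stop at the header of record n + 1
        if seen:  # ignore anything before the first header
            yield line


def extrapolate_seconds(sample_seconds, sample_size, total_size):
    """Scale the time measured for sample_size queries up to total_size queries."""
    return sample_seconds * total_size / sample_size
```

For instance, if a sample of 1000 sequences took 600 seconds (a made-up figure), `extrapolate_seconds(600, 1000, 454871)` gives about 272,923 seconds, or roughly 76 hours.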

updated 16 months ago by Ram 43k • written 8.8 years ago by Michael 54k

Ram · Answer 2 · 2015-06-22

That is way too many sequences (if it is NOT an NGS dataset) to do on a standalone server. If you have access to a cluster then splitting the source file into multiple parallel jobs would be the way to go. You can merge the output files later.
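
Splitting the query file and merging the results afterwards could look like this Python sketch. The round-robin distribution is just one possible choice, and the function names are invented for the example; each chunk would then be written to its own file and submitted as a separate job.

```python
def split_fasta(fasta_lines, n_chunks):
    """Distribute FASTA records round-robin into n_chunks lists of lines."""
    chunks = [[] for _ in range(n_chunks)]
    index = -1
    for line in fasta_lines:
        if line.startswith(">"):
            index += 1  # a header starts the next record
        if index >= 0:  # ignore anything before the first header
            chunks[index % n_chunks].append(line)
    return chunks


def merge_outputs(output_paths, merged_path):
    """Concatenate per-chunk tabular result files into a single file."""
    with open(merged_path, "w") as merged:
        for path in output_paths:
            with open(path) as part:
                for line in part:
                    merged.write(line)
```

Plain concatenation is enough for tabular output because every row already carries its query identifier.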

If this is an NGS dataset then you may want to take a look at DIAMOND as an alternative to regular BLAST: http://ab.inf.uni-tuebingen.de/software/diamond/

Ram · Answer 3 · 2015-06-21

1

8.8 years ago

Antonio R. Franco ★ 5.1k

Blast2GO is one of the answers, and my favorite one.

There is a free version, which is useful, and a PRO version with more features, cloud computing and much faster comparisons.

You can choose the number of top results to keep for your BLAST searches, get extra annotations such as EC, GO, KEGG and InterPro, run Fisher tests and enrichment analyses, and produce graphics.

updated 16 months ago by Ram 43k • written 8.8 years ago by Antonio R. Franco ★ 5.1k

1

It is very commercialized, and for free users it is super slow. He might be able to complete his BLAST job probably next year!

8.8 years ago by arnstrm ★ 1.8k

Ram · Answer 4 · 2015-06-22

1

8.8 years ago

Michael 54k

You will need a big multicore server or a multi-CPU cluster to finish a BLAST job of this size, and it will still take weeks or months. Another option might be to use other tools like RAPsearch2. It is key to keep the reference database as small as possible, e.g. by not using the whole NR if you don't need to.

updated 16 months ago by Ram 43k • written 8.8 years ago by Michael 54k

0

What about mpiBLAST? I have heard nothing but favorable reviews of it.

8.8 years ago by arnstrm ★ 1.8k

0

As I remember it, mpiBLAST uses the older BLAST (not BLAST+) implementation. The new version (called Abokia BLAST) is now a commercial product.

8.8 years ago by GenoMax 141k

score 0 · Answer 5 · 2015-06-22

Have you predicted proteins from your contigs, or are you planning to run blastx? I would recommend the former strategy; it is much faster. Also, what do you hope to achieve with your results? If possible, opt for a database that is smaller than NCBI's nr, e.g. UniRef90. You can further reduce computational requirements by clustering your query proteins (e.g. at 95% identity) and then BLASTing only the representative sequences, but again, it all depends on what you want to achieve. Finally, use as many cores as possible in your BLAST run.