Tutorial:Installing And Running Ncbi Blast
Entering edit mode
12.3 years ago

"Installing and Running NCBI BLAST" tutorial imported from the MSU course Analyzing Next Generation Sequencing Data (http://bioinformatics.msu.edu/ngs-summer-course-2011)

You should start this tutorial at a prompt that looks something like this::


Type cd to go to your home directory on your EC2 machine.

Now, use your Web browser on your laptop to go to: ftp://ftp.ncbi.nlm.nih.gov/blast/executables/LATEST

Right- or Control-click on the file ending with 'x64-linux.tar.gz', and "copy link URL". This is the file for 64-bit (large) Linux machines, which is what our EC2 instance is. (The current URL is: ftp://ftp.ncbi.nlm.nih.gov/blast/executables/LATEST/ncbi-blast-2.2.25+-x64-linux.tar.gz)

Now use the 'curl' program to download it to your Amazon computer::

curl -O ftp://ftp.ncbi.nlm.nih.gov/blast/executables/LATEST/ncbi-blast-2.2.25+-x64-linux.tar.gz

Here, 'curl' is a program that takes a Web link and downloads it via the command line; in this case, it's grabbing that file and saving it into your current directory.

After it completes, you should see the file in your local directory::

ls ncbi-*.tar.gz

This is a .tar.gz file, which is kind of like a zip file. You need to use the 'tar' program to unpack it (you could use 'unzip' if it were a .zip file)::

tar xzf ncbi-*.tar.gz

This will create a new subdirectory, ncbi-blast-2.2.25+::

 Dropbox  ncbi-blast-2.2.25+  ncbi-blast-2.2.25+-x64-linux.tar.gz

If you look in the blast subdirectory, you will see a few more files, most of which are directories::

ls ncbi-blast-2.2.25+
 bin  ChangeLog  doc  LICENSE  ncbi_package_info  README

In this case, we want to put everything in that bin/ directory into a common place where UNIX knows to look for programs to run. One such place (that, by convention, is a good place to install things that don't come with the computer) is /usr/local/bin::

cp ncbi-blast-2.2.25+/bin/* /usr/local/bin

Now, let's go to a new section of the machine. ::

cd /mnt

This goes to the folder named '/mnt', which is on another (bigger) disk. We'll explain this more tomorrow.

Now lets grab some biggish files to work with... the mouse and zebrafish reference proteomes!

Go to ftp://ftp.ncbi.nlm.nih.gov/refseq/ in your browser and explore a bit. You'll see there's a bunch of files and directories; in this case, we want to go grab the mouse and zebrafish protein sets. So, grab ftp://ftp.ncbi.nlm.nih.gov/refseq/M_musculus/mRNA_Prot/mouse.protein.faa.gz and ftp://ftp.ncbi.nlm.nih.gov/refseq/D_rerio/mRNA_Prot/zebrafish.protein.faa.gz::

curl -O ftp://ftp.ncbi.nlm.nih.gov/refseq/M_musculus/mRNA_Prot/mouse.protein.faa.gz
curl -O ftp://ftp.ncbi.nlm.nih.gov/refseq/D_rerio/mRNA_Prot/zebrafish.protein.faa.gz

These files aren't .tar.gz files, they're just .gz files -- the .faa means "fasta". 'gz' is a compression scheme for single files; to get at the contents, do uncompress both of them with this command::

gunzip *.gz

If you use ls, you'll see that the files have turned into mouse.protein.faa and zebrafish.protein.faa::


You can also take a look at the contents of the files with the more program, which pages through the files::

more mouse.protein.faa

Use the spacebar to scroll down, and 'q' to exit before reaching the end of the file. You can also look at the zebrafish file::

more zebrafish.protein.faa

Now, let's convert them into BLAST databases::

makeblastdb -in mouse.protein.faa -dbtype prot
makeblastdb -in zebrafish.protein.faa -dbtype prot

This lets us use BLAST to query the databases for matches.

Before we do a big BLAST, let's start by doing a small one, just to check that it's all working. To do that, we'll skim off some sequences from the top of the file::

head zebrafish.protein.faa

The problem here is that 'head' by default only selects the first 10 lines of a file, which may not be a complete set of FASTA records -- so you may have to tweak things. In this case, the first 14 lines are complete::

head -14 zebrafish.protein.faa

Let's take the output of 'head' and put it in a file, zebrafish.top, that we can use for other purposes::

head -14 zebrafish.protein.faa > zebrafish.top

OK, great! Now let's run a BLASTP comparing these zebrafish sequences to the mouse proteins, and we'll put the results in a file xxx.txt::

blastp -query zebrafish.top -db mouse.protein.faa -out xxx.txt

(The file name xxx.txt is just a throwaway file name, something that we can look at and see is a test file. You can use your own convention; I usually go with something short and recognizably silly, like 'xxx', 'yyy', 'foo', etc.)

OK, now take a look at that file with more::

more xxx.txt

Yep, looks like BLAST output to me!

There's all sorts of things you can do to alter the BLAST output; run blastp to get a list of those options. For example, '-evalue 1e-6' will set the e-value cutoff at 1e-6, above which nothing will be displayed.

Now let's run a bigger BLAST, all zebrafish proteins against all mouse proteins::

blastp -query zebrafish.protein.faa -db mouse.protein.faa -out zebrafish.x.mouse &

This is going to take a while, which is why we told the computer to give us back a command prompt while blastp runs (that's what the & does).

So, how long is it going to take? We can guesstimate by looking at how many sequences have been processed since we started. To do that, run ::

grep Query= zebrafish.x.mouse

OK, that gives us all the query lines -- now what? Let's count them with wc -l::

grep Query= zebrafish.x.mouse | wc -l

Here, | is what's known as a 'pipe', telling the command line to take the output of grep and send it to the command wc, which counts characters, words, and lines. The -l tells wc to count the lines only.

Compare that number to the number of sequences in the zebrafish protein database::

grep ^'>' zebrafish.protein.faa | more

to see the FASTA headers, and ::

grep ^'>' zebrafish.protein.faa | wc -l

to count all the sequences.

Last, but not least -- let's run a quick script to convert the file into a set of CSV matches::

python ~/Dropbox/ngs-scripts/blast/blast-to-csv.py zebrafish.x.mouse > ~/Dropbox/zebrafish-mouse.csv

Take a look at the script and see if you can understand what it does::

more ~/Dropbox/ngs-scripts/blast/blast-to-csv.py

Before you leave for lunch:

Let's start a second BLAST, all of mouse against all of zebrafish::

blastp -query mouse.protein.faa -db zebrafish.protein.faa -out mouse.x.zebrafish &

...now the computer can work while we eat!

When we come back, we can work through a reciprocal BLAST example.

.. @@ save these files

.. # scripting etc. .. @@ install biopython, ez_seutp, blastkit??

blast • 8.8k views

Login before adding your answer.

Traffic: 3391 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6