"Installing and Running NCBI BLAST" tutorial imported from the MSU course Analyzing Next Generation Sequencing Data (http://bioinformatics.msu.edu/ngs-summer-course-2011)
You should start this tutorial at a prompt that looks something like this::
cd to go to your home directory on your EC2 machine.
Now, use your Web browser on your laptop to go to: ftp://ftp.ncbi.nlm.nih.gov/blast/executables/LATEST
Right- or Control-click on the file ending with 'x64-linux.tar.gz', and "copy link URL". This is the file for 64-bit (large) Linux machines, which is what our EC2 instance is. (The current URL is: ftp://ftp.ncbi.nlm.nih.gov/blast/executables/LATEST/ncbi-blast-2.2.25+-x64-linux.tar.gz)
Now use the 'curl' program to download it to your Amazon computer::
curl -O ftp://ftp.ncbi.nlm.nih.gov/blast/executables/LATEST/ncbi-blast-2.2.25+-x64-linux.tar.gz
Here, 'curl' is a program that takes a Web link and downloads it via the command line; in this case, it's grabbing that file and saving it into your current directory.
After it completes, you should see the file in your local directory::
This is a .tar.gz file, which is kind of like a zip file. You need to use the 'tar' program to unpack it (you could use 'unzip' if it were a .zip file)::
tar xzf ncbi-*.tar.gz
This will create a new subdirectory,
ls Dropbox ncbi-blast-2.2.25+ ncbi-blast-2.2.25+-x64-linux.tar.gz
If you look in the blast subdirectory, you will see a few more files, most of which are directories::
ls ncbi-blast-2.2.25+ bin ChangeLog doc LICENSE ncbi_package_info README
In this case, we want to put everything in that
bin/ directory into a common place where UNIX knows to look for programs to run. One such place (that, by convention, is a good place to install things that don't come with the computer) is
cp ncbi-blast-2.2.25+/bin/* /usr/local/bin
Now, let's go to a new section of the machine. ::
This goes to the folder named '/mnt', which is on another (bigger) disk. We'll explain this more tomorrow.
Now lets grab some biggish files to work with... the mouse and zebrafish reference proteomes!
Go to ftp://ftp.ncbi.nlm.nih.gov/refseq/ in your browser and explore a bit. You'll see there's a bunch of files and directories; in this case, we want to go grab the mouse and zebrafish protein sets. So, grab ftp://ftp.ncbi.nlm.nih.gov/refseq/M_musculus/mRNA_Prot/mouse.protein.faa.gz and ftp://ftp.ncbi.nlm.nih.gov/refseq/D_rerio/mRNA_Prot/zebrafish.protein.faa.gz::
curl -O ftp://ftp.ncbi.nlm.nih.gov/refseq/M_musculus/mRNA_Prot/mouse.protein.faa.gz curl -O ftp://ftp.ncbi.nlm.nih.gov/refseq/D_rerio/mRNA_Prot/zebrafish.protein.faa.gz
These files aren't .tar.gz files, they're just .gz files -- the .faa means "fasta". 'gz' is a compression scheme for single files; to get at the contents, do uncompress both of them with this command::
If you use
ls, you'll see that the files have turned into
You can also take a look at the contents of the files with the
more program, which pages through the files::
Use the spacebar to scroll down, and 'q' to exit before reaching the end of the file. You can also look at the zebrafish file::
Now, let's convert them into BLAST databases::
makeblastdb -in mouse.protein.faa -dbtype prot makeblastdb -in zebrafish.protein.faa -dbtype prot
This lets us use BLAST to query the databases for matches.
Before we do a big BLAST, let's start by doing a small one, just to check that it's all working. To do that, we'll skim off some sequences from the top of the file::
The problem here is that 'head' by default only selects the first 10 lines of a file, which may not be a complete set of FASTA records -- so you may have to tweak things. In this case, the first 14 lines are complete::
head -14 zebrafish.protein.faa
Let's take the output of 'head' and put it in a file,
zebrafish.top, that we can use for other purposes::
head -14 zebrafish.protein.faa > zebrafish.top
OK, great! Now let's run a BLASTP comparing these zebrafish sequences to the mouse proteins, and we'll put the results in a file
blastp -query zebrafish.top -db mouse.protein.faa -out xxx.txt
(The file name
xxx.txt is just a throwaway file name, something that we can look at and see is a test file. You can use your own convention; I usually go with something short and recognizably silly, like 'xxx', 'yyy', 'foo', etc.)
OK, now take a look at that file with
Yep, looks like BLAST output to me!
There's all sorts of things you can do to alter the BLAST output; run
blastp to get a list of those options. For example, '-evalue 1e-6' will set the e-value cutoff at 1e-6, above which nothing will be displayed.
Now let's run a bigger BLAST, all zebrafish proteins against all mouse proteins::
blastp -query zebrafish.protein.faa -db mouse.protein.faa -out zebrafish.x.mouse &
This is going to take a while, which is why we told the computer to give us back a command prompt while blastp runs (that's what the
So, how long is it going to take? We can guesstimate by looking at how many sequences have been processed since we started. To do that, run ::
grep Query= zebrafish.x.mouse
OK, that gives us all the query lines -- now what? Let's count them with
grep Query= zebrafish.x.mouse | wc -l
| is what's known as a 'pipe', telling the command line to take the output of
grep and send it to the command
wc, which counts characters, words, and lines. The
-l tells wc to count the lines only.
Compare that number to the number of sequences in the zebrafish protein database::
grep ^'>' zebrafish.protein.faa | more
to see the FASTA headers, and ::
grep ^'>' zebrafish.protein.faa | wc -l
to count all the sequences.
Last, but not least -- let's run a quick script to convert the file into a set of CSV matches::
python ~/Dropbox/ngs-scripts/blast/blast-to-csv.py zebrafish.x.mouse > ~/Dropbox/zebrafish-mouse.csv
Take a look at the script and see if you can understand what it does::
Before you leave for lunch:
Let's start a second BLAST, all of mouse against all of zebrafish::
blastp -query mouse.protein.faa -db zebrafish.protein.faa -out mouse.x.zebrafish &
...now the computer can work while we eat!
When we come back, we can work through a reciprocal BLAST example.
.. @@ save these files
.. # scripting etc. .. @@ install biopython, ez_seutp, blastkit??