Question

Bioinformatics: Create database for local BLAST+ alignment

4.0 years ago

tom5 • 0

Hi, I am trying to run BLAST+ alignment remotely, but the server keeps logging me out. I think a better strategy would be to run BLAST+ locally with a database. I am performing protein BLAST alignment on chicken and mouse genes, so I would like to set up a local version of a database (such as the nr database) with just those organisms. Please let me know if there's a way to do this.

BLAST R • 1.1k views


score 1 · Answer 1 · 2020-05-03

4.0 years ago

Mensur Dlakic ★ 27k

For your task, a tutorial is here. You may want to read the general BLAST manual as well.


Thank you for your help. I looked at the tutorial and manual you linked and I don't think they fully answered my question. I am trying to run protein BLAST alignment locally through BLAST+ and want to download the nr database. However, due to the large size of the database, I'd like to only download the portion of the dataset that corresponds to organisms I am working with: Mouse and chicken.

I looked at the NCBI guide and it gives a command to download databases: update_blastdb.pl --decompress nr [*]

However, I am not sure how to specify an organism-specific download. The tutorial you recommended suggests using makeblastdb to generate a database from FASTA files. How do I get the correct files to do so? Please let me know if you have a recommendation. Have a good evening.

ADD REPLY • link 4.0 years ago by tom5 • 0

"However, due to the large size of the database, I'd like to only download the portion of the dataset that corresponds to organisms I am working with: Mouse and chicken."

No, you can't do that. You are best off downloading the mouse and chicken genome FASTA files from NCBI Datasets and then creating the database yourself.
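One way this could look from the command line is sketched below. It assumes the NCBI `datasets` command-line tool is installed and that its `download genome taxon` subcommand accepts `--reference`, `--include protein` and `--filename` (true of recent releases as far as I know, but check `datasets download genome taxon --help` for your version). It requests protein FASTA rather than genomic sequence because the goal is protein BLAST. The helper names `merge_proteins` and `download_proteins` and the output names are hypothetical.

```shell
#!/bin/sh
# Sketch: fetch reference proteomes and build one combined protein database.
# Assumes the NCBI `datasets` CLI and BLAST+ are on PATH; names are illustrative.
set -eu

# Concatenate every protein.faa found under the given directories into one FASTA.
# Usage: merge_proteins OUTPUT DIR [DIR...]
merge_proteins() {
    out="$1"
    shift
    find "$@" -type f -name 'protein.faa' -exec cat {} + > "$out"
}

# Download and unpack the reference genome package (protein FASTA only).
# Usage: download_proteins "Taxon name" label
download_proteins() {
    datasets download genome taxon "$1" --reference --include protein \
        --filename "$2.zip"
    unzip -o -q "$2.zip" -d "$2"
}

if command -v datasets >/dev/null 2>&1 && command -v makeblastdb >/dev/null 2>&1; then
    download_proteins "Mus musculus" mouse
    download_proteins "Gallus gallus" chicken
    merge_proteins mouse_chicken.faa mouse chicken
    mkdir -p blast_db
    makeblastdb -in mouse_chicken.faa -dbtype prot -out blast_db/mouse_chicken
else
    echo "datasets and/or makeblastdb not found; skipping download" >&2
fi
```

Using the command-line download may also sidestep problems with the web download page.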

ADD REPLY • link 4.0 years ago by GenoMax 142k

Hi, thanks again, this is a valuable resource. Unfortunately I am uncertain how to go from here to my BLAST database. What I've done so far is:

1. Downloaded the mouse data from NCBI Datasets. In the download options, I chose Annotated features (GFF3) and Protein (FASTA), since I want to do protein BLAST. The data downloaded as a nested .zip directory organized as 'mouse_data/ncbi_dataset/data/GCF_000001635.26'. The 'GCF_000001635.26' folder contained multiple .fna files, as in the picture below.
2. Concatenated all of the .fna files with 'cat *.fna > mydb.fna'.
3. Ran makeblastdb on my data: makeblastdb -in mydb.fna -dbtype prot -out blast_db/blast_db
4. Downloaded a protein FASTA file to test protein BLAST and ran: blastp -db blast_db -query atoh1_prot.fasta -out blast_output/atoh1_results.out

Though this command runs, it doesn't return. I let it run for several minutes before quitting manually.

I'm not sure where my error is but I suspect I did not generate the dataset correctly. Please let me know if you can help.

ADD REPLY • link 4.0 years ago by tom5 • 0

Another issue I had was downloading the Gallus gallus (chicken) dataset from the NCBI Datasets web page. When I tried to download it, I was redirected to an empty page and the download did not start.

ADD REPLY • link 4.0 years ago by tom5 • 0

Files with the .fna extension are nucleotide FASTA files, in this case genomic DNA. If you want to search them with protein queries, you will need to use tblastn instead of blastp. I don't suggest doing that with raw eukaryotic genomes because of splicing: coding sequences are interrupted by introns, so hits come back fragmented exon by exon. Instead, look for files with the .faa extension, which contain proteins. It looks like you already have protein.faa in your collection, so use that as input for makeblastdb and for the subsequent blastp search.
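A minimal sketch of that suggestion follows, assuming protein.faa sits in the assembly folder named earlier in the thread and that the query file is atoh1_prot.fasta. The database name `mouse_prot`, the tabular output format and the E-value cutoff are illustrative choices, not requirements.

```shell
#!/bin/sh
# Sketch: build a protein BLAST database from protein.faa and search it with blastp.
# Paths and the database name are assumptions; adjust them to your own layout.
set -eu

faa="mouse_data/ncbi_dataset/data/GCF_000001635.26/protein.faa"
db="blast_db/mouse_prot"
query="atoh1_prot.fasta"

if command -v makeblastdb >/dev/null 2>&1 && [ -s "$faa" ] && [ -s "$query" ]; then
    mkdir -p blast_db blast_output
    # -dbtype prot now matches the input, which really is protein sequence
    makeblastdb -in "$faa" -dbtype prot -out "$db"
    # -db takes the same path prefix that was given to -out above
    blastp -db "$db" -query "$query" -evalue 1e-5 -outfmt 6 \
        -out blast_output/atoh1_results.tsv
else
    echo "BLAST+ not found, or $faa / $query missing; nothing done" >&2
fi
```

Note that `-db` must point at the same prefix passed to `-out` in makeblastdb (here `blast_db/mouse_prot`), not just the directory.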

As to why nothing happens in your search: a DNA sequence can pose as a fake protein sequence, since the letters used for DNA bases (A, C, G, T) are also legitimate amino acid codes. Given that DNA databases are at least 3x larger than the corresponding protein databases (and in the case of eukaryotes more like 50-100x larger), the search you initiated is simply taking a long time. It would likely finish if you gave it some time, though the results would be meaningless, as you are comparing a protein sequence to a DNA database acting as a fake protein database.
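Because makeblastdb will accept DNA letters as protein residues, a quick check of the input before building the database can catch this mix-up. The sketch below is a rough heuristic, not part of BLAST+: the function name `seq_type` and the 90% threshold are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch: guess whether a FASTA file holds nucleotide or protein sequence.
# Heuristic (an assumption): if more than 90% of the sequence characters are
# A, C, G, T, U or N, call it nucleotide.
set -eu

seq_type() {
    awk '
        !/^>/ {
            s = toupper($0)
            total += length(s)
            # gsub returns how many characters it replaced
            n += gsub(/[ACGTUN]/, "", s)
        }
        END {
            if (total == 0)            print "empty"
            else if (n / total > 0.9)  print "nucleotide"
            else                       print "protein"
        }
    ' "$1"
}

# Example use: refuse to build a protein database from nucleotide input.
# [ "$(seq_type mydb.fna)" = "protein" ] || echo "mydb.fna does not look like protein" >&2
```

Run on the concatenated .fna file from the steps above, a check like this would be expected to report nucleotide input, which is the cue to switch to protein.faa.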

ADD REPLY • link 4.0 years ago by Mensur Dlakic ★ 27k