Question: How to download just the genomes I want for blast+: currently using update_blastdb.pl
3 months ago, Jacob wrote:

Right now I'm running

update_blastdb.pl --timeout 300 refseq_genomic

But this takes up hundreds of GB on my computer. I'm wondering if there is a way to get just the genomes I want. For example, if I just want the genomes for Gallus gallus, Mus musculus, and Homo sapiens, how can I do something similar to get just those genomes?

Please explain things if you can. I'm pretty new at this and not very good at linking FTP databases to my blast searches.

3 months ago, genomax (United States) wrote:

Get those genomes from the NCBI genomes FTP site (you could cat the chromosome files together) and build the blast index yourself using makeblastdb.

Otherwise, UCSC has full FASTA-format genome files for human, mouse and chicken as single-file downloads, with all chromosomes already in one file. Making your own blast database is the same as above and is explained in this manual.


Thank you very much. I've tried this method, but I cannot execute it correctly and I do not know why.

server:database user$ ~/homebrew/bin/wget https://ftp.ncbi.nlm.nih.gov/genomes/Mus_musculus
--2017-05-18 15:35:48--  https://ftp.ncbi.nlm.nih.gov/genomes/Mus_musculus
Resolving ftp.ncbi.nlm.nih.gov... 2607:f220:41e:250::7, 130.14.250.12
Connecting to ftp.ncbi.nlm.nih.gov|2607:f220:41e:250::7|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3656 (3.6K) [text/html]
Saving to: ‘Mus_musculus’

Mus_musculus                                      100%[==========================================================================================================>]   3.57K  --.-KB/s    in 0s      

2017-05-18 15:35:48 (63.4 MB/s) - ‘Mus_musculus’ saved [3656/3656]

I then follow up this command with the following and get errors which I do not know how to tackle

server:database user$ cd ..

option 1

server:BlastFolder user$ makeblastdb -in database/Mus_musculus -out database/mouse_genome -dbtype nucl
Building a new DB, current time: 05/18/2017 15:38:55
New DB name:   /Users/user/Desktop/BlastFolder/database/mouse_genome
New DB title:  database/Mus_musculus
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 1000000000B
BLAST options error: database/Mus_musculus does not match input format type, default input type is FASTA

option 2

server:BlastFolder user$ makeblastdb -in database/Mus_musculus -out database/mouse_genome -dbtype nucl -input_type blastdb
BLAST Database error: No alias or index file found for nucleotide database [database/Mus_musculus] in search path [/Users/user/Desktop/BlastFolder::]
written 3 months ago by Jacob

Your command is wrong since it does not address a specific file.
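
The wget log above reports a 3.6K file of type [text/html], which suggests the URL returned a directory listing page rather than sequence data. Since every FASTA record begins with ">", a quick check before running makeblastdb might look like the sketch below. The function name looks_like_fasta is made up for this illustration and is not part of BLAST+.

```shell
# Succeeds if the first non-blank line of a file starts with ">",
# which is how a FASTA record begins. Compressed files must be
# unpacked before checking.
looks_like_fasta() {
    # "$1" is the path of the file to inspect.
    first_line=$(awk 'NF { print; exit }' "$1")
    case "$first_line" in
        '>'*) return 0 ;;
        *)    return 1 ;;
    esac
}

# Example call (path taken from the transcript above):
#   looks_like_fasta database/Mus_musculus || echo "not a FASTA file"
```

Running the standard file utility on the download is another quick way to see whether it is HTML or sequence text.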

I suggest that you use the UCSC links I provided to make your life simpler. The commands in that case should be

wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chromFa.tar.gz
wget http://hgdownload.soe.ucsc.edu/goldenPath/mm10/bigZips/chromFa.tar.gz
wget http://hgdownload.soe.ucsc.edu/goldenPath/galGal5/bigZips/galGal5.fa.gz

After you download the files you will need to gunzip/tar or tar -avf them to uncompress them. That will be followed by cat'ing the three genome files together:

cat hg38.chromFa.fa mouse.fa chicken.fa > giant_genome.fa

Finally run makeblastdb -i giant_genome.fa etc. to make the database.

Use real file names when cat'ing and appropriate options for makeblastdb when you run the final command.

Note: If you want to make separate databases for the three genomes then don't do the cat step.
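
The steps above can be sketched as a few small shell functions. This is only an outline under assumptions: wget, tar, gunzip and makeblastdb (BLAST+) are on the PATH, the function names are invented for illustration, and the file names passed to combine_fasta must be whatever the archives actually unpack to (check with ls). For .tar.gz archives the usual tar flags are -xzf (extract, gunzip, from file).

```shell
# Download one file and unpack it according to its extension.
fetch_and_unpack() {
    url=$1
    file=${url##*/}              # strip everything up to the last "/"
    wget "$url"
    case "$file" in
        *.tar.gz) tar -xzf "$file" ;;   # archive of several files
        *.gz)     gunzip "$file" ;;     # single gzipped file
    esac
}

# Concatenate FASTA files into one.
# usage: combine_fasta OUTPUT.fa INPUT1.fa [INPUT2.fa ...]
combine_fasta() {
    out=$1
    shift
    cat "$@" > "$out"
}

# Build a nucleotide BLAST database from a FASTA file.
# usage: build_db INPUT.fa DB_NAME
build_db() {
    makeblastdb -in "$1" -dbtype nucl -out "$2" -title "$2"
}

# Example for chicken only (URL from the post above):
#   fetch_and_unpack http://hgdownload.soe.ucsc.edu/goldenPath/galGal5/bigZips/galGal5.fa.gz
#   build_db galGal5.fa chicken_genome
```

For separate databases per genome, call build_db once per FASTA file and skip combine_fasta.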

written 3 months ago by genomax

Thank you, a few questions though.

Main problem: I'm still getting an error with my makeblastdb command after running gunzip galGal5.fa.gz.

I assume you meant to type -in, because I have no -i option. When I use -in I get this error:

makeblastdb -in 'database/galGal5.fa'

USAGE
  makeblastdb [-h] [-help] [-in input_file] [-input_type type]
..
..
Error: Argument "dbtype". Mandatory value is missing:  `String, `nucl', `prot''
Error:  (CArgException::eNoArg) Argument "dbtype". Mandatory value is missing:  `String, `nucl', `prot''

When I add in some of these mandatory values I still get an error

server:BlastFolder user$ makeblastdb -in 'database/galGal5.fa' -out database/chicken_genome -dbtype nucl -input_type blastdb -title "Chicken_genome"

Building a new DB, current time: 05/18/2017 17:35:26
New DB name:   /Users/user/Desktop/BlastFolder/database/chicken_genome
New DB title:  Chicken_genome
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 1000000000B
Error: [makeblastdb] Unable to open input database/galGal5.fa as BLAST db
BLAST Database error: No alias or index file found for nucleotide database [database/galGal5.fa] in search path [/Users/user/Desktop/BlastFolder::]

Extra questions:

If I want to get the genomes from the NCBI link I posted, how can I get the specific link?

Is that supposed to be tar -avf? My tar has no -a option.

written 3 months ago by Jacob

You will need to go into individual chromosome directories and get the *fa.gz file for each (e.g. Chr1 for Mouse).

Use the UCSC method above. It will save you a bunch of time. Sequence is identical no matter where you get it from.

If you need a primer for unix then I suggest that you spend some time at this site.

written 3 months ago by genomax

"If I want to get the genomes from the ncbi link I posted, how can I get the specific link"

Trying to extract genomes you need from blast index for nt or refseq_genomic would be a much more tedious undertaking. You can't do it on the fly so to speak. You will need to download the entire index locally and then do the extraction. The method I described here is more straightforward.
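
For completeness, the extraction route described above could look roughly like the following sketch. It assumes the full database is already downloaded, that blastdbcmd (part of BLAST+) is on the PATH, and that the database carries taxonomy IDs so the %T output field is populated. The function names and the wanted_accessions.txt file name are invented for illustration. The NCBI taxonomy IDs used are 9606 (Homo sapiens), 10090 (Mus musculus) and 9031 (Gallus gallus).

```shell
# Read "accession taxid" lines on stdin and print the accessions
# that belong to human, mouse or chicken.
pick_accessions() {
    awk '$2 == 9606 || $2 == 10090 || $2 == 9031 { print $1 }'
}

# usage: extract_three_genomes DB_NAME OUTPUT.fa
extract_three_genomes() {
    # Step 1: list every accession with its taxid and keep the wanted ones.
    blastdbcmd -db "$1" -entry all -outfmt '%a %T' | pick_accessions > wanted_accessions.txt
    # Step 2: dump the selected sequences as FASTA (the default output format).
    blastdbcmd -db "$1" -entry_batch wanted_accessions.txt -out "$2"
}

# Example call:
#   extract_three_genomes refseq_genomic three_genomes.fa
```

This still requires the hundreds of GB of local storage the question was trying to avoid, which is why downloading the three genomes directly is the simpler path.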

written 3 months ago by genomax

Thank you so much for your help. I edited the comment because it still wasn't working, but I think I just need to change the -input_type to fasta.
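
The errors in the transcripts above are consistent with this: -input_type blastdb tells makeblastdb that the input is an existing BLAST database, so it looks for index files instead of reading the FASTA text. A corrected call, sketched here as a small wrapper (the function name make_nucl_db is hypothetical), passes -input_type fasta, which is also the default when the option is omitted, and keeps -dbtype nucl.

```shell
# Build a nucleotide BLAST database from a plain (uncompressed) FASTA file.
# usage: make_nucl_db INPUT.fa OUTPUT_DB_NAME TITLE
make_nucl_db() {
    makeblastdb -in "$1" -input_type fasta -dbtype nucl -out "$2" -title "$3"
}

# Example call, using the paths from the thread:
#   make_nucl_db database/galGal5.fa database/chicken_genome Chicken_genome
```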

written 3 months ago by Jacob