Vertebrate Subset Nr Database? Build My Own?
3
52
Entering edit mode
13.1 years ago
Jason ▴ 430

I think I have the answer to my own question, but I'm somewhat new to bioinformatics, and I want to make sure my strategy is sound (and that there are no easier solutions to my problem).

We need to search against the vertebrate subset of the nr database. Hence I've been tasked with finding or building a vertebrate subset of the nr database. So far, I can't find one. So to build one, I'm planning on doing the following.

  1. Get the nr database (I believe I have the one from ncbi or ensembl).
  2. Get the taxonomy database from ncbi.
  3. Somehow traverse the taxonomy db, and extract the taxids of all children of the parent vertebrate node (looks like the parent vertebrate is taxid 7742 from names.dump from taxdump.tar.gz) (see ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump_readme.txt).
  4. Use the list of vertebrate taxids from the previous step to extract the GIs of proteins from gi_taxid_prot.dmp.gz (see ftp://ftp.ncbi.nih.gov/pub/taxonomy/gi_taxid.readme).
  5. Use the list of vertebrate GIs from the previous step with the blastdb_aliastool to build an aliased blastdb of just vertebrates. Something along the lines of this:

    blastdb_aliastool -gilist vertebrate_gis.txt -db nr -out nr_vertebrates -title nr_vertebrates

Am I on the right track? Am I reinventing the wheel? Does the vertebrate gi_list or vertebrate taxid_list that I'm trying to build already exist elsewhere?

Thanks!


Update It appears that blastx via the ncbi web page allows you to specify a taxid via their "Organism" field. However, the command line version of blastx doesn't offer that functionality (at least my blastx doesn't).

In addition, the command line blastn tool does allow you to set -window_masker_taxid 7742, but there isn't an equivalent for command line blastx.

It's odd that the web version and the command line version are so different. Does the web version simply do a full nr search then return only those results that match the taxid argument? That would seem odd. But if it does use the actual blast database to filter out unwanted taxonomies, why isn't that functionality included with the command line version of the tool?

(We're wanting to run 100,000 - 1,000,000 blastx queries, so doing this over the web isn't practical.)


Update Thanks to neilfws for pointing out the gilist from entrez trick! That saved me a lot of time and allowed me to implement a solution immediately. This is what I went with:

  1. Download the prebuilt nr database (ncbi?).
  2. Search the Entrez Protein database with query: "txid7742[ORGN]"
  3. Select "Send to File" and choose format "GI list"
  4. Use the list of vertebrate GIs from the previous step with the blastdb_aliastool to build an aliased blastdb of just vertebrates (takes several seconds):

    blastdb_aliastool -gilist vertebrates.gi_list.txt -db nr -out nr_vertebrates -title nr_vertebrates

  5. Search against your new (aliased) database:

    blastx -query query.fa -db nr_vertebrates

This seems to be working for us. The queries return good hits, and (after loading into cache) the time to search has been reduced from 6.5 minutes (all of nr) to 1.0 minutes (just vertebrate nr). I'm not sure if it would be faster taking the more native approach suggested by neilfws (I haven't gone to the trouble to test that out), but I'm happy enough with my speedier blastx! Thanks again neilfws!

blast database • 24k views
ADD COMMENT
0
Entering edit mode

Hi, I tried blastdbaliastool -gilist vertebrates.gilist.txt -db nr -out nrvertebrates -title nrvertebrates

but I got an error "BLAST Database error: GI list specified but no ISAM file found for GI."

Any idea what this is? Thanks so much

ADD REPLY
0
Entering edit mode

Hi, Thanks for providing your updates, they helped me to create my database.

Quick comment/question: I noticed that the number of sequences in my newly created database was smaller than the original number of converted GIs.

"Converted 441134 GIs from /sequences.gi.txt to binary format in nt_database.n.gil
Created nucleotide BLAST (alias) database nt_database with 21972 sequences."

Everything works fine but I was wondering if this was normal? For example, maybe there are multiple identifical GIs that are attributed to a specific sequences in the new database?

I found this blog post and the author showed similar results but didn't mention any issue (http://johnstantongeddes.org/aptranscriptome/2013/12/31/notes.html)

"Converted 1749040 GIs from insecta_gi.txt to binary format in nr_insecta.p.gil Created protein BLAST (alias) database nr_insecta with 838606 sequences"

Any insights appreciated.

ADD REPLY
1
Entering edit mode

This has been confirmed by the NCBI help (through email) that this is normal and probably caused by redundant sequences matching to single GIs.

ADD REPLY
35
Entering edit mode
13.1 years ago
Neilfws 49k

You are definitely on the right track; the problem with this type of task is that there are several alternatives at each step, leading to multiple ways to do it.

Here's how I might approach it. Note that I am old-fashioned and have not moved to BLAST+ so these commands are for BLAST 2.2.21.

  1. Download the nr protein FASTA file from ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nr.gz
  2. Unpack it: gunzip nr.gz
  3. Format as BLAST database: formatdb -i nr -o T -p T (a)
  4. Search the Entrez Protein database with query: "txid7742[ORGN]"
  5. Select "Send to File" and choose format "GI list" (b)
  6. Now, you can use fastacmd to dump sequences from the NR database and supply the GI list as an argument to dump only vertebrates. Assuming the GI list file is named gi.txt: fastacmd -d nr -i gi.txt -o vertebrata.fa
  7. Finally, use formatdb again to create the vertebrate BLAST database: formatdb -i vertebrata.fa -o T -p T -n vertebrata

(a) Note: you can also download pre-formatted BLAST DB files from ftp://ftp.ncbi.nih.gov/blast/db/, but I prefer to start from FASTA.

(b) This will take some time. You can also use EUtils to search/retrieve.

ADD COMMENT
0
Entering edit mode

Nice answer! Just to mention, EUtils route takes longer to finish than local blast.

ADD REPLY
0
Entering edit mode

Ah, steps 3 and 4 were very helpful! That just saved me a ton of time and coding effort. Thank you!

ADD REPLY
0
Entering edit mode

Ah, steps 4 and 5 were very helpful! That just saved me a ton of time and coding effort. Thank you!

ADD REPLY
0
Entering edit mode

@neilfws: "-D 1" will dump entire database as fasta (will ignore -i option); It has to be "fastacmd -d nr -i gi.txt -o vertebrata.fa"

ADD REPLY
0
Entering edit mode

Yes you're correct RM, well-spotted. Answer edited.

ADD REPLY
0
Entering edit mode

Hi if I want to use it on Bowtie2, than what should I do?

ADD REPLY
3
Entering edit mode
10.2 years ago

A blast+ solution to this problem is the following

  1. download your database from the NCBI ftp website (e.g. nt)
  2. download a list of GI identifiers from entrez
  3. extract the fasta sequences using blastdbcmd

    blastdbcmd -db nt -dbtype 'nucl' -entry_batch $gilist -out vertebrates.fa
    
  4. create a new reduced blastdb with the new fasta sequences

    makeblastdb -in vertebrates.fa -dbtype 'nucl' -title vertebrates
    
ADD COMMENT
0
Entering edit mode

I have try this but it says that "BLAST Database error: No alias or index file found for nucleotide database [/home/hadoop/NT/nt] in search path [/home/hadoop/NT::]"

ADD REPLY
0
Entering edit mode

NCBI has (or is about to) stopped supporting gi numbers. So any and all these solutions that use gi numbers are going to be invalid. My assumption is that is the case here.

Are you trying to create a subset of vertebrate sequences from nt database?

ADD REPLY
0
Entering edit mode

I am try to create a nt subset of bacteria for Bowtie2 to mapping. Could you give me some tips?

ADD REPLY
0
Entering edit mode
9.8 years ago
jcsoellner • 0

Actually I think your own original answer has its merits. When using blastdbcmd to extract sequences via gi

I'm under the impression it would extract the same sequence twice if multiple gi numbers share the same sequence, which is commonly the case.

Unfortunately blastp/blast only seem to allow a list of gi identifiers as positive/negative target selections on the command line via the -gilist and -negative_gilist options.

So this is what I did (which unfortunately seemed to be necessary) and should work for you as well:

  1. Download the NCBI taxonomy file taxdmp.zip (ftp://ftp.ncbi.nih.gov/pub/taxonomy/) and extracted nodes.dmp, use that to recursively build the tree.
  2. Download gi_taxid_prot.zip to map gi to taxonomy identifiers
  3. Identify gi numbers matching your taxonomy restriction. As you have the entire tree you can also easily include/exclude subgroups, e.g. include all eukaryota but exclude plants.

    [Note: 1 to 3 can be simplified if you just download a list of gi identifiers from entrez as described in previous statements; that is certainly preferable if you just want all eukaryota without any exceptions or experimenting with subsets]

  4. Download nr (in FASTA format)
  5. Go through the nr FASTA file sequence by sequence and select only those which match your gi list

While significantly more hassle than using the gi lists and blastdbcmd with -entry_batch option this avoids duplicated sequences. Possibly the problem is less substantial for nucleic acid sequences which you work with, but for proteins the redundancy seems to be a real problem.

I have to say I did not try it with gi lists downloaded from entrez, I directly started with those in gi_taxid_prot.zip, so it might be NCBI provides deredundified identifier lists (in the sense of only one per unique sequence). As they can belong to different taxa I would not suspect that however.

Another option is to BLAST against the entire nr/nt and use only those hits which match your subset of interest. This also does not work automatically but is fairly simple when the taxonomy tree has been parsed and you have the gi to taxon map (for technical reasons I put the latter in a mysql database and queried only those I needed, saves a lot of RAM).

Disadvantage is that it is slower and statistically based on a significantly larger database, but is reasonable if you want to see matches in a large number of diverse taxonomic subgroups or you test diverse subgroups in a series of runs.

ADD COMMENT

Login before adding your answer.

Traffic: 2813 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6