Question: Make A Custom Blast Library Using The Output Of Another Blast Result
4
gravatar for Zach Powers
7.6 years ago by
Zach Powers340
Zach Powers340 wrote:

Hi Biostar,

I am working on a microbial gene annotation project and I am interested in taking a large number of sequences (say 20,000) and blasting them against the NR database. However, even with a local copy of nr and a decent computer this will take forever (days, anyway). My sequences, however, are not random so I wanted to make a subset of NR that will allow me to perform a much faster BLAST query.

My sequences are 454 reads of PCR amplicons made with highly degenerate primers targeting a family of proteins so what I would like to do is as follows:

  1. Blast a set of known sequences against nr.
  2. Take the hits from this query and use the hits to make a new blastdb
  3. Blast thousands of sequences against this smaller dataset and enjoy the massive speed gains.

So before reinventing the wheel, I was wondering if there is a trivially easy way to do this that someone knows about, or has done. In the meantime I will try writing a biopython script to run the blast and to use the 'gi' and 'sseqids' of the hits to make a new fasta file.

thanks, zach cp

makeblastdb ncbi blast • 4.3k views
ADD COMMENTlink modified 7.6 years ago by Sujai Kumar240 • written 7.6 years ago by Zach Powers340

In step 2 "take those sequences", you mean take the hits to those sequences? Otherwise the procedure doesn't make sense.

ADD REPLYlink written 7.6 years ago by Neilfws48k

@neilfws - yes thats what I mean. corrected. thanks

ADD REPLYlink written 7.6 years ago by Zach Powers340
7
gravatar for Sujai Kumar
7.6 years ago by
Sujai Kumar240
United Kingdom
Sujai Kumar240 wrote:

No need for a python script. The command line is your friend.

Assuming your initial sequences are in known_sequences.fa, first do the blastx against a local copy of nr (add your favourite blastx parameters) using the blast+ toolkit:

blastx -query known_sequences.fa -db nr -outfmt 6 >known_sequences.blastx.nr.hits.txt

Then, extract the GIs from the second column of the hits file:

cut -f2 known_sequences.blastx.nr.hits.txt | cut -f2 -d "|" >known_sequences.blastx.nr.gids.txt

Then make a blastdb using blastdb_aliastool:

blastdb_aliastool -dbtype prot -gilist known_sequences.blastx.nr.gids.txt -db nr -out nr_subset
ADD COMMENTlink written 7.6 years ago by Sujai Kumar240
2

Sujai just a tip, to extract the GI's instead of using two cut commands you could simply use awk:

 "awk -F"|" '{print $2}' known_sequences.blastx.nr.hits.txt > known_sequences.blastx.nr.gids.txt

Learning just the basics about awk or sed can really save quite some time on the long run.

ADD REPLYlink written 7.6 years ago by Random160

nice Sujal, Thanks! I will ty this. One downside ( i will check) of using the GIs is that many of the blast queries hit back large sequences (whole genomes, ESTs) but I think the speed gains will still be awesome ---- time to give it a go.

ADD REPLYlink written 7.6 years ago by Zach Powers340

I think you should consider ALchEmiXt's broader suggestions though - a taxon restricted database is more general than just picking the ones your sequences hit.

ADD REPLYlink written 7.6 years ago by Sujai Kumar240

Agreed, I was just being lazy ;-)

ADD REPLYlink written 7.6 years ago by Sujai Kumar240
5
gravatar for ALchEmiXt
7.6 years ago by
ALchEmiXt1.9k
The Netherlands
ALchEmiXt1.9k wrote:

If you are interested in just a specific taxon you could limit your BLAST searches by taxonid. Furthermore the blast+ utilities will allow you to generate a subset multiFasta or DB from a list of gi's. More specifically the BLASTDBCMD command.

So there are several ways to go with this;

  1. Limit blast to taxon-group
  2. after gi extraction make a blastDB subset using blastdbcmd command
  3. or usig your gi list retrieve all relevant sequences using Entrez online and use the multi-fasta as input DB in blast+ (it can do that) or convert it into a blastDB.
  4. probably some more ways. Also depends on how often you want to update the subsets.
ADD COMMENTlink modified 7.6 years ago • written 7.6 years ago by ALchEmiXt1.9k

In a completely different application, I need to limit my BLAST results to a particular taxon---say, danio rerio. How do I do this from the command line for the c++ version of blastn?

ADD REPLYlink written 7.1 years ago by kevin.l.neff200

Oh, found it.

Search the Entrez Protein database (or Nucleotide, etc) with query: "txid7742[ORGN]" or "danio rerio[orgn]

Select "Send to File" and choose format "GI list"

Then, for blastn, us -gilist <file>.

I figured this out from the following thread.

http://www.biostars.org/post/show/6528/vertebrate-subset-nr-database-build-my-own/

ADD REPLYlink written 7.1 years ago by kevin.l.neff200
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 828 users visited in the last hour