Question

Make A Custom Blast Library Using The Output Of Another Blast Result

4

Entering edit mode

12.4 years ago

Zach Powers ▴ 340

Hi Biostar,

I am working on a microbial gene annotation project and I am interested in taking a large number of sequences (say 20,000) and blasting them against the NR database. However, even with a local copy of nr and a decent computer this will take forever (days, anyway). My sequences, however, are not random so I wanted to make a subset of NR that will allow me to perform a much faster BLAST query.

My sequences are 454 reads of PCR amplicons made with highly degenerate primers targeting a family of proteins so what I would like to do is as follows:

Blast a set of known sequences against nr.
Take the hits from this query and use the hits to make a new blastdb
Blast thousands of sequences against this smaller dataset and enjoy the massive speed gains.

So before reinventing the wheel, I was wondering if there is a trivially easy way to do this that someone knows about, or has done. In the meantime I will try writing a biopython script to run the blast and to use the 'gi' and 'sseqids' of the hits to make a new fasta file.

thanks, zach cp

blast makeblastdb ncbi • 6.3k views

ADD COMMENT • link updated 12.4 years ago by Sujai Kumar ▴ 270 • written 12.4 years ago by Zach Powers ▴ 340

0

Entering edit mode

In step 2 "take those sequences", you mean take the hits to those sequences? Otherwise the procedure doesn't make sense.

ADD REPLY • link 12.4 years ago by Neilfws 49k

0

Entering edit mode

@neilfws - yes thats what I mean. corrected. thanks

ADD REPLY • link 12.4 years ago by Zach Powers ▴ 340

score 7 · Answer 1 · 2011-12-04

7

Entering edit mode

12.4 years ago

Sujai Kumar ▴ 270

No need for a python script. The command line is your friend.

Assuming your initial sequences are in known_sequences.fa, first do the blastx against a local copy of nr (add your favourite blastx parameters) using the blast+ toolkit:

blastx -query known_sequences.fa -db nr -outfmt 6 >known_sequences.blastx.nr.hits.txt

Then, extract the GIs from the second column of the hits file:

cut -f2 known_sequences.blastx.nr.hits.txt | cut -f2 -d "|" >known_sequences.blastx.nr.gids.txt

Then make a blastdb using blastdb_aliastool:

blastdb_aliastool -dbtype prot -gilist known_sequences.blastx.nr.gids.txt -db nr -out nr_subset

ADD COMMENT • link 12.4 years ago by Sujai Kumar ▴ 270

2

Entering edit mode

Sujai just a tip, to extract the GI's instead of using two cut commands you could simply use awk:

 "awk -F"|" '{print $2}' known_sequences.blastx.nr.hits.txt > known_sequences.blastx.nr.gids.txt

Learning just the basics about awk or sed can really save quite some time on the long run.

ADD REPLY • link 12.4 years ago by Random ▴ 160

0

Entering edit mode

nice Sujal, Thanks! I will ty this. One downside ( i will check) of using the GIs is that many of the blast queries hit back large sequences (whole genomes, ESTs) but I think the speed gains will still be awesome ---- time to give it a go.

ADD REPLY • link 12.4 years ago by Zach Powers ▴ 340

0

Entering edit mode

I think you should consider ALchEmiXt's broader suggestions though - a taxon restricted database is more general than just picking the ones your sequences hit.

ADD REPLY • link 12.4 years ago by Sujai Kumar ▴ 270

0

Entering edit mode

Agreed, I was just being lazy ;-)

ADD REPLY • link 12.4 years ago by Sujai Kumar ▴ 270

score 5 · Answer 2 · 2011-12-04

5

Entering edit mode

12.4 years ago

ALchEmiXt ★ 1.9k

If you are interested in just a specific taxon you could limit your BLAST searches by taxonid. Furthermore the blast+ utilities will allow you to generate a subset multiFasta or DB from a list of gi's. More specifically the BLASTDBCMD command.

So there are several ways to go with this;

Limit blast to taxon-group
after gi extraction make a blastDB subset using blastdbcmd command
or usig your gi list retrieve all relevant sequences using Entrez online and use the multi-fasta as input DB in blast+ (it can do that) or convert it into a blastDB.
probably some more ways. Also depends on how often you want to update the subsets.

ADD COMMENT • link 12.4 years ago by ALchEmiXt ★ 1.9k

0

Entering edit mode

In a completely different application, I need to limit my BLAST results to a particular taxon---say, danio rerio. How do I do this from the command line for the c++ version of blastn?

ADD REPLY • link 11.9 years ago by kevin.l.neff ▴ 340

0

Entering edit mode

Oh, found it.

Search the Entrez Protein database (or Nucleotide, etc) with query: "txid7742[ORGN]" or "danio rerio[orgn]

Select "Send to File" and choose format "GI list"

Then, for blastn, us -gilist <file>.

I figured this out from the following thread.

http://www.biostars.org/post/show/6528/vertebrate-subset-nr-database-build-my-own/

ADD REPLY • link 11.9 years ago by kevin.l.neff ▴ 340