Make A Custom Blast Library Using The Output Of Another Blast Result
2
4
Entering edit mode
12.4 years ago
Zach Powers ▴ 340

Hi Biostar,

I am working on a microbial gene annotation project and I am interested in taking a large number of sequences (say 20,000) and blasting them against the NR database. However, even with a local copy of nr and a decent computer this will take forever (days, anyway). My sequences, however, are not random so I wanted to make a subset of NR that will allow me to perform a much faster BLAST query.

My sequences are 454 reads of PCR amplicons made with highly degenerate primers targeting a family of proteins so what I would like to do is as follows:

  1. Blast a set of known sequences against nr.
  2. Take the hits from this query and use the hits to make a new blastdb
  3. Blast thousands of sequences against this smaller dataset and enjoy the massive speed gains.

So before reinventing the wheel, I was wondering if there is a trivially easy way to do this that someone knows about, or has done. In the meantime I will try writing a biopython script to run the blast and to use the 'gi' and 'sseqids' of the hits to make a new fasta file.

thanks, zach cp

blast makeblastdb ncbi • 6.3k views
ADD COMMENT
0
Entering edit mode

In step 2 "take those sequences", you mean take the hits to those sequences? Otherwise the procedure doesn't make sense.

ADD REPLY
0
Entering edit mode

@neilfws - yes thats what I mean. corrected. thanks

ADD REPLY
7
Entering edit mode
12.4 years ago
Sujai Kumar ▴ 270

No need for a python script. The command line is your friend.

Assuming your initial sequences are in known_sequences.fa, first do the blastx against a local copy of nr (add your favourite blastx parameters) using the blast+ toolkit:

blastx -query known_sequences.fa -db nr -outfmt 6 >known_sequences.blastx.nr.hits.txt

Then, extract the GIs from the second column of the hits file:

cut -f2 known_sequences.blastx.nr.hits.txt | cut -f2 -d "|" >known_sequences.blastx.nr.gids.txt

Then make a blastdb using blastdb_aliastool:

blastdb_aliastool -dbtype prot -gilist known_sequences.blastx.nr.gids.txt -db nr -out nr_subset
ADD COMMENT
2
Entering edit mode

Sujai just a tip, to extract the GI's instead of using two cut commands you could simply use awk:

 "awk -F"|" '{print $2}' known_sequences.blastx.nr.hits.txt > known_sequences.blastx.nr.gids.txt

Learning just the basics about awk or sed can really save quite some time on the long run.

ADD REPLY
0
Entering edit mode

nice Sujal, Thanks! I will ty this. One downside ( i will check) of using the GIs is that many of the blast queries hit back large sequences (whole genomes, ESTs) but I think the speed gains will still be awesome ---- time to give it a go.

ADD REPLY
0
Entering edit mode

I think you should consider ALchEmiXt's broader suggestions though - a taxon restricted database is more general than just picking the ones your sequences hit.

ADD REPLY
0
Entering edit mode

Agreed, I was just being lazy ;-)

ADD REPLY
5
Entering edit mode
12.4 years ago
ALchEmiXt ★ 1.9k

If you are interested in just a specific taxon you could limit your BLAST searches by taxonid. Furthermore the blast+ utilities will allow you to generate a subset multiFasta or DB from a list of gi's. More specifically the BLASTDBCMD command.

So there are several ways to go with this;

  1. Limit blast to taxon-group
  2. after gi extraction make a blastDB subset using blastdbcmd command
  3. or usig your gi list retrieve all relevant sequences using Entrez online and use the multi-fasta as input DB in blast+ (it can do that) or convert it into a blastDB.
  4. probably some more ways. Also depends on how often you want to update the subsets.
ADD COMMENT
0
Entering edit mode

In a completely different application, I need to limit my BLAST results to a particular taxon---say, danio rerio. How do I do this from the command line for the c++ version of blastn?

ADD REPLY
0
Entering edit mode

Oh, found it.

Search the Entrez Protein database (or Nucleotide, etc) with query: "txid7742[ORGN]" or "danio rerio[orgn]

Select "Send to File" and choose format "GI list"

Then, for blastn, us -gilist <file>.

I figured this out from the following thread.

http://www.biostars.org/post/show/6528/vertebrate-subset-nr-database-build-my-own/

ADD REPLY

Login before adding your answer.

Traffic: 1869 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6