I am working on a microbial gene annotation project and I am interested in taking a large number of sequences (say 20,000) and blasting them against the NR database. However, even with a local copy of nr and a decent computer this will take forever (days, anyway). My sequences, however, are not random so I wanted to make a subset of NR that will allow me to perform a much faster BLAST query.
My sequences are 454 reads of PCR amplicons made with highly degenerate primers targeting a family of proteins so what I would like to do is as follows:
- Blast a set of known sequences against nr.
- Take the hits from this query and use the hits to make a new blastdb
- Blast thousands of sequences against this smaller dataset and enjoy the massive speed gains.
So before reinventing the wheel, I was wondering if there is a trivially easy way to do this that someone knows about, or has done. In the meantime I will try writing a biopython script to run the blast and to use the 'gi' and 'sseqids' of the hits to make a new fasta file.
thanks, zach cp