I need to download about 40 complete genomes from NCBI and then extract the sequence between specified base-pair positions (for example, from 1000 bp to 3000 bp) from each genome separately. How can I do that? I would also like to BLAST some query sequences against each of the downloaded genomes to check for the presence or absence of the queries.
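# You could make an array of the IDs you need to fetch and loop over it
use strict;
use warnings;
use Bio::DB::GenBank;

my $gb  = Bio::DB::GenBank->new();
my $seq = $gb->get_Seq_by_id('MUSIGHBA1'); # Unique ID

# Coordinates are 1-based and inclusive, as subseq() expects
my @seqCoords = (
    [1, 100],
    [1000, 1100],
);

# Extract the first range; loop over @seqCoords to get them all
my $subseq = $seq->subseq($seqCoords[0][0], $seqCoords[0][1]);

# Then look at the BLAST modules and Bio::SearchIO to see how to start blasting and parsing
# http://www.bioperl.org/wiki/HOWTOs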
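To make the coordinate convention concrete, here is a minimal sketch in core Perl (no BioPerl required) of 1-based, inclusive range extraction, which is the convention subseq() follows. The helper name extract_range and the toy sequence are hypothetical and only for illustration; they are not part of BioPerl.

```perl
use strict;
use warnings;

# Hypothetical helper: return bases $start..$end (1-based, inclusive).
sub extract_range {
    my ($sequence, $start, $end) = @_;
    die "start must be >= 1\n"       if $start < 1;
    die "end must be >= start\n"     if $end < $start;
    die "end is past the sequence\n" if $end > length $sequence;
    # substr() is 0-based, so shift the start by one
    return substr($sequence, $start - 1, $end - $start + 1);
}

# Toy sequence standing in for a downloaded genome
my $genome = 'ACGTACGTAC';
my @ranges = ([1, 4], [5, 10]);

for my $range (@ranges) {
    my ($start, $end) = @$range;
    print "$start-$end\t", extract_range($genome, $start, $end), "\n";
}
```

For a real genome the range would simply be something like (1000, 3000), applied once per downloaded sequence.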
You can download your genomes, build a BLAST database with formatdb, and then extract a second set of sequences using fastacmd.
fastacmd (ncbi/build/fastacmd) has an option -L:
-L Range of sequence to extract (Format: start,stop)
0 in 'start' refers to the beginning of the sequence
0 in 'stop' refers to the end of the sequence [String] Optional
default = 0,0
Then run your blastall query against a second database built from those extracted sequences.
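For the presence/absence check, one possible approach is to run blastall with tabular output (-m 8) and tally which queries have at least one hit. The sketch below uses core Perl only. The function name queries_with_hits, the sample hit lines, and the 90% identity cutoff are assumptions made for illustration, not recommended settings.

```perl
use strict;
use warnings;

# Hypothetical helper: given lines of tabular BLAST output (-m 8), return a
# hash ref of query IDs with at least one hit at or above $min_identity.
# Columns used: 1 = query ID, 2 = subject ID, 3 = percent identity.
sub queries_with_hits {
    my ($lines, $min_identity) = @_;
    my %present;
    for my $line (@$lines) {
        next if $line =~ /^\s*(?:#|$)/;    # skip comments and blank lines
        my ($query, $subject, $identity) = split /\t/, $line;
        $present{$query} = 1 if $identity >= $min_identity;
    }
    return \%present;
}

# Made-up example lines in the 12-column tabular layout
my @hits = (
    "query1\tgenomeA\t98.50\t200\t3\t0\t1\t200\t1500\t1699\t1e-100\t380",
    "query2\tgenomeA\t75.00\t120\t30\t0\t1\t120\t40\t159\t2e-10\t60.2",
);

my $present = queries_with_hits(\@hits, 90);
for my $query (qw(query1 query2 query3)) {
    print "$query\t", ($present->{$query} ? 'present' : 'absent'), "\n";
}
```

Running this once per genome database gives one presence/absence column per genome; the cutoff should be chosen to suit your own data.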