Question

Downloading Fasta Files

13.8 years ago

Mcdenzlix ▴ 50

I need to download about 40 complete genomes from NCBI and then, for each genome separately, filter out the sequences whose length falls within a specified range (for example, between 1000 bp and 3000 bp). I need help on how to do that. I would also like to BLAST some sequences against each of the downloaded genomes to check for the presence or absence of the queries.

Please assist or suggest the best approach.

fasta blast genome sequence • 8.8k views

Updated 16 months ago by Ram 44k • written 13.8 years ago by Mcdenzlix ▴ 50

Ram · Answer 1 · 2010-10-25

Not tested, since you didn't post an example; use this only as a starting point:

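# URL as given in the original post; point it at the directory that actually holds your genomes
URL=http://www.ncbi.org/pub/genomes
# quote the list so that both file names end up in the variable
GENOMELIST="E_coli.fa.gz E_coli_strain2.fa.gz"
INSEQFILE=myLocalFastaFileToBlast.fa
mkdir -p download filtered blast

for i in ${GENOMELIST}; do
  fa=${i%.gz}   # file name once gunzip has removed the .gz suffix
  wget "${URL}/$i" -O "download/$i"
  gunzip -f "download/$i"
  # keep only sequences of 1000 to 3000 bp
  faFilter -minSize=1000 -maxSize=3000 "download/$fa" "filtered/$fa"
  # build a nucleotide BLAST database from the filtered sequences
  formatdb -i "filtered/$fa" -p F
  # search the query file against that database (-d was missing originally)
  blastall -p blastn -d "filtered/$fa" -i "${INSEQFILE}" -o "blast/$fa.blast" -e 0.000001
done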

faFilter is from the UCSC source code collection, see http://genome.ucsc.edu/admin/jk-install.html or also http://genomewiki.ucsc.edu/index.php/The_source_tree
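
If faFilter is not installed, the length filter can be approximated with plain awk. This is only a minimal sketch, not a replacement for the UCSC tool: the function name filter_fasta_by_length is made up for illustration, both bounds are treated as inclusive, and each kept sequence is written back on a single unwrapped line.

```shell
#!/bin/sh
# Hypothetical helper (not part of any toolkit):
#   filter_fasta_by_length MIN MAX < in.fa > out.fa
# Keeps FASTA records whose sequence length is within [MIN, MAX], inclusive.
filter_fasta_by_length() {
  awk -v min="$1" -v max="$2" '
    # print the record collected so far if its length is in range
    function emit() {
      if (header != "" && length(seq) >= min && length(seq) <= max) {
        print header
        print seq
      }
    }
    /^>/ { emit(); header = $0; seq = ""; next }   # new record starts
    { gsub(/[ \t\r]/, ""); seq = seq $0 }          # join wrapped sequence lines
    END { emit() }                                 # do not forget the last record
  '
}
```

Inside the loop above it would be called as filter_fasta_by_length 1000 3000 < "download/$fa" > "filtered/$fa".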

Ram · Answer 2 · 2010-10-25

Per usual, BioPerl has the answer.

http://doc.bioperl.org/releases/bioperl-current/bioperl-live/Bio/DB/GenBank.html

# you could make an array of IDs you need to fetch
use strict;
use warnings;
use Bio::DB::GenBank;

my $gb  = Bio::DB::GenBank->new();
my $seq = $gb->get_Seq_by_id('MUSIGHBA1'); # Unique ID
# subseq() takes 1-based, inclusive coordinates, so the first range starts at 1
my @seqCoords = (
  [1, 100],
  [1000, 1100],
);
my $subseq = $seq->subseq($seqCoords[0][0], $seqCoords[0][1]);
# then, look at the blast modules and SearchIO to see how to start blasting and parsing
# http://www.bioperl.org/wiki/HOWTOs

Ram · Answer 3 · 2010-10-25

You can download your genomes, build a BLAST database with formatdb and then extract a second set of sequences using fastacmd:

ncbi/build/fastacmd has an option -L:

  -L  Range of sequence to extract (Format: start,stop)
      0 in 'start' refers to the beginning of the sequence
      0 in 'stop' refers to the end of the sequence [String]  Optional
    default = 0,0

Then run your blastall query against the second database.
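
To turn the BLAST results into the presence/absence table the question asks for, one option is to have blastall write tabular output (-m 8, where the first column is the query ID) and then check which query IDs appear in it. The sketch below assumes that output format and a plain text file with one query ID per line; the function name presence_absence and the file names are made up for illustration. A query counts as "present" if it has any hit at all under the e-value cutoff used for the search.

```shell
#!/bin/sh
# Hypothetical helper:
#   presence_absence QUERY_IDS_FILE BLAST_TABULAR_FILE
# Prints "<query id><TAB>present|absent" for every ID in QUERY_IDS_FILE.
presence_absence() {
  awk -v hits="$2" '
    BEGIN {
      # load the query IDs (column 1) that have at least one hit;
      # reading in BEGIN keeps an empty hits file from confusing the logic
      while ((getline line < hits) > 0) {
        split(line, f, "\t")
        if (f[1] != "" && f[1] !~ /^#/) seen[f[1]] = 1
      }
      close(hits)
    }
    NF { print $1 "\t" (($1 in seen) ? "present" : "absent") }
  ' "$1"
}
```

The ID list can be taken from the query FASTA headers, for example with grep '^>' queries.fa | sed 's/^>//; s/ .*//' > ids.txt, and the function is then run once per genome: presence_absence ids.txt blast/E_coli.fa.blast.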

score 0 · Answer 4 · 2010-10-25

13.8 years ago

Casbon ★ 3.3k

Might help: http://www.dcode.org/sequences.php
