Question: Downloading Fasta Files
9.1 years ago by
Mcdenzlix50
Mcdenzlix50 wrote:

I need to download about 40 complete genomes from NCBI and then filter out the sequences between specified base-pair positions (for example, between 1000 bp and 3000 bp) from each genome separately. I need help on how to do that. I would also like to BLAST some sequences against each of the downloaded genomes to check for presence or absence of the queries. Please assist or suggest the best guidelines.

modified 5.7 years ago by Biostar • written 9.1 years ago by Mcdenzlix50
9.1 years ago by
UCSC
Maximilian Haeussler1.3k wrote:

Not tested, as you didn't post an example; use this only as a starting point:

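# Base URL and file names are examples; replace them with the real location of your genomes
URL=http://www.ncbi.org/pub/genomes
GENOMELIST="E_coli.fa.gz E_coli_strain2.fa.gz"
INSEQFILE=myLocalFastaFileToBlast.fa
mkdir -p download filtered blast

for i in ${GENOMELIST}; do
  f=${i%.gz}   # file name after decompression
  wget "${URL}/$i" -O "download/$i"
  gunzip -f "download/$i"
  # keep only sequences between 1000 and 3000 bp long
  faFilter -minSize=1000 -maxSize=3000 "download/$f" "filtered/$f"
  # build a nucleotide BLAST database from the filtered sequences
  formatdb -i "filtered/$f" -p F
  # search the query sequences against that database
  blastall -p blastn -d "filtered/$f" -i "${INSEQFILE}" -o "blast/$f.blast" -e 0.000001
done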

faFilter is from the UCSC source code collection, see http://genome.ucsc.edu/admin/jk-install.html or also http://genomewiki.ucsc.edu/index.php/The_source_tree
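
If faFilter is not installed, the length filter in the loop above can be approximated with plain awk. This is only a minimal sketch: `filter_by_length` is a hypothetical helper name, not part of the UCSC tools, and it assumes ordinary multi-line FASTA input on stdin.

```shell
# filter_by_length MIN MAX < in.fa > out.fa
# Keeps only the FASTA records whose total sequence length is
# between MIN and MAX (inclusive). Hypothetical helper.
filter_by_length() {
  awk -v min="$1" -v max="$2" '
    # print the record collected so far if its length is in range
    function emit() {
      if (header != "" && len >= min && len <= max) {
        printf "%s\n%s", header, body
      }
    }
    /^>/ { emit(); header = $0; body = ""; len = 0; next }
    {
      body = body $0 "\n"
      len += length($0)
    }
    END { emit() }
  '
}
```

Usage would look like `filter_by_length 1000 3000 < download/E_coli.fa > filtered/E_coli.fa`. Note that this filters whole records by length; it does not cut a region out of a longer sequence.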

modified 11 weeks ago by RamRS24k • written 9.1 years ago by Maximilian Haeussler1.3k
9.1 years ago by
Lee Katz3.0k
Atlanta, GA
Lee Katz3.0k wrote:

Per usual, BioPerl has the answer.

http://doc.bioperl.org/releases/bioperl-current/bioperl-live/Bio/DB/GenBank.html

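# you could make an array of IDs you need to fetch
use strict;
use warnings;
use Bio::DB::GenBank;

my $gb  = Bio::DB::GenBank->new();
my $seq = $gb->get_Seq_by_id('MUSIGHBA1'); # Unique ID
# subseq() coordinates are 1-based and inclusive
my @seqCoords = (
  [1, 100],
  [1000, 1100],
);
# extract the first coordinate pair from the fetched sequence
my $subseq = $seq->subseq($seqCoords[0][0], $seqCoords[0][1]);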
# then, look at the blast modules and SearchIO to see how to start blasting and parsing
# http://www.bioperl.org/wiki/HOWTOs
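
If all you need from the parsing step is a presence/absence table, a full SearchIO parser may be more than necessary. Here is a rough sketch in shell, assuming BLAST was run with tabular output (`-m 8` in legacy blastall), so that column 1 of each result line is the query ID. `presence_absence` is a hypothetical helper name, and any hit that passed the e-value cutoff used at search time counts as "present".

```shell
# presence_absence QUERY_FASTA RESULT_FILE...
# For each tabular BLAST result file, prints one line per query:
#   <result file> TAB <query id> TAB present|absent
presence_absence() {
  queries=$1
  shift
  for result in "$@"; do
    awk -v name="$result" '
      # first file: collect query IDs from the FASTA headers
      NR == FNR {
        if (/^>/) { order[++n] = substr($1, 2) }
        next
      }
      # second file: remember every query that has at least one hit
      !/^#/ && NF >= 12 { hit[$1] = 1 }
      END {
        for (i = 1; i <= n; i++) {
          status = (order[i] in hit) ? "present" : "absent"
          printf "%s\t%s\t%s\n", name, order[i], status
        }
      }
    ' "$queries" "$result"
  done
}
```

Called as `presence_absence myLocalFastaFileToBlast.fa blast/*.blast`, this would give one row per genome and query. You may want to add identity or alignment-length thresholds on the other columns before calling a query present.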
modified 11 weeks ago by RamRS24k • written 9.1 years ago by Lee Katz3.0k
9.1 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum124k wrote:

You can download your genomes, build a BLAST database with formatdb and then extract a second set of sequences using fastacmd:

fastacmd (ncbi/build/fastacmd) has an option -L:

  -L  Range of sequence to extract (Format: start,stop)
      0 in 'start' refers to the beginning of the sequence
      0 in 'stop' refers to the end of the sequence [String]  Optional
    default = 0,0

Then build a second database from the extracted sequences with formatdb and run your blastall query against it.
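
To make the start,stop semantics of -L concrete, here is a small awk sketch that does the same kind of range extraction on a plain FASTA file. It is not fastacmd itself: `extract_range` is a hypothetical helper, and I am assuming coordinates are 1-based and inclusive, with 0 meaning "from the beginning" or "to the end" as in the help text above.

```shell
# extract_range START STOP < in.fa > out.fa
# For every FASTA record, prints the header and bases START..STOP
# (1-based, inclusive; 0 = beginning for START, end for STOP).
# The extracted sequence is written on a single line.
extract_range() {
  awk -v start="$1" -v stop="$2" '
    function emit(   from, to) {
      if (header == "") return
      from = (start + 0 == 0) ? 1 : start + 0
      # clip STOP to the sequence length
      to = (stop + 0 == 0 || stop + 0 > length(seq)) ? length(seq) : stop + 0
      print header
      if (to >= from) print substr(seq, from, to - from + 1)
    }
    /^>/ { emit(); header = $0; seq = ""; next }
    { seq = seq $0 }
    END { emit() }
  '
}
```

For the question's example this would be `extract_range 1000 3000 < genome.fa > genome_1000_3000.fa`, after which the output can be passed to formatdb like any other FASTA file.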

modified 11 weeks ago by RamRS24k • written 9.1 years ago by Pierre Lindenbaum124k
9.1 years ago by
Casbon3.2k
Casbon3.2k wrote:

Might help: http://www.dcode.org/sequences.php

written 9.1 years ago by Casbon3.2k