Question

Download All Sequenced Genomes (Draft Or Complete) Within A Phylum From NCBI Or JGI?

0

10.5 years ago

microbeatic ▴ 80

The genomes on the NCBI FTP site (ftp://ftp.ncbi.nih.gov/genomes/Bacteria/) are listed in alphabetical order, with the BioProject ID at the end. And, there is no taxonomic information in the name. Is there a way to download only the genomes that belong to a specific phylum? For example, how do I download all the genome folders that belong to Actinobacteria?

ncbi genome bacteria • 8.8k views

ADD COMMENT • link updated 10.5 years ago by Phil S. ▴ 700 • written 10.5 years ago by microbeatic ▴ 80

score 2 · Answer 1 · 2013-11-06

2

10.5 years ago

Pierre Lindenbaum 161k

" And, there is no taxonomic information in the name"

Wrong: you can find the taxon ID in ftp://ftp.ncbi.nih.gov/genomes/Bacteria/summary.txt

For example, the entry for Acaryochloris_marina_MBIC11017_uid58167:

Accession    GenbankAcc    Length    Taxid    ProjectID    TaxName    Replicon    Create Date    Update Date
NC_009926.1    CP000838.1    374161    329726    58167    Acaryochloris marina MBIC11017    plasmid pREB1    Oct 17 2007    Jun 10 2013  7:03:09:346PM

The lineage for that Taxid (329726) is then in http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=taxonomy&id=329726&retmode=xml

 <Lineage>cellular organisms; Bacteria; Cyanobacteria; Oscillatoriophycideae; Chroococcales; Acaryochloris; Acaryochloris marina</Lineage>

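To tie the two lookups together, here is a rough Python sketch (not something from the NCBI docs). It assumes summary.txt is tab-separated with the header row shown above, and that the taxonomy XML keeps the lineage in a single Lineage element as in the excerpt. The function names and the stripped-down XML sample are made up for illustration. Folders are matched by their `_uid<ProjectID>` suffix.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Header plus one data row copied from the summary.txt excerpt above
# (ASSUMPTION: the real file is tab-separated).
SUMMARY_SAMPLE = (
    "Accession\tGenbankAcc\tLength\tTaxid\tProjectID\tTaxName\tReplicon\t"
    "Create Date\tUpdate Date\n"
    "NC_009926.1\tCP000838.1\t374161\t329726\t58167\t"
    "Acaryochloris marina MBIC11017\tplasmid pREB1\tOct 17 2007\t"
    "Jun 10 2013  7:03:09:346PM\n"
)

# Stripped-down stand-in for the efetch taxonomy XML (the real record has
# many more elements); only the Lineage text is taken from the thread.
TAXONOMY_SAMPLE = (
    "<TaxaSet><Taxon><TaxId>329726</TaxId>"
    "<Lineage>cellular organisms; Bacteria; Cyanobacteria; "
    "Oscillatoriophycideae; Chroococcales; Acaryochloris; "
    "Acaryochloris marina</Lineage>"
    "</Taxon></TaxaSet>"
)


def read_summary(text):
    """Return one dict per replicon row, keyed by the header names."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def lineage_from_xml(xml_text):
    """Return the names in the first Lineage element as a list."""
    root = ET.fromstring(xml_text)
    lineage = root.findtext(".//Lineage") or ""
    return [name.strip() for name in lineage.split(";") if name.strip()]


def folder_suffix(row):
    """Suffix that the FTP folder name ends with for this summary row."""
    return "_uid" + row["ProjectID"]


rows = read_summary(SUMMARY_SAMPLE)
lineage = lineage_from_xml(TAXONOMY_SAMPLE)
print(rows[0]["Taxid"], folder_suffix(rows[0]), "Actinobacteria" in lineage)
```

In practice you would download summary.txt once, fetch the lineage once per distinct Taxid, and keep the folders whose lineage contains the phylum you want.
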
ADD COMMENT • link 10.5 years ago by Pierre Lindenbaum 161k

0

Wow, I didn't realize this. My bad. Thanks. Is there summary information for the DRAFT genomes too? I scrolled through the folder but didn't see it.

ADD REPLY • link 10.5 years ago by microbeatic ▴ 80

0

For planctomycete_KSU_1_uid163683, I found it in the gbk file ftp://ftp.ncbi.nih.gov/genomes/Bacteria_DRAFT/planctomycete_KSU_1_uid163683/NZ_BAFH00000000.gbk: /db_xref="taxon:247490"
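
Pulling that qualifier out of a .gbk file only takes a regular expression. A minimal Python sketch follows; the function name and the sample text are illustrative (only the db_xref line comes from the file above, the coordinates are placeholders). It stops at the first match, so the whole file does not have to be read.

```python
import re

# Matches the taxon cross-reference qualifier, e.g. /db_xref="taxon:247490"
TAXON_RE = re.compile(r'/db_xref="taxon:(\d+)"')

# Made-up, shortened stand-in for a GenBank flat file; only the db_xref
# line is taken from the thread.
GBK_SAMPLE = '''FEATURES             Location/Qualifiers
     source          1..100
                     /db_xref="taxon:247490"
'''


def taxid_from_genbank(lines):
    """Return the first taxon ID found in an iterable of lines, or None."""
    for line in lines:
        match = TAXON_RE.search(line)
        if match:
            return int(match.group(1))
    return None


print(taxid_from_genbank(GBK_SAMPLE.splitlines()))
```

For a real file you can pass the open file handle directly, e.g. `with open(path) as handle: taxid_from_genbank(handle)`.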

ADD REPLY • link 10.5 years ago by Pierre Lindenbaum 161k

0

Yes, that's for an individual genome. A summary like the one on the complete-genome FTP would have been better.

Right now I have an rsync set up between the FTP site and my database, but building the list of genomes is only semi-automatic. I can use the summary file on the complete-genome FTP to create a list of the Actinobacteria and make that list fully automatic, but I am unsure how to do it for the draft-genome FTP. Do I have to read in all the .gbk files for each organism in that folder?
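
One possible way to automate the draft list, sketched in Python under a loud assumption: every .gbk file in a draft folder carries the same taxon, so reading one file per folder would be enough. Given a folder-to-Taxid map and a Taxid-to-lineage map (both hypothetical inputs you would build yourself), the selection step could look like this. The second folder name and its lineage are invented placeholders.

```python
def select_folders(folder_taxids, taxid_lineages, group):
    """Return the sorted folder names whose lineage contains `group`."""
    selected = []
    for folder, taxid in folder_taxids.items():
        # Folders with an unknown Taxid are skipped rather than guessed.
        if group in taxid_lineages.get(taxid, []):
            selected.append(folder)
    return sorted(selected)


def write_folder_list(folders, path):
    """Write one folder name per line, e.g. for rsync's --files-from."""
    with open(path, "w") as handle:
        for folder in folders:
            handle.write(folder + "\n")


# Illustrative inputs: the first entry uses names from this thread, the
# second is a HYPOTHETICAL folder with a placeholder Taxid and lineage.
FOLDER_TAXIDS = {
    "Acaryochloris_marina_MBIC11017_uid58167": 329726,
    "Hypothetical_actino_uid0": 1,
}
TAXID_LINEAGES = {
    329726: ["cellular organisms", "Bacteria", "Cyanobacteria"],
    1: ["cellular organisms", "Bacteria", "Actinobacteria"],
}

print(select_folders(FOLDER_TAXIDS, TAXID_LINEAGES, "Actinobacteria"))
```

The resulting file can then drive the existing rsync job, so the list never has to be edited by hand.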

ADD REPLY • link 10.5 years ago by microbeatic ▴ 80

score 0 · Answer 2 · 2013-11-06

Hi, maybe this perl script solves your problem:

# Downloads all RefSeq nucleotide records for the given organism and writes them to <organism>.fa
# Script is adapted from: http://www.ncbi.nlm.nih.gov/books/NBK25498/#chapter3.Application_3_Retrieving_large

use strict;
use warnings;
use LWP::Simple;

# organism name from the command line; defaults to Fungi
my $organism = @ARGV ? $ARGV[0] : 'Fungi';

my $query = $organism . '[orgn]+AND+srcdb_refseq[prop]';
print STDERR "Searching RefSeq for $organism: $query\n";

# assemble the esearch URL
# (NCBI now serves E-utilities over https only, so LWP::Protocol::https must be installed)
my $base = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/';
my $url  = $base . "esearch.fcgi?db=nucleotide&term=$query&usehistory=y";

# send the esearch request
my $output = get($url);
die "esearch request failed\n" unless defined $output;

# parse WebEnv, QueryKey and Count (# records retrieved)
my ($web)   = $output =~ /<WebEnv>(\S+)<\/WebEnv>/;
my ($key)   = $output =~ /<QueryKey>(\d+)<\/QueryKey>/;
my ($count) = $output =~ /<Count>(\d+)<\/Count>/;
die "Could not parse esearch output\n"
    unless defined $web && defined $key && defined $count;

print STDERR "Found: $count records for $organism\n";
exit(0) if $count == 0;

# open a temporary output file for writing
open(my $out, '>', "tmp.$organism.fa") or die "Can't open tmp.$organism.fa: $!\n";

# retrieve data in batches of 500
my $retmax = 500;
my $ret    = 0;
while ($ret < $count) {
    my $efetch_url = $base . "efetch.fcgi?db=nucleotide&WebEnv=$web";
    $efetch_url .= "&query_key=$key&retstart=$ret";
    $efetch_url .= "&retmax=$retmax&rettype=fasta&retmode=text";
    my $efetch_out = get($efetch_url);
    die "efetch request failed at record $ret\n" unless defined $efetch_out;

    # count the FASTA header lines actually returned
    my $returned = () = $efetch_out =~ /^>/mg;
    # stop instead of looping forever when a batch comes back empty
    die "No sequences returned at record $ret\n" if $returned == 0;

    $ret += $returned;
    print $out $efetch_out;
    print STDERR "Fetched $ret\n";
}
close $out;

# only give the file its final name once the download is complete
rename("tmp.$organism.fa", "$organism.fa") or die "Can't rename output file: $!\n";

Run it as:

perl scriptname organismname

In your case:

perl scriptname Actinobacteria

With no argument, the script downloads Fungi.

cheers

PS: see also "NCBI RefSeq Viral Genomes".