Question: Download All Sequenced Genomes (Draft Or Complete) Within A Phylum From NCBI Or JGI?
0
5.8 years ago by
microbeatic80
hanover, NH
microbeatic80 wrote:

The genomes on the NCBI FTP site (ftp://ftp.ncbi.nih.gov/genomes/Bacteria/) are listed in alphabetical order with the BioProject ID at the end, and there is no taxonomic information in the name. Is there a way to download only the genomes that belong to a specific phylum? For example, how do I download all the genome folders that belong to Actinobacteria?

ncbi genome bacteria • 6.4k views
modified 5.8 years ago by Phil S.660 • written 5.8 years ago by microbeatic80
2
5.8 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum122k wrote:

" And, there is no taxonomic information in the name"

wrong: you can find the taxon in ftp://ftp.ncbi.nih.gov/genomes/Bacteria/summary.txt

For example, for the folder Acaryochloris_marina_MBIC11017_uid58167 (the uid58167 suffix matches the ProjectID column):

Accession    GenbankAcc    Length    Taxid    ProjectID    TaxName    Replicon    Create Date    Update Date
NC_009926.1    CP000838.1    374161    329726    58167    Acaryochloris marina MBIC11017    plasmid pREB1    Oct 17 2007    Jun 10 2013  7:03:09:346PM

and in http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=taxonomy&id=329726&retmode=xml

 <Lineage>cellular organisms; Bacteria; Cyanobacteria; Oscillatoriophycideae; Chroococcales; Acaryochloris; Acaryochloris marina</Lineage>
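
A minimal Python sketch of how these two sources could be combined to decide whether a genome belongs to a phylum. It assumes summary.txt is tab-separated with the header shown above and that the taxonomy XML contains a single <Lineage> element. The function names are hypothetical, and the network download is left to the caller:

```python
import csv
import io
import re


def taxids_from_summary(summary_text):
    """Map each Taxid to its TaxName from summary.txt-style text.

    Assumption: columns are tab-separated and the first line is the header
    (Accession, GenbankAcc, Length, Taxid, ProjectID, TaxName, ...).
    """
    reader = csv.DictReader(io.StringIO(summary_text), delimiter="\t")
    return {row["Taxid"]: row["TaxName"] for row in reader}


def taxonomy_url(taxid):
    """Build the efetch taxonomy URL quoted in the answer for one taxid."""
    return ("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
            f"?db=taxonomy&id={taxid}&retmode=xml")


def lineage_from_xml(xml_text):
    """Return the names inside <Lineage>...</Lineage> as a list."""
    match = re.search(r"<Lineage>(.*?)</Lineage>", xml_text, re.S)
    if match is None:
        return []
    return [name.strip() for name in match.group(1).split(";")]


def in_phylum(xml_text, phylum):
    """True if the phylum name appears as one rank of the lineage."""
    return phylum in lineage_from_xml(xml_text)
```

Each taxid from the summary would be passed to taxonomy_url, the response fetched with any HTTP client, and the result filtered with in_phylum(xml, "Actinobacteria").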
written 5.8 years ago by Pierre Lindenbaum122k

Wow, I didn't realize this. My bad. Thanks. Is there summary information for DRAFT too? I scrolled through the folder but didn't see it.

written 5.8 years ago by microbeatic80

For planctomycete_KSU_1_uid163683, I found it in the gbk file ftp://ftp.ncbi.nih.gov/genomes/Bacteria_DRAFT/planctomycete_KSU_1_uid163683/NZ_BAFH00000000.gbk as /db_xref="taxon:247490".

written 5.8 years ago by Pierre Lindenbaum122k

Yes, but that's for an individual genome. A summary like the one on the complete-genome FTP would have been better.

Right now I have an rsync set up between the FTP site and my database, but the list of genomes is only semi-automatic. I can use the summary file on the complete-genome FTP to create a list of Actinobacteria and make the list fully automatic, but I am not sure how I would do that for the draft-genome FTP. Do I have to read in all the .gbk files for each organism in that folder?
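
One possible way to avoid reading every .gbk file in full is to stop at the first /db_xref="taxon:..." line of one file per folder. The Python sketch below assumes a local mirror of Bacteria_DRAFT in which each organism folder holds at least one uncompressed .gbk file. The directory layout and function names are assumptions for illustration:

```python
import re
from pathlib import Path

# Matches the source-feature qualifier, e.g. /db_xref="taxon:247490"
TAXON_RE = re.compile(r'/db_xref="taxon:(\d+)"')


def taxid_from_gbk_lines(lines):
    """Return the first taxon id found in an iterable of GenBank lines."""
    for line in lines:
        match = TAXON_RE.search(line)
        if match:
            return match.group(1)
    return None


def taxids_by_folder(root):
    """Map organism folder name -> taxid for a local mirror under `root`.

    Only files are read until the first taxon qualifier is seen, and
    scanning of a folder stops as soon as one file yields a taxid.
    """
    result = {}
    for folder in sorted(Path(root).iterdir()):
        if not folder.is_dir():
            continue
        for gbk in sorted(folder.glob("*.gbk")):
            with gbk.open(encoding="ascii", errors="replace") as handle:
                taxid = taxid_from_gbk_lines(handle)
            if taxid is not None:
                result[folder.name] = taxid
                break
    return result
```

The resulting folder-to-taxid map can then be filtered by lineage in the same way as the taxids from summary.txt.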

written 5.8 years ago by microbeatic80
0
5.8 years ago by
Phil S.660
Stuttgart, Germany
Phil S.660 wrote:

Hi, maybe this Perl script solves your problem:

# This script downloads all genomes of the given organism in RefSeq and puts them in organism.fa
# Script is taken from: http://www.ncbi.nlm.nih.gov/books/NBK25498/#chapter3.Application_3_Retrieving_large

use strict;
use warnings;
use LWP::Simple;

# organism name comes from the command line; the default is Fungi
my $organism = @ARGV ? $ARGV[0] : 'Fungi';

my $query = $organism.'[orgn]+AND+srcdb_refseq[prop]';
print STDERR "Searching RefSeq for $organism: $query\n";

#assemble the esearch URL
my $base = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/';
my $url = $base . "esearch.fcgi?db=nucleotide&term=$query&usehistory=y";

#post the esearch URL
my $output = get($url);
die "esearch request failed\n" unless defined $output;

#parse WebEnv, QueryKey and Count (# records retrieved)
my ($web)   = $output =~ /<WebEnv>(\S+)<\/WebEnv>/;
my ($key)   = $output =~ /<QueryKey>(\d+)<\/QueryKey>/;
my ($count) = $output =~ /<Count>(\d+)<\/Count>/;
die "Could not parse esearch output\n"
    unless defined $web && defined $key && defined $count;

print STDERR "Found: $count records for $organism\n";
exit(0) if $count == 0;

#open output file for writing
open(my $out, '>', "tmp.$organism.fa") or die "Can't open tmp.$organism.fa: $!\n";

#retrieve data in batches of 500
my $retmax = 500;
for (my $ret = 0; $ret < $count; ) {
    my $efetch_url = $base ."efetch.fcgi?db=nucleotide&WebEnv=$web";
    $efetch_url .= "&query_key=$key&retstart=$ret";
    $efetch_url .= "&retmax=$retmax&rettype=fasta&retmode=text";
    my $efetch_out = get($efetch_url);
    die "efetch request failed at record $ret\n" unless defined $efetch_out;
    # count the sequences returned (one '>' header line per FASTA record)
    my $actual_sequences_returned = () = $efetch_out =~ /^>/mg;
    # stop instead of looping forever if a batch comes back empty
    last if $actual_sequences_returned == 0;
    $ret += $actual_sequences_returned;
    print $out $efetch_out;
    print STDERR "Fetched $ret\n";
}
close $out;

# only give the file its final name once the download has finished
rename("tmp.$organism.fa", "$organism.fa") or die "Can't rename output file: $!\n";

It is used like this:

perl scriptname organismname

In your case:

perl scriptname Actinobacteria

The default behaviour is to download Fungi.

cheers

ps. see also Ncbi Refseq Viral Genomes

modified 5.8 years ago • written 5.8 years ago by Phil S.660

This is downloading sequences inside a category and storing them in one file, not genomes.

written 5.8 years ago by Biojl1.7k

AFAIK this just downloads genome sequences; a first look into the file showed headers like 'xxx complete genome....'
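
A quick way to check this beyond a first look is to tally the FASTA headers in the downloaded file. The Python sketch below counts how many header lines contain the phrase "complete genome". The function name and the choice of phrase are assumptions for illustration, and a header without that phrase is not necessarily a non-genome record:

```python
def tally_headers(fasta_lines, phrase="complete genome"):
    """Return (headers containing `phrase`, total headers) for FASTA lines.

    The comparison is case-insensitive; `phrase` is expected in lower case.
    """
    total = 0
    matching = 0
    for line in fasta_lines:
        if line.startswith(">"):
            total += 1
            if phrase in line.lower():
                matching += 1
    return matching, total
```

For example, opening Actinobacteria.fa and passing the file handle to tally_headers gives the two counts without loading the whole file into memory.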

written 5.8 years ago by Phil S.660