Question: Ncbi Refseq Viral Genomes
5
gravatar for Lorddoskias
8.0 years ago by
Lorddoskias130
Lorddoskias130 wrote:

I want to create custom blastdb with all viruses available in the refseq. But I don't know which source files to use. My first point is: ftp://ftp.ncbi.nih.gov/genomes/Viruses/

From research I concluded that I might need the

all.fna.tar.gz

file, since it supposedly contains nucleotide information for all viruses in the refseq, however it turned out that, for example, the Bluetongue_virus_uid14938 is doesn't have an entry in this archive BUT it has a directory and respectively files if I want download the all.gbk.tar.gz archive.

So my question is which archive (file types) should I use in order to create the most complete database of viruses that are in refseq? SHould I used the fna/ffn and just concatenate the files and send them to makeblastdb OR should I manually parse the .gbk files and create fasta files out of them - involving basically extracing the respective fasta sequences from each .gbk and rebuilding the header?

ncbi blast • 9.2k views
ADD COMMENTlink modified 6.1 years ago by bw.150 • written 8.0 years ago by Lorddoskias130
10
gravatar for Pierre Lindenbaum
8.0 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum123k wrote:

Search NCBI entrez for

"Viruses"[PORG] AND srcdb_refseq[PROP]

http://www.ncbi.nlm.nih.gov/sites/nuccore?term=%22Viruses%22[PORG]+AND+srcdb_refseq[PROP]

download your result as FASTA and create your blast database.

ADD COMMENTlink written 8.0 years ago by Pierre Lindenbaum123k
2

If you follow Pierre's advice you do not need to "go over each of the records" - simply choose "send to file" and save as fasta. Also, if you require sequences from other taxa (fungi, bacteria), you should edit your question, since originally you mentioned only viruses.

ADD REPLYlink written 8.0 years ago by Neilfws48k
1

how can i do this with linux. i mean with command line.

ADD REPLYlink written 6.7 years ago by rehma.ar220
2

Do what - just download, or search and download? The former: use wget + link to file in FTP site. The latter: read about EUtils.

ADD REPLYlink written 6.7 years ago by Neilfws48k
1

@neilfws the "Send to" part was crucial - up until now I've been working with the FTP files. Thanks

ADD REPLYlink written 8.0 years ago by Lorddoskias130

Thanks but this is not really a workable solution because I need to to do this in a fashion which would facilitate bulk additions and not just of 10-15 viruses

ADD REPLYlink written 8.0 years ago by Lorddoskias130

what do you mean with "in a fashion which would facilitate bulk additions..." ? 3946 records: what is missing in this entrez dataset ?

ADD REPLYlink written 8.0 years ago by Pierre Lindenbaum123k

By "bulk addition" I mean i have to be able to automate it somehow - going over each of those records and manually downloading the entries is just not an option. Programatically working with the NBIC's ftp is easy enough, the thing is there are SO MANY files/releases that I don't really know which one contains the information I need. Is there some place which explains WHAT is held WHERE on the NCBI ftp - I've read the README's for the various directories but it still is not clear to me.

ADD REPLYlink written 8.0 years ago by Lorddoskias130

Basically I want to create a single blast db which has ALL viral refseq, all bacterial refseq and all fungi refseq - where can I get those respective db's. I thought that <ftp: ftp.ncbi.nih.gov="" genomes="" <whatever-i'm=""> looking for> would suffice but as it turns out - it doesn't.

ADD REPLYlink written 8.0 years ago by Lorddoskias130
1
gravatar for bw.
6.1 years ago by
bw.150
San Francisco
bw.150 wrote:

Downloading via "Send To" didn't give the full fasta - the file only contained 640 sequences out of over 4000 (not sure why).

There's a perl script shown in EUtils docs that should do the job (http://www.ncbi.nlm.nih.gov/books/NBK25498/#chapter3.Application_3_Retrieving_large), but it assumes that the number of sequences returned by NCBI servers on every 'get' call is equal to the value of the "retmax" url arg - which turns out not to be the case - sometimes they return less, which causes the script to download less sequences overall.

I fixed it so that it retrieves all 4944 viral sequences:

# This script downloads all viral genomes in RefSeq and puts them in viruses.fa
# Script is taken from: http://www.ncbi.nlm.nih.gov/books/NBK25498/#chapter3.Application_3_Retrieving_large 

use LWP::Simple;


if($#ARGV + 1 > 0) {
    $organism = $ARGV[0];
} else {
    $organism = 'viruses';
}

$query = $organism.'[orgn]+AND+srcdb_refseq[prop]';
print STDERR "Searching RefSeq for $organism: $query\n";
#assemble the esearch URL
$base = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/';
$url = $base . "esearch.fcgi?db=nucleotide&term=$query&usehistory=y";


#post the esearch URL
$output = get($url);


#parse WebEnv, QueryKey and Count (# records retrieved)
$web = $1 if ($output =~ /<WebEnv>(\S+)<\/WebEnv>/);
$key = $1 if ($output =~ /<QueryKey>(\d+)<\/QueryKey>/);
$count = $1 if ($output =~ /<Count>(\d+)<\/Count>/);

print STDERR "Found: $count records for $organism\n"; 
if($count == 0) {
    exit(0);
}

#open output file for writing
open(OUT, ">tmp.$organism.fa") || die "Can't open file!\n";


#retrieve data in batches of 500
$retmax = 500;
for ($ret = 0; $ret < $count; ) {
    $efetch_url = $base ."efetch.fcgi?db=nucleotide&WebEnv=$web";
    $efetch_url .= "&query_key=$key&retstart=$ret";
    $efetch_url .= "&retmax=$retmax&rettype=fasta&retmode=text";
    $efetch_out = get($efetch_url);
    $actual_sequences_returned = $efetch_out =~ s/>/\n>/g;  # count number of sequences returned
    $ret += $actual_sequences_returned;
    print OUT "$efetch_out";
    print STDERR "Fetched $ret\n";
}
close OUT;

rename("tmp.$organism.fa", "$organism.fa");
ADD COMMENTlink modified 6.1 years ago • written 6.1 years ago by bw.150

I did get the full Fasta when I followed Pierre's recipe with Send To, but it's useful to know that it may not always work.

ADD REPLYlink written 6.0 years ago by Mikael Huss4.7k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1632 users visited in the last hour