Question: How To Make A Blast Database With Taxonids From Ncbi Query.
4
gravatar for Michael Dondrup
7.8 years ago by
Bergen, Norway
Michael Dondrup46k wrote:

I am seemingly stuck with something that should be very simple and I hope I haven't overlooked something obvious.

Question: How can I make a valid Blast-database with Taxids from a NCBI query export?

What I have tried so far:

For a meta-genomics project I need a custom made blast database which I wish to generate from the result of the following NCBI Nucleotide query:

Viruses[Organism] AND srcdb_refseq[PROP] NOT cellular organisms [ORGN]

The result is 3986 entries which I exported and saved (via 'Send to') in FASTA and ASN1 format. (Both files are seemingly containing the right amount of entries) As this is a meta-genomics project I would love to have the taxon ids in the blastdb.

I was successful with making a valid blast database from the FASTA file using makeblastdb, but the FASTA header doesn't include taxids, hence I tried to make a blast database from the ASN1 export using the following command (it is not clear from the documentation which formats can be used to create the database):

    $ makeblastdb -in AllViralDNARefSeq.asn1 -dbtype nucl  -out ViralASN1 -title  "All Viral RefSeq DNA from NCBI ASN1"

Building a new DB, current time: 12/20/2011 10:37:28
New DB name:   ViralASN1
New DB title:  All Viral RefSeq DNA from NCBI ASN1
Sequence type: Nucleotide
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1073741824B
Adding sequences from FASTA; **added 10 sequences** in 0.00906897 seconds.

As you can see, this does not work as it adds only 10 sequences.

Any help to get in the Taxonids is appreciated it doesn't have to be elegant, I just need the database from that query. I am using Blast+ 2.2.25

ADD COMMENTlink modified 3.5 years ago by 5heikki8.5k • written 7.8 years ago by Michael Dondrup46k

Just notice that it is mentioned nowhere in the manual, that makeblastdb supports anything else than FASTA.

ADD REPLYlink written 7.8 years ago by Michael Dondrup46k

Isn't it similar (without going through all your text) to this question making a BLAST DB alias based on gi's? http://biostar.stackexchange.com/questions/15047/make-a-custom-blast-library-using-the-output-of-another-blast-result/15050#15050

ADD REPLYlink written 7.8 years ago by ALchEmiXt1.9k

Not the same question, no - they want the taxon ID in the fasta file before formatting the database.

ADD REPLYlink written 7.8 years ago by Neilfws48k
6
gravatar for Neilfws
7.8 years ago by
Neilfws48k
Sydney, Australia
Neilfws48k wrote:

I think this one requires some scripting. Here are a few ideas.

First, I'd fetch viral Refseq sequences in Genbank format from the FTP site:

wget ftp://ftp.ncbi.nih.gov/refseq/release/viral/viral.1.genomic.gbff.gz
gunzip viral.1.genomic.gbff.gz
grep -c "^LOCUS" viral.1.genomic.gbff # 3984

Then I'd parse the Genbank file to extract an ID, description and sequence for each entry. Since taxon IDs are contained in a field of the form "/db_xref="taxon:NNNN" where NNNN = taxon ID, they can be extracted and written into the header of a new fasta file.

Some quick and dirty Bioperl to illustrate:

#!/usr/bin/perl -w

use strict;
use Bio::SeqIO;

my $file  = "viral.1.genomic.gbff";
my $seqio = Bio::SeqIO->new(-file => $file, -format => "genbank");
my $fasta = Bio::SeqIO->new(-file => ">$file.fa", -format => "fasta");

while(my $seq = $seqio->next_seq) {
    my $taxid = "";
    for my $feat($seq->get_SeqFeatures) {
    if($feat->has_tag("db_xref")) {
        for my $id($feat->get_tag_values("db_xref")) {
        if($id =~/taxon:(\d+)/) {
            $taxid = $1;
        }
        }
       }
    }
    my $fa = Bio::Seq->new(-id => $seq->id,
                           -desc => $taxid.", ".$seq->description,
                           -seq => $seq->seq);
    $fasta->write_seq($fa);
}

This gives a fasta file viral.1.genomic.gbff.fa, with headers that look like:

>NC_003038 176652, Invertebrate iridescent virus 6, complete genome.
ADD COMMENTlink modified 7.8 years ago • written 7.8 years ago by Neilfws48k
3

A general note: When including more than one piece of information in the header, it is better to include keys to the fields added to make it easier to both remember what they are and extract the information later. E.g.:

NC_003038 taxid=176652 desc="Invertebrate iridescent virus 6, complete genome."

Note that the BLAST output can be truncated, so put the most important information right after the ID.

ADD REPLYlink written 7.8 years ago by Heikki360

thank you Neil, I was hoping to get this done without Bioperl, but if it is not possible otherwise, i'll give it a try.

ADD REPLYlink written 7.8 years ago by Michael Dondrup46k

I'm sure it's equally possible with any Bio* library, or even without them (but they make it a lot easier).

ADD REPLYlink written 7.8 years ago by Neilfws48k

I would like to try using Bioperl first, but guess what, installing Bioperl locally on that particular computer is a nightmare (CentOS 5.4, perl 5.8.8!, no root access to install anything). I am still trying....

ADD REPLYlink written 7.8 years ago by Michael Dondrup46k
3
gravatar for Michael Dondrup
7.8 years ago by
Bergen, Norway
Michael Dondrup46k wrote:

Even though Neil's answer put me on the right track and is therefore marked as the correct answer, I would like to show you my modified solution built on Neil's solution. There is in fact one additional step required, and that is to generate an id-to-taxon mapping file.

Note that it is not possible to add the taxon id to the FASTA id-line in the as e.g. gnl|taxon|9606 as described here, because that will yield an error on duplicate ids.

The following code generates a .fa file and a .map file which contains one mapping of sequence id to taxid per line.

The fasta headers look like this:

>ref|NC_003038 taxon=176652, Invertebrate iridescent virus 6, complete genome.

and include the taxon id in the description part.

use strict;
use warnings;
use Bio::SeqIO;

my $file  = $ARGV[0];
my $seqio = Bio::SeqIO->new(-file => $file, -format => "genbank");
my $fasta = Bio::SeqIO->new(-file => ">$file.fa", -format => "fasta");
open TAXMAP, ">$file.map" || die "couldn't open mapping file: $!\n";

while(my $seq = $seqio->next_seq) {
    my $taxid = "";
    for my $feat($seq->get_SeqFeatures) {
    if($feat->has_tag("db_xref")) {
        for my $id($feat->get_tag_values("db_xref")) {
        if($id =~/taxon:(\d+)/) {
            $taxid = $1;
        }
        }
       }
    }
    my $id = "ref|".$seq->id;
    my $fa = Bio::Seq->new(-id => $id,
                           -desc => " taxon=$taxid, ".$seq->description,
                           -seq => $seq->seq);
    $fasta->write_seq($fa);
    print TAXMAP "$id $taxid\n" if $taxid;
}

Using the output of this script, the database can be built using the following command:

makeblastdb -in viral.1.genomic.gbff.fa -parse_seqids -dbtype nucl \ 
-taxid_map viral.1.genomic.gbff.map

Note that according to this question in versions before 2.2.25+ this doesn't work.

The presence of the taxonids in the DB can be checked using the following command:

blastdbcmd -db viral.1.genomic.gbff.fa -entry all -outfmt "%T"
176652
130760
...

I really hope this is of great help for everybody trying something similar.

ADD COMMENTlink modified 2 days ago by RamRS24k • written 7.8 years ago by Michael Dondrup46k
1
gravatar for Damian Kao
7.8 years ago by
Damian Kao15k
USA
Damian Kao15k wrote:

If you type makeblastdb -help. It looks like the new makeblastdb doesn't take in ANS1 format anymore:

-in <File_In>
   Input file/database name; the data type is automatically detected, it may
   be any of the following:
    FASTA file(s) and/or 
    BLAST database(s)

However, towards the bottom there are:

 *** Taxonomy options
 -taxid <Integer, >=0>
   Taxonomy ID to assign to all sequences
    * Incompatible with:  taxid_map
 -taxid_map <File_In>
   Text file mapping sequence IDs to taxonomy IDs.
   Format:<SequenceId> <TaxonomyId><newline>
    * Incompatible with:  taxid
 -logfile <File_Out>
   File to which the program log should be redirected
ADD COMMENTlink written 7.8 years ago by Damian Kao15k

That's what I had feared, and making the taxid_map file is more complicated.

ADD REPLYlink written 7.8 years ago by Michael Dondrup46k

You don't need to make the taxid_map file yourself, NCBI has already done it for you. The file you want is gi_taxid_nucl.dmp.gz and you can download it from ftp://ftp.ncbi.nih.gov/pub/taxonomy/

ADD REPLYlink written 7.8 years ago by Kmcarr10
1
gravatar for 5heikki
3.5 years ago by
5heikki8.5k
Finland
5heikki8.5k wrote:

So NCBI is phasing out GI numbers. For example, sequences downloaded from ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/ only include accession numbers and titles in headers. So there's this file: ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/wgs.dump.gz, which includes said accessions in column 2 and taxids in column 3. However, when I extract those columns and use the resulting file as input for -taxid_map in makeblastdb, it doesn't work (Error: [makeblastdb] No sequences matched any of the taxids provided.). This is with the latest blast executables (2.3.0+). Rather annoying..

ADD COMMENTlink modified 3.5 years ago • written 3.5 years ago by 5heikki8.5k

I'm getting the same error when using makeblastdb with the -taxid_map option

ADD REPLYlink written 3.2 years ago by Sujai Kumar240

This was a solution posted by @5heikki recently: A: How to make a custom blast db with taxon IDs from a taxid_map file

ADD REPLYlink written 3.2 years ago by genomax71k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 849 users visited in the last hour