Question

How To Make A Blast Database With Taxonids From Ncbi Query.

4

Entering edit mode

12.4 years ago

Michael 54k

I am seemingly stuck with something that should be very simple and I hope I haven't overlooked something obvious.

Question: How can I make a valid Blast-database with Taxids from a NCBI query export?

What I have tried so far:

For a meta-genomics project I need a custom made blast database which I wish to generate from the result of the following NCBI Nucleotide query:

Viruses[Organism] AND srcdb_refseq[PROP] NOT cellular organisms [ORGN]

The result is 3986 entries which I exported and saved (via 'Send to') in FASTA and ASN1 format. (Both files are seemingly containing the right amount of entries) As this is a meta-genomics project I would love to have the taxon ids in the blastdb.

I was successful with making a valid blast database from the FASTA file using makeblastdb, but the FASTA header doesn't include taxids, hence I tried to make a blast database from the ASN1 export using the following command (it is not clear from the documentation which formats can be used to create the database):

    $ makeblastdb -in AllViralDNARefSeq.asn1 -dbtype nucl  -out ViralASN1 -title  "All Viral RefSeq DNA from NCBI ASN1"

Building a new DB, current time: 12/20/2011 10:37:28
New DB name:   ViralASN1
New DB title:  All Viral RefSeq DNA from NCBI ASN1
Sequence type: Nucleotide
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1073741824B
Adding sequences from FASTA; **added 10 sequences** in 0.00906897 seconds.

As you can see, this does not work as it adds only 10 sequences.

Any help to get in the Taxonids is appreciated it doesn't have to be elegant, I just need the database from that query. I am using Blast+ 2.2.25

blast makeblastdb metagenomics taxonomy • 17k views

ADD COMMENT • link updated 8.1 years ago by 5heikki 11k • written 12.4 years ago by Michael 54k

0

Entering edit mode

Just notice that it is mentioned nowhere in the manual, that makeblastdb supports anything else than FASTA.

ADD REPLY • link 12.4 years ago by Michael 54k

0

Entering edit mode

Isn't it similar (without going through all your text) to this question making a BLAST DB alias based on gi's? http://biostar.stackexchange.com/questions/15047/make-a-custom-blast-library-using-the-output-of-another-blast-result/15050#15050

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 12.4 years ago by ALchEmiXt ★ 1.9k

0

Entering edit mode

Not the same question, no - they want the taxon ID in the fasta file before formatting the database.

ADD REPLY • link 12.4 years ago by Neilfws 49k

3

Entering edit mode

12.4 years ago

Michael 54k

Even though Neil's answer put me on the right track and is therefore marked as the correct answer, I would like to show you my modified solution built on Neil's solution. There is in fact one additional step required, and that is to generate an id-to-taxon mapping file.

Note that it is not possible to add the taxon id to the FASTA id-line in the as e.g. gnl|taxon|9606 as described here, because that will yield an error on duplicate ids.

The following code generates a .fa file and a .map file which contains one mapping of sequence id to taxid per line.

The fasta headers look like this:

>ref|NC_003038  taxon=176652, Invertebrate iridescent virus 6, complete genome.

and include the taxon id in the description part.

use strict;
use warnings;
use Bio::SeqIO;

my $file  = $ARGV[0];
my $seqio = Bio::SeqIO->new(-file => $file, -format => "genbank");
my $fasta = Bio::SeqIO->new(-file => ">$file.fa", -format => "fasta");
open TAXMAP, ">$file.map" || die "couldn't open mapping file: $!\n";

while(my $seq = $seqio->next_seq) {
    my $taxid = "";
    for my $feat($seq->get_SeqFeatures) {
    if($feat->has_tag("db_xref")) {
        for my $id($feat->get_tag_values("db_xref")) {
        if($id =~/taxon:(\d+)/) {
            $taxid = $1;
        }
        }
       }
    }
    my $id = "ref|".$seq->id;
    my $fa = Bio::Seq->new(-id => $id,
                           -desc => " taxon=$taxid, ".$seq->description,
                           -seq => $seq->seq);
    $fasta->write_seq($fa);
    print TAXMAP "$id $taxid\n" if $taxid;
}

Using the output of this script, the database can be built using the following command:

makeblastdb -in viral.1.genomic.gbff.fa -parse_seqids -dbtype nucl \ 
-taxid_map viral.1.genomic.gbff.map

Note that according to this question in versions before 2.2.25+ this doesn't work.

The presence of the taxonids in the DB can be checked using the following command:

blastdbcmd -db viral.1.genomic.gbff.fa -entry all -outfmt "%T"
176652
130760
...

I really hope this is of great help for everybody trying something similar.

ADD COMMENT • link updated 4.4 years ago by Ram 43k • written 12.4 years ago by Michael 54k

1

Entering edit mode

12.4 years ago

Damian Kao 16k

If you type makeblastdb -help. It looks like the new makeblastdb doesn't take in ANS1 format anymore:

-in <File_In>
   Input file/database name; the data type is automatically detected, it may
   be any of the following:
    FASTA file(s) and/or 
    BLAST database(s)

However, towards the bottom there are:

 *** Taxonomy options
 -taxid <Integer, >=0>
   Taxonomy ID to assign to all sequences
    * Incompatible with:  taxid_map
 -taxid_map <File_In>
   Text file mapping sequence IDs to taxonomy IDs.
   Format:<SequenceId> <TaxonomyId><newline>
    * Incompatible with:  taxid
 -logfile <File_Out>
   File to which the program log should be redirected

ADD COMMENT • link updated 4.4 years ago by Ram 43k • written 12.4 years ago by Damian Kao 16k

0

Entering edit mode

That's what I had feared, and making the taxid_map file is more complicated.

ADD REPLY • link 12.4 years ago by Michael 54k

0

Entering edit mode

You don't need to make the taxid_map file yourself, NCBI has already done it for you. The file you want is gi_taxid_nucl.dmp.gz and you can download it from ftp://ftp.ncbi.nih.gov/pub/taxonomy/

ADD REPLY • link 12.3 years ago by Kmcarr ▴ 10

1

Entering edit mode

8.1 years ago

5heikki 11k

So NCBI is phasing out GI numbers. For example, sequences downloaded from ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/ only include accession numbers and titles in headers. So there's this file: ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/wgs.dump.gz, which includes said accessions in column 2 and taxids in column 3. However, when I extract those columns and use the resulting file as input for -taxid_map in makeblastdb, it doesn't work (Error: [makeblastdb] No sequences matched any of the taxids provided.). This is with the latest blast executables (2.3.0+). Rather annoying..

ADD COMMENT • link 8.1 years ago by 5heikki 11k

0

Entering edit mode

I'm getting the same error when using makeblastdb with the -taxid_map option

ADD REPLY • link 7.8 years ago by Sujai Kumar ▴ 270

0

Entering edit mode

This was a solution posted by @5heikki recently: A: How to make a custom blast db with taxon IDs from a taxid_map file

ADD REPLY • link 7.8 years ago by GenoMax 141k

Ram · Accepted Answer · 2011-12-20

6

Entering edit mode

12.4 years ago

Neilfws 49k

I think this one requires some scripting. Here are a few ideas.

First, I'd fetch viral Refseq sequences in Genbank format from the FTP site:

wget ftp://ftp.ncbi.nih.gov/refseq/release/viral/viral.1.genomic.gbff.gz
gunzip viral.1.genomic.gbff.gz
grep -c "^LOCUS" viral.1.genomic.gbff # 3984

Then I'd parse the Genbank file to extract an ID, description and sequence for each entry. Since taxon IDs are contained in a field of the form /db_xref="taxon:NNNN where NNNN = taxon ID, they can be extracted and written into the header of a new fasta file.

Some quick and dirty Bioperl to illustrate:

#!/usr/bin/perl -w

use strict;
use Bio::SeqIO;

my $file  = "viral.1.genomic.gbff";
my $seqio = Bio::SeqIO->new(-file => $file, -format => "genbank");
my $fasta = Bio::SeqIO->new(-file => ">$file.fa", -format => "fasta");

while(my $seq = $seqio->next_seq) {
    my $taxid = "";
    for my $feat($seq->get_SeqFeatures) {
    if($feat->has_tag("db_xref")) {
        for my $id($feat->get_tag_values("db_xref")) {
        if($id =~/taxon:(\d+)/) {
            $taxid = $1;
        }
        }
       }
    }
    my $fa = Bio::Seq->new(-id => $seq->id,
                           -desc => $taxid.", ".$seq->description,
                           -seq => $seq->seq);
    $fasta->write_seq($fa);
}

This gives a fasta file viral.1.genomic.gbff.fa, with headers that look like:

>NC_003038 176652, Invertebrate iridescent virus 6, complete genome.

ADD COMMENT • link updated 4.4 years ago by Ram 43k • written 12.4 years ago by Neilfws 49k

3

Entering edit mode

A general note: When including more than one piece of information in the header, it is better to include keys to the fields added to make it easier to both remember what they are and extract the information later. E.g.:

NC_003038 taxid=176652 desc="Invertebrate iridescent virus 6, complete genome."

Note that the BLAST output can be truncated, so put the most important information right after the ID.

ADD REPLY • link 12.4 years ago by Heikki ▴ 360

0

Entering edit mode

thank you Neil, I was hoping to get this done without Bioperl, but if it is not possible otherwise, i'll give it a try.

ADD REPLY • link 12.4 years ago by Michael 54k

0

Entering edit mode

I'm sure it's equally possible with any Bio* library, or even without them (but they make it a lot easier).

ADD REPLY • link 12.4 years ago by Neilfws 49k

0

Entering edit mode

I would like to try using Bioperl first, but guess what, installing Bioperl locally on that particular computer is a nightmare (CentOS 5.4, perl 5.8.8!, no root access to install anything). I am still trying....

ADD REPLY • link 12.4 years ago by Michael 54k