Question

How to make a custom blast db with taxon IDs from a taxid_map file

4

Entering edit mode

7.8 years ago

rwn ▴ 590

Dear all,

I'm attempting to create a custom BLAST database from a dozen or so whole genomes. For downstream analyses it's necessary to have the taxon ID numbers included in the blast db. This seems like it should be simple enough using the -parse_seqids and -taxid_map <taxmap.txt> commands in makeblastdb, but alas, apparently not.

My fasta headers look like:

>HG380758.1
>HG380759.1
>HG380760.1
...

and my taxid_map.txt file looks like:

HG380758.1 104782
HG380759.1 104782
HG380760.1 104782
...

The command I've run is then:

makeblastdb -dbtype nucl -in in.fna -parse_seqids -taxid_map taxid_map.txt

Unfortunately, this returns the error:

Building a new DB, current time: 07/08/2016 11:53:59
New DB name:   in.fna
New DB title:  in.fna
Sequence type: Nucleotide
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 36167 sequences in 8.00621 seconds.
Error: [makeblastdb] No sequences matched any of the taxids provided.

I've read questions How To Make A Blast Database With Taxonids From Ncbi Query. and Ncbi Blast+ Taxid And Taxid_Map (also http://www.verdantforce.com/2014/12/building-blast-databases-with-taxonomy.html), and basically can't see what I'm doing wrong. I have also tried reformatting the fasta headers to include the taxid, as in >HG380758.1 taxon=104782, and including the > seqid prefix in the taxid_map.txt file, but to no avail.

I'm using makeblastdb version 2.3.0+, and I note from previous similar queries the -taxid_map parameter has not always been functional.

Is this a bug with this version of makeblastdb, or am I still doing something wrong? Any help / workarounds would be much appreciated!

blast makeblastdb taxid_map • 10.0k views

ADD COMMENT • link updated 4.3 years ago by ale_abd ▴ 50 • written 7.8 years ago by rwn ▴ 590

score 4 · Answer 1 · 2016-07-08

4

Entering edit mode

7.8 years ago

5heikki 11k

I asked about this from NCBI. At least with blast 2.3.0+, it is simply not possible with accession numbers, but only GIs (which they're phasing out very soon). They recently updated blast, but the change log didn't mention anything related to this, so I presume it's still the case.

What you can do is to use tabular output (-outfmt 6) and then

export LC_ALL=C LANG=C
join -t $'\t '-1 2 -2 1 -o 1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8,1.9,1.10,1.11,1.12,2.2 \
    <(sort -t $'\t' -k2,2 blastOutput) \
    <(sort -t $'\t' -k1,1 yourMap) \
    > blastOutputWithTaxIdInCol13

ADD COMMENT • link 7.8 years ago by 5heikki 11k

0

Entering edit mode

Thank-you for the info and for the workaround. I feared this would be the case! It would be nice if this was made clearer in the makeblastdb help, but hey ho.

ADD REPLY • link 7.8 years ago by rwn ▴ 590

0

Entering edit mode

I am using 2.2.30+, accession number is supported.

ADD REPLY • link 7.4 years ago by xzhan018 ▴ 30

score 2 · Answer 2 · 2016-11-24

2

Entering edit mode

7.4 years ago

xzhan018 ▴ 30

I have the same problem for recently downloaded mouse genome fasta file.

Reason: the header format changed and could not be successfully parsed because "|" separator is missing.

My solution is to convert ">NC_000067.6" into ">ref|NC_000067.6|" using sed.

sed -i "s/^>\([^ ]*\) />ref|\1| /g" mouse_genome.fna

Result: error message "No sequences matched any of the taxids provided." is gone and blastcmd can extract taxid as expected.

ADD COMMENT • link 7.4 years ago by xzhan018 ▴ 30

1

Entering edit mode

This is a workaround for blast 2.2.30+, makeblastdb has no such issue for 2.6.0+

ADD REPLY • link 7.0 years ago by xzhan018 ▴ 30

0

Entering edit mode

Thanks! Upgrading solved this issue for me too.

ADD REPLY • link 6.7 years ago by russianconcussion ▴ 30

score 1 · Answer 3 · 2020-01-09

Is important to bear in mind that makeblastdb command parse fasta sequence ids as >ACC.version, but it seems that the taxid_map option parses the taxid map without taking into account the same format ACC.version, therefore if you want to build a blast database with sequence that looks like:

>HG380758.1
>HG380759.1
>HG380760.1
...

You should provide a taxid map that looks like:

HG380758 104782
HG380759 104782
HG380760 104782

Just do not include the version and the problem should be solved.

For example, Silva database uses INSDC accessions in the form >ACC.from.to so, building a blast database using SSURef r138 release can be done like this:

First, create a mapping file:

zcat taxmap_embl-ebi_ena_ssu_parc_138.txt.gz | awk -F"\t"  'NR>1{print $1" "$6}' > taxmap_embl-ebi_ena_ssu_parc_138.txt

Then, just run makeblastdb command:

zcat SILVA_138_SSURef_tax_silva.fasta.gz | makeblastdb -in -  -out blast2/SSURef -dbtype nucl -parse_seqids -title 'Silva 138 SSURef with tax ids' -taxid_map taxmap_embl-ebi_ena_ssu_ref_138.txt

score 0 · Answer 4 · 2016-07-08

Another way around is to create each species-specific database individually, assigning taxids using the -taxid flag:

$ makeblastdb -dbtype nucl -in species1.fna -taxid 1234
$ makeblastdb -dbtype nucl -in species2.fna -taxid 5678
$ ...

and then aggregate the db's into a single entity using blastdb_aliastool:

$ blastdb_aliastool -dbtype nucl -title mydb -out mydb -dblist "species1.fna species2.fna [etc...]"

BLAST run with -db mydb -outfmt '6 std staxids' will then report the taxids for any hits.