Question: How to make a custom blast db with taxon IDs from a taxid_map file
0
gravatar for rwn
2.7 years ago by
rwn440
United Kingdom
rwn440 wrote:

Dear all,

I'm attempting to create a custom BLAST database from a dozen or so whole genomes. For downstream analyses it's necessary to have the taxon ID numbers included in the blast db. This seems like it should be simple enough using the -parse_seqids and -taxid_map <taxmap.txt> commands in makeblastdb, but alas, apparently not.

My fasta headers look like:

>HG380758.1
>HG380759.1
>HG380760.1
...

and my taxid_map.txt file looks like:

HG380758.1 104782
HG380759.1 104782
HG380760.1 104782
...

The command I've run is then:

makeblastdb -dbtype nucl -in in.fna -parse_seqids -taxid_map taxid_map.txt

Unfortunately, this returns the error:

Building a new DB, current time: 07/08/2016 11:53:59
New DB name:   in.fna
New DB title:  in.fna
Sequence type: Nucleotide
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 36167 sequences in 8.00621 seconds.
Error: [makeblastdb] No sequences matched any of the taxids provided.

I've read questions How To Make A Blast Database With Taxonids From Ncbi Query. and Ncbi Blast+ Taxid And Taxid_Map (also http://www.verdantforce.com/2014/12/building-blast-databases-with-taxonomy.html), and basically can't see what I'm doing wrong. I have also tried reformatting the fasta headers to include the taxid, as in >HG380758.1 taxon=104782, and including the > seqid prefix in the taxid_map.txt file, but to no avail.

I'm using makeblastdb version 2.3.0+, and I note from previous similar queries the -taxid_map parameter has not always been functional.

Is this a bug with this version of makeblastdb, or am I still doing something wrong? Any help / workarounds would be much appreciated!

blast taxid_map makeblastdb • 2.5k views
ADD COMMENTlink modified 2.3 years ago by xzhan01820 • written 2.7 years ago by rwn440
3
gravatar for 5heikki
2.7 years ago by
5heikki8.3k
Finland
5heikki8.3k wrote:

I asked about this from NCBI. At least with blast 2.3.0+, it is simply not possible with accession numbers, but only GIs (which they're phasing out very soon). They recently updated blast, but the change log didn't mention anything related to this, so I presume it's still the case.

What you can do is to use tabular output (-outfmt 6) and then

export LC_ALL=C LANG=C
join -t $'\t '-1 2 -2 1 -o 1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8,1.9,1.10,1.11,1.12,2.2 \
    <(sort -t $'\t' -k2,2 blastOutput) \
    <(sort -t $'\t' -k1,1 yourMap) \
    > blastOutputWithTaxIdInCol13
ADD COMMENTlink modified 2.7 years ago • written 2.7 years ago by 5heikki8.3k

Thank-you for the info and for the workaround. I feared this would be the case! It would be nice if this was made clearer in the makeblastdb help, but hey ho.

ADD REPLYlink written 2.7 years ago by rwn440

I am using 2.2.30+, accession number is supported.

ADD REPLYlink written 2.3 years ago by xzhan01820
1
gravatar for xzhan018
2.3 years ago by
xzhan01820
xzhan01820 wrote:

I have the same problem for recently downloaded mouse genome fasta file.

Reason: the header format changed and could not be successfully parsed because "|" separator is missing.

My solution is to convert ">NC_000067.6" into ">ref|NC_000067.6|" using sed.

sed -i "s/^>\([^ ]*\) />ref|\1| /g" mouse_genome.fna

Result: error message "No sequences matched any of the taxids provided." is gone and blastcmd can extract taxid as expected.

ADD COMMENTlink modified 2.3 years ago • written 2.3 years ago by xzhan01820
1

This is a workaround for blast 2.2.30+, makeblastdb has no such issue for 2.6.0+

ADD REPLYlink written 23 months ago by xzhan01820

Thanks! Upgrading solved this issue for me too.

ADD REPLYlink written 19 months ago by russianconcussion20
0
gravatar for rwn
2.7 years ago by
rwn440
United Kingdom
rwn440 wrote:

Another way around is to create each species-specific database individually, assigning taxids using the -taxid flag:

$ makeblastdb -dbtype nucl -in species1.fna -taxid 1234
$ makeblastdb -dbtype nucl -in species2.fna -taxid 5678
$ ...

and then aggregate the db's into a single entity using blastdb_aliastool:

$ blastdb_aliastool -dbtype nucl -title mydb -out mydb -dblist "species1.fna species2.fna [etc...]"

BLAST run with -db mydb -outfmt '6 std staxids' will then report the taxids for any hits.

ADD COMMENTlink written 2.7 years ago by rwn440
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1921 users visited in the last hour