blastn: -negative_taxids not working
3.3 years ago
nyoungb2 ▴ 10

Example query:

>J01859.1 Escherichia coli 16S ribosomal RNA, complete sequence
AAATTGAAGAGTTTGATCATGGCTCAGATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAACGGT
AACAGGAAGAAGCTTGCTCTTTGCTGACGAGTGGCGGACGGGTGAGTAATGTCTGGGAAACTGCCTGATG
GAGGGGGATAACTACTGGAAACGGTAGCTAATACCGCATAACGTCGCAAGACCAAAGAGGGGGACCTTCG
GGCCTCTTGCCATCGGATGTGCCCAGATGGGATTAGCTAGTAGGTGGGGTAACGGCTCACCTAGGCGACG
ATCCCTAGCTGGTCTGAGAGGATGACCAGCCACACTGGAACTGAGACACGGTCCAGACTCCTACGGGAGG
CAGCAGTGGGGAATATTGCACAATGGGCGCAAGCCTGATGCAGCCATGCCGCGTGTATGAAGAAGGCCTT
CGGGTTGTAAAGTACTTTCAGCGGGGAGGAAGGGAGTAAAGTTAATACCTTTGCTCATTGACGTTACCCG
CAGAAGAAGCACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTGCAAGCGTTAATCGGAAT
TACTGGGCGTAAAGCGCACGCAGGCGGTTTGTTAAGTCAGATGTGAAATCCCCGGGCTCAACCTGGGAAC
TGCATCTGATACTGGCAAGCTTGAGTCTCGTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCGT
AGAGATCTGGAGGAATACCGGTGGCGAAGGCGGCCCCCTGGACGAAGACTGACGCTCAGGTGCGAAAGCG
TGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACGATGTCGACTTGGAGGTTGTGCCC
TTGAGGCGTGGCTTCCGGAGCTAACGCGTTAAGTCGACCGCCTGGGGAGTACGGCCGCAAGGTTAAAACT
CAAATGAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGCAACGCGAAGAACCT
TACCTGGTCTTGACATCCACGGAAGTTTTCAGAGATGAGAATGTGCCTTCGGGAACCGTGAGACAGGTGC
TGCATGGCTGTCGTCAGCTCGTGTTGTGAAATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTATCCT
TTGTTGCCAGCGGTCCGGCCGGGAACTCAAAGGAGACTGCCAGTGATAAACTGGAGGAAGGTGGGGATGA
CGTCAAGTCATCATGGCCCTTACGACCAGGGCTACACACGTGCTACAATGGCGCATACAAAGAGAAGCGA
CCTCGCGAGAGCAAGCGGACCTCATAAAGTGCGTCGTAGTCCGGATTGGAGTCTGCAACTCGACTCCATG
AAGTCGGAATCGCTAGTAATCGTGGATCAGAATGCCACGGTGAATACGTTCCCGGGCCTTGTACACACCG
CCCGTCACACCATGGGAGTGGGTTGCAAAAGAAGTAGGTAGCTTAACCTTCGGGAGGGCGCTTACCACTT
TGTGATTCATGACTGGGGTGAAGTCGTAACAAGGTAACCGTAGGGGAACCTGCGGTTGGATCACCTCCTT
A

Example search (tried using either taxid 561 or 562):

$ blastn -negative_taxids 562 -db /ebio/abt3_projects/databases_no-backup/NCBI_blastdb/nt -out escherichia-coli-excluded -query ecoli_16S.fna

Output:

$ head -n 100 escherichia-coli-excluded
BLASTN 2.10.1+


Reference: Zheng Zhang, Scott Schwartz, Lukas Wagner, and Webb
Miller (2000), "A greedy algorithm for aligning DNA sequences", J
Comput Biol 2000; 7(1-2):203-14.



Database: Nucleotide collection (nt)
           63,495,790 sequences; 345,639,582,198 total letters



Query= J01859.1 Escherichia coli 16S ribosomal RNA, complete sequence

Length=1541
                                                                      Score     E
Sequences producing significant alignments:                          (Bits)  Value

CP039834.1 Escherichia coli O157:H7 strain MB41-1 chromosome, com...  2841    0.0
CP042892.1 Escherichia coli O10:H32 strain NMBU-W12E19 chromosome...  2841    0.0
CP038421.1 Escherichia coli O157:H7 strain 7636 chromosome, compl...  2841    0.0
CP038416.1 Escherichia coli O157:H7 strain 3-5-1 chromosome, comp...  2841    0.0
CP038412.1 Escherichia coli O157:H7 strain 493/89 chromosome, com...  2841    0.0
CP038398.1 Escherichia coli O157:H7 strain DEC4E chromosome, comp...  2841    0.0
CP038394.1 Escherichia coli O157:H7 strain DEC5A chromosome, comp...  2841    0.0
CP038389.1 Escherichia coli O157:H7 strain DEC5B chromosome, comp...  2841    0.0
CP038380.1 Escherichia coli O157:H7 strain E32511 chromosome, com...  2841    0.0
CP038374.1 Escherichia coli O157:H7 strain F3113 chromosome, comp...  2841    0.0
CP038366.1 Escherichia coli O157:H7 strain F6667 chromosome, comp...  2841    0.0
CP038360.1 Escherichia coli O157:H7 strain F7386 chromosome, comp...  2841    0.0
CP038353.1 Escherichia coli O157:H7 strain F8797 chromosome, comp...  2841    0.0
CP038351.1 Escherichia coli O157:H7 strain F8798 chromosome, comp...  2841    0.0
CP038346.1 Escherichia coli O157:H7 strain G5295 chromosome, comp...  2841    0.0

-negative_taxids is not excluding Escherichia (coli). Does anyone have any idea why?

I create/update NCBI nt weekly via:

# Download nr
nice -n +15 update_blastdb.pl --passive --timeout 300 --force --verbose nr &> "${DATE}_nr.updatedb.log"
# Untar nr (append to the log so output from every archive is kept)
find . -name "*tar.gz" | xargs -I % bash -c "tar -xzvf % &>> ${DATE}_nr_tar.log"
# Remove the compressed files
find . -name "*tar.gz" | xargs -I % bash -c "rm -f % &>> ${DATE}_nr_rm.log"

# Download nt
nice -n +15 update_blastdb.pl --passive --timeout 300 --force --verbose nt &> "${DATE}_nt.updatedb.log"
# Untar nt (append to the log so output from every archive is kept)
find . -name "*tar.gz" | xargs -I % bash -c "tar -xzvf % &>> ${DATE}_nt_tar.log"
# Remove the compressed files
find . -name "*tar.gz" | xargs -I % bash -c "rm -f % &>> ${DATE}_nt_rm.log"

# Download the taxonomy database
rm -f taxdb.tar.gz
wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/taxdb.tar.gz
tar --overwrite -pzxvf taxdb.tar.gz
rm -f taxdb.tar.gz
blastn taxids blast-plus blast+ • 1.9k views

Thanks for the suggestion! I'm using a version 5 blast db:

$ blastdbcmd -db /ebio/abt3_projects/databases_no-backup/NCBI_blastdb/nt -info
Database: Nucleotide collection (nt)
    63,495,790 sequences; 345,639,582,198 total bases

Date: Dec 7, 2020  12:04 PM Longest sequence: 99,791,824 bases

BLASTDB Version: 5

Ok, actually the problem is (probably) the resolution of that taxid.

From here

BLAST only accepts taxids that are at or below the species level.

A potential solution to your problem is in the linked article: you create a full list of the taxids under taxid 562.
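
A minimal sketch of that approach, assuming taxonkit is installed with its taxonomy dump in place and that `taxonkit list` is run with its default output (taxids only, indented by depth). `build_taxidlist` is a hypothetical helper name and `ecoli.taxids` an illustrative file name; neither comes from the linked article. The helper flattens the list to one taxid per line and makes sure the root taxid itself is included.

```shell
# build_taxidlist ROOT_TAXID
# Read taxids (possibly indented, as printed by `taxonkit list`) from stdin
# and print a clean, de-duplicated, numerically sorted list that always
# contains ROOT_TAXID itself.
build_taxidlist() {
    local root="$1"
    {
        echo "$root"
        # Strip whitespace, then keep only lines that are purely numeric
        # (this also drops the blank lines between trees).
        tr -d ' \t\r' | grep -E '^[0-9]+$'
    } | sort -n -u
}

# Usage (not run here; needs taxonkit and BLAST+ with a version 5 database):
#   taxonkit list --ids 562 | build_taxidlist 562 > ecoli.taxids
#   blastn -negative_taxidlist ecoli.taxids -db nt -query ecoli_16S.fna -out escherichia-coli-excluded
```

Adding the root taxid explicitly is redundant when the input already contains it, but it protects against a list that holds only the descendants.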


Yes, but it may be a special case. It doesn't hurt to try listing all the taxids under it and then giving BLAST that list, as in the linked article.


Why do you think that taxid 562 is a special case? The rank for 562 is listed as species, so why would the taxonomy be special for E. coli? Such inconsistencies wouldn't make much sense, given that users could only discover them by trial and error.


I'm guessing that BLAST only uses the taxid assigned to the target in the database, so if a target has a sub-species taxid (e.g., E. coli K12), then only that one will be filtered by -negative_taxids. If this is true, it should be made much clearer in the docs. For example, the blastn help text says nothing about this:

-negative_taxids <String>
   Restrict search of database to everything except the specified taxonomy IDs
   (multiple IDs delimited by ',')
    * Incompatible with:  gilist, seqidlist, taxids, taxidlist,

...not even that only species or finer-resolved taxonomic levels are used.

Better yet, blast should actually be able to take into account the taxonomic hierarchy. Is it that hard to do? taxonkit and similar tools do it quite well without too much fuss.


Overall the NCBI is awesome, but their documentation is often rather... well, not awesome. I'm not saying this is for sure the solution to your problem, but it wouldn't surprise me at all if it was.


Even when including all 3323 taxids that fall within E. coli (obtained via taxonkit list), I still get all top hits to E. coli with my E. coli 16S search. It appears that -negative_taxids and -negative_taxidlist just don't actually work. It seems that the only way the user can filter the blastn results by taxid is to "manually" post-filter the blast output, which of course is problematic if the user wants the top N hits for taxids not excluded (one cannot just use -max_target_seqs).
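
For reference, such a post-filter could be sketched as below. It assumes the search was run with tabular output that includes the subject taxids, for example -outfmt "6 qseqid sacc staxids bitscore evalue" (an assumed column layout, not one used in this thread), and with -max_target_seqs set high enough that at least N non-excluded hits remain. `filter_hits` is a hypothetical helper name.

```shell
# filter_hits EXCLUDE_FILE N < hits.tsv
# Input columns (tab-separated): qseqid, sacc, staxids, bitscore, evalue.
# EXCLUDE_FILE holds one taxid per line.
# Drops every hit whose staxids field (which may hold several taxids joined
# by ';') contains an excluded taxid, then keeps the first N remaining hits
# per query. Input order is preserved, so the hits are expected to arrive
# best-first for each query.
filter_hits() {
    local exclude="$1" n="$2"
    awk -F'\t' -v n="$n" '
        FILENAME == ARGV[1] { if ($1 != "") bad[$1] = 1; next }  # load exclusions
        {
            m = split($3, ids, ";")
            for (i = 1; i <= m; i++) if (ids[i] in bad) next     # excluded taxid
            if (++kept[$1] <= n) print                           # top N per query
        }
    ' "$exclude" -
}

# Usage (not run here):
#   blastn -db nt -query ecoli_16S.fna -max_target_seqs 5000 \
#          -outfmt "6 qseqid sacc staxids bitscore evalue" | filter_hits ecoli.taxids 10
```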


Is it possible that you are getting E. coli entries that are not annotated with a taxid, or that have the wrong taxid? Since you have done all this work, you could write to the BLAST help desk and ask this question. Post what they say here.


Yeah, I accidentally left out the root taxid from taxonkit list, which is the taxid for the species (561), so I was getting hits to E. coli assigned at the species level instead of at the sub-species level. Thanks for the idea!
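
One way to catch this kind of omission, sketched under the assumption that the taxids of the unwanted hits have been written one per line to a file (for example from the staxids column of tabular output, or with blastdbcmd -outfmt '%T'): print the hit taxids that are absent from the exclusion list. `missing_taxids` is a hypothetical helper name.

```shell
# missing_taxids HIT_TAXIDS_FILE EXCLUDE_FILE
# Print each taxid that occurs in HIT_TAXIDS_FILE but not in EXCLUDE_FILE.
# Both files hold one taxid per line. Anything printed is a taxid that
# BLAST was never asked to exclude.
missing_taxids() {
    awk '
        FILENAME == ARGV[1] { listed[$1] = 1; next }        # exclusion list
        NF && !($1 in listed) && !seen[$1]++ { print $1 }   # unlisted hit taxids, once each
    ' "$2" "$1"
}

# Usage (not run here):
#   missing_taxids hit.taxids ecoli.taxids
```

If the output is non-empty, the exclusion list is incomplete, which points at the list rather than at -negative_taxidlist.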


561 is Escherichia (genus)

562 is Escherichia coli (species)


nyoungb2: Please don't add an answer unless you are actually answering the question. Use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized.


I am seeing this same issue. My taxids are at the species level, and the database is v5. However, it seems to be an issue only in some circumstances. I used a smaller database so that my testing would run faster, and I found that when searching against just one chunk of the nr database (e.g. nr.38), the taxids I specified were correctly removed. But when searching the same query against all of nr with otherwise identical BLAST parameters, the specified taxa were not removed.

3.3 years ago
5heikki 11k

Only version 5 blast databases support that feature. Perhaps the update_blastdb.pl script still fetches version 4 databases by default. See e.g. here
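
A quick way to check this, as a sketch: `blastdbcmd -info` prints a `BLASTDB Version:` line, and the hypothetical helper `blastdb_version` below pulls the number out of that output.

```shell
# blastdb_version
# Read `blastdbcmd -info` output on stdin and print the number from its
# "BLASTDB Version:" line. Prints nothing if that line is absent.
blastdb_version() {
    sed -n 's/^BLASTDB Version:[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# Usage (not run here; pass whatever database path you give to -db):
#   v="$(blastdbcmd -db nt -info | blastdb_version)"
#   [ "$v" = "5" ] || echo "taxid filtering needs a version 5 database" >&2
```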
