I want to remove the bacterial sequences from the full nt database. Can someone tell me the best way to do that?
As far as I can tell, nt sequences are annotated at the genus level, so the only way to do this may be to collect those names and exclude the ones that are bacterial.
It may be simpler to filter the bacterial hits out of your results afterwards instead?
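A minimal sketch of that post-filtering idea, assuming the search was run with a tabular format whose last column is the subject super-kingdom (for example -outfmt "6 std staxids sskingdoms", which needs the taxdb files next to the database). The function names and the column position are my own assumptions, not anything BLAST prescribes:

```python
import csv
import io


def drop_bacterial_hits(rows, kingdom_col=-1):
    """Yield only the rows whose super-kingdom field does not list Bacteria.

    Assumes the field may hold several values separated by ';'.
    """
    for row in rows:
        if not row:
            continue  # skip blank lines
        kingdoms = row[kingdom_col].split(";")
        if "Bacteria" not in kingdoms:
            yield row


def filter_blast_table(handle, kingdom_col=-1):
    """Read tab-separated BLAST hits from a file-like object and filter them."""
    reader = csv.reader(handle, delimiter="\t")
    return list(drop_bacterial_hits(reader, kingdom_col))


if __name__ == "__main__":
    # Tiny made-up table: query, subject, identity, super-kingdom.
    demo = io.StringIO("q1\ts1\t99.0\tBacteria\nq1\ts2\t98.0\tEukaryota\n")
    for hit in filter_blast_table(demo):
        print("\t".join(hit))
```

Hits that carry both a bacterial and a non-bacterial label are dropped here; whether that is what you want depends on your data.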
As @lieven points out below, the -negative_taxids option
Restrict search of database to everything except the specified taxonomy IDs
(multiple IDs delimited by ',')
should work, assuming nt is properly annotated with the bacterial taxID.
Edit: No sequences in nt appear to be annotated with taxID 2 so that idea is not going to work.
Alternatively, if you are using the newest BLAST version, use the taxonomic filtering options and set them to report only eukaryotic hits. There is no need to modify your BLAST database in that case.
EDIT/update: although this seems to work on the NCBI web BLAST, there are indications that it does not work with the local command-line version.
I'm using blast locally
This would work if I listed the IDs of the species to remove. Then again, those IDs can change, so the results would differ over time.
I think restricting a database search to particular taxa should now be possible even in offline BLAST.
See this NCBI webinar
And/or this post: https://ncbiinsights.ncbi.nlm.nih.gov/2019/01/04/blast-2-8-1-with-new-databases-and-better-performance/.
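For what it's worth, here is a rough sketch of how such a local search might be assembled. As far as I know, the taxonomy filtering in BLAST+ releases of the 2.8.1 era only matches the taxIDs attached to the sequences themselves, so a high-level ID such as 2 would first have to be expanded into species-level IDs (the get_species_taxids.sh script shipped with BLAST+ is meant for that, if I remember correctly). The file names bacteria.txids, query.fasta and hits.tsv are placeholders I made up:

```python
import shlex


def build_blastn_command(query, db="nt",
                         negative_taxidlist="bacteria.txids",
                         out="hits.tsv"):
    """Return a blastn argument list that excludes the taxIDs listed in a file.

    Assumes 'negative_taxidlist' is a text file with one species-level
    taxID per line (hypothetical file, prepared beforehand).
    """
    return [
        "blastn",
        "-db", db,
        "-query", query,
        "-negative_taxidlist", negative_taxidlist,
        "-outfmt", "6 std staxids",
        "-out", out,
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so it can be inspected first.
    print(shlex.join(build_blastn_command("query.fasta")))
```

The argument list can be passed to subprocess.run once you are happy with it. This only restricts the search; the database itself is left untouched.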
But if you are after the sequences themselves, I'm not aware of any option to extract them by taxon directly from the nt database. One possible way might be to list all accessions in nt (with blastdbcmd), run them through Entrez or get yourself an accession2taxid table, select the ones you want, and then extract them with blastdbcmd.
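The selection step could look roughly like the sketch below. It assumes the NCBI taxonomy dump file nodes.dmp (fields separated by tab-pipe-tab, taxID first, parent taxID second) and an accession2taxid table with the columns accession, accession.version, taxid and gi. The function names are hypothetical and nothing here is tuned for the size of nt:

```python
def load_parents(nodes_lines):
    """Build a taxid -> parent taxid map from nodes.dmp lines."""
    parents = {}
    for line in nodes_lines:
        fields = line.split("\t|\t")
        if len(fields) < 2:
            continue  # skip malformed lines
        parents[int(fields[0])] = int(fields[1])
    return parents


def is_under(taxid, root, parents):
    """Return True if 'taxid' is 'root' or one of its descendants."""
    seen = set()
    while taxid in parents and taxid not in seen:
        if taxid == root:
            return True
        seen.add(taxid)  # guards against the self-parented root node
        taxid = parents[taxid]
    return taxid == root


def non_bacterial_accessions(acc2taxid_lines, parents, bacteria_root=2):
    """Yield accession.version values whose taxID is not under Bacteria."""
    for line in acc2taxid_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3 or fields[0] == "accession":
            continue  # skip the header and malformed lines
        if not is_under(int(fields[2]), bacteria_root, parents):
            yield fields[1]
```

Writing the yielded accessions to a file, one per line, gives something that blastdbcmd can take through -entry_batch to pull the sequences out of nt. For the full table you would probably want to cache the lookups per taxID.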