I work on some protist genomes, and because the protists live in symbiosis with bacteria, I usually have bacterial contamination in the NGS data.
Previously, I used a script to remove bacterial hits from the genomic data. Among other things, the script relied on blastn and blastp, and instead of keeping a subset database of bacteria + archaea, I used GI lists to filter the BLAST searches against nt and nr (with the -gilist and -negative_gilist parameters).
I did not create custom subsets because I thought it was easier to update the GI lists and the full nt and nr databases from time to time than to do all of that and also keep custom subset databases up to date.
Now NCBI is phasing out GI numbers, as I am sure everybody knows, and I am quite stuck with my script. I cannot use accession numbers to filter the results, and I do not know of any way to make a custom subset of the nt or nr database using accession numbers instead of GI numbers. I know that with GI numbers you can use blastdb_aliastool to create a custom database based on a GI list, but is there a way to do the same thing with accession numbers?
For my script it is very important to have the filtering done within BLAST itself, or to have a custom BLAST database based on taxonomy. I use a custom outfmt, and one of the criteria for classifying a sequence as a "contaminant" is a pair of coverage and identity thresholds, which are calculated over the entire sequence length from the values in the BLAST results. This would not work if I did not get the results in the same order as blastn or blastp output them, which would most probably be the case if I used some form of post-filtering.
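For illustration, the whole-sequence coverage and identity calculation could look roughly like the Python sketch below. It is not the actual script. The tabular column layout (qseqid sseqid pident length qstart qend qlen), the function names, and the threshold values are all assumptions made for the example.

```python
from collections import defaultdict


def merged_length(intervals):
    """Count positions covered by 1-based inclusive (start, end) intervals."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end + 1:
            # Close the previous run (if any) and start a new one.
            if cur_end is not None:
                total += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent HSP: extend the current run.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start + 1
    return total


def summarize_hits(lines):
    """Return {(query, subject): (coverage %, identity %)}.

    Assumed (hypothetical) columns, tab-separated:
    qseqid sseqid pident length qstart qend qlen
    """
    hsps = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        qseqid, sseqid, pident, length, qstart, qend, qlen = line.split("\t")[:7]
        start, end = sorted((int(qstart), int(qend)))
        hsps[(qseqid, sseqid)].append(
            (start, end, float(pident), int(length), int(qlen))
        )

    summary = {}
    for key, rows in hsps.items():
        qlen = rows[0][4]
        # Coverage: merged HSP positions relative to the full query length.
        covered = merged_length([(r[0], r[1]) for r in rows])
        # Identity: mean of pident weighted by alignment length.
        aligned = sum(r[3] for r in rows)
        identity = sum(r[2] * r[3] for r in rows) / aligned
        summary[key] = (100.0 * covered / qlen, identity)
    return summary


def is_contaminant(coverage, identity, min_coverage=50.0, min_identity=80.0):
    """Apply illustrative thresholds; the default values are placeholders."""
    return coverage >= min_coverage and identity >= min_identity


example = [
    "q1\ts1\t90.0\t100\t1\t100\t400",
    "q1\ts1\t80.0\t100\t51\t150\t400",
]
summary = summarize_hits(example)
```

Here the two overlapping HSPs are merged before coverage is computed, so the overlapping region of the query is counted only once.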
Thank you in advance for any help.