How is taxonomy information injected into BLAST databases?
My application requires me to rebuild nr from the FASTA file (ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/) because I need to make some custom changes to the sequence headers.
In that file the headers do not seem to carry any taxonomy information other than a taxon name in brackets, like this: [Bacillus]. That doesn't seem to be enough to perform taxid-based extractions with blastdbcmd like this:
$ blastdbcmd -db nr -entry all -outfmt "%g %T" | \
awk ' { if ($2 == 9606) { print $1 } } ' | \
blastdbcmd -db nr -entry_batch - -out human_sequences.txt
There is an option called -taxid_map in makeblastdb, but where do I get the mapping file?
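As far as I can tell from the makeblastdb help, the mapping file is plain text with one "<sequence ID> <taxid>" pair per line, and -taxid_map also requires -parse_seqids. Below is a rough sketch of what I imagine the procedure looks like, assuming a table laid out like NCBI's prot.accession2taxid (tab-separated: accession, accession.version, taxid, gi, with a header line) is the right source. I have not confirmed that this is what NCBI actually does. The toy file, the FAKE accessions, and the names nr.fasta and nr_custom are made up for illustration.

```shell
# Toy stand-in for an accession-to-taxid table (hypothetical accessions).
# Columns: accession, accession.version, taxid, gi
printf 'accession\taccession.version\ttaxid\tgi\n'  > toy.accession2taxid
printf 'FAKE0001\tFAKE0001.1\t9606\t0\n'           >> toy.accession2taxid
printf 'FAKE0002\tFAKE0002.1\t1386\t0\n'           >> toy.accession2taxid

# Skip the header, keep accession.version and taxid as a two-column map.
awk -F '\t' 'NR > 1 { print $2, $3 }' toy.accession2taxid > taxid_map.txt
cat taxid_map.txt

# The map would then be passed to makeblastdb (left commented out here;
# nr.fasta and nr_custom are placeholder names):
# makeblastdb -in nr.fasta -dbtype prot -parse_seqids \
#     -taxid_map taxid_map.txt -out nr_custom
```

The sequence IDs in the map would have to match the IDs in my edited FASTA headers exactly, which is part of what I am unsure about after changing them.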
I guess a simpler way to ask my question is: what command does NCBI use to build their nr database from the nr FASTA file?