Hi - I downloaded a copy of the nr.fasta database from NCBI a couple months ago. I used the program DIAMOND to compare a set of sequences to that database and generate a blast tabular output file. DIAMOND requires the initial database to be in fasta, from which one makes a special DIAMOND formatted database. From that resulting blast tabular file, I can easily get a column of accession ids from hits in a text file like so:
XP_013992594.1
KKF19911.1
XP_006633669.1
Now, I want the description line for each accession in a column. Once I have that, I can merge it with the original tabular output file to have all the fields I want in one file. I read about using blastdbcmd to do this, but from what I understand it needs a pre-formatted NCBI database. My options were to generate the nr database from the nr.fasta file using makeblastdb, or to download the nr database again in NCBI's pre-formatted form. Since NCBI says it is more efficient to use the pre-formatted database, I opted to download the complete nr database yesterday. I then used the following command:
blastdbcmd -db /home/aaron/nr_db/nr -entry_batch ids.txt -outfmt '%g %t' -target_only -out test.txt
Because I am comparing hits from the earlier database against the newer download of nr, some of the accession records have been removed, and I get the error:
Error: XP_006633669.1: OID not found
That's fine, except I now have an original blast tabular file with, say, 1,000,000 lines, and I want to merge it with a column of descriptions that has only, say, 999,000 lines, because 1,000 entries are no longer in the database. I would like blastdbcmd to print "Entry not found" to -out whenever it encounters a missing entry, but I don't see this as an option. Doing that would give me two files with the same number of lines, and I would know which entries were missing based on their description. I assume there is a simple way to script this at the command line, but I can't figure it out. I tried this:
blastdbcmd -db /home/aaron/nr_db/nr -entry_batch ids.txt -outfmt '%g %t' | awk '{if($0=="") print "Sequence removed from db"; else print $0}'
but I know that is wrong. I don't know what blastdbcmd is "seeing" when it encounters a missing record. Could someone please provide some guidance? The obvious easy answer would be to just format the original nr.fasta into a blastable database, but I am concerned about that approach, since NCBI recommends downloading the pre-formatted version rather than building the nr database that way. I appreciate your time - Thanks -
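P.S. In case it helps to show what I am after, here is a rough sketch of the kind of post-processing I have in mind. It is not something blastdbcmd does itself. It assumes I can get blastdbcmd to write the accession and the title separated by a tab into a file (I call it found.txt here; that name and the fill_missing function are made up for illustration), and that the accession it prints matches the versioned form in ids.txt. The function then walks ids.txt in order and prints a placeholder for any accession that has no line in found.txt, so the output has exactly one line per input ID.

```shell
#!/usr/bin/env bash
# Assumed (untested) way to produce found.txt; "Entry not found"-type errors
# for removed records go to stderr, so they are redirected to a log:
#   blastdbcmd -db /home/aaron/nr_db/nr -entry_batch ids.txt \
#       -outfmt $'%a\t%t' -target_only -out found.txt 2> missing.log

# fill_missing IDS_FILE FOUND_FILE
#   IDS_FILE:   one accession per line, in the order of the tabular file
#   FOUND_FILE: "accession<TAB>title" lines for the records still in the db
# Prints "accession<TAB>title" for every line of IDS_FILE, substituting a
# placeholder title when the accession is absent from FOUND_FILE.
fill_missing() {
    awk -v placeholder="Entry not found" '
        # First operand (FOUND_FILE): remember the title for each accession.
        FILENAME == ARGV[1] {
            i = index($0, "\t")
            if (i > 0) title[substr($0, 1, i - 1)] = substr($0, i + 1)
            next
        }
        # Second operand (IDS_FILE): emit one line per ID, in input order.
        {
            if ($0 in title) print $0 "\t" title[$0]
            else             print $0 "\t" placeholder
        }
    ' "$2" "$1"
}
```

I would then call it as fill_missing ids.txt found.txt > descriptions.txt and paste descriptions.txt next to the original tabular file. Because the lookup is keyed on the accession, repeated accessions in ids.txt should each get their own output line.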