Question: Printing output for missing gids using blastdbcmd
0
gravatar for auninsaw
2.8 years ago by
auninsaw10
auninsaw10 wrote:

Hi - I downloaded a copy of the nr.fasta database from NCBI a couple months ago. I used the program DIAMOND to compare a set of sequences to that database and generate a blast tabular output file. DIAMOND requires the initial database to be in fasta, from which one makes a special DIAMOND formatted database. From that resulting blast tabular file, I can easily get a column of accession ids from hits in a text file like so:

XP_013992594.1
KKF19911.1
XP_006633669.1

Now, I want the description lines for each accession in a column. Once I have that I can merge it with the original tabular output file to have all the fields I want in one file. I read about using the blastdbcmd to do this, but from what I understand that needs a pre-formatted NCBI database. My options were to generate the nr database from the nr.fasta file using makeblastdb, or I could download the nr database again as NCBI pre-formatted. Since NCBI says it is more efficient to use the pre-formatted database, I opted to download the complete nr database yesterday. I then used the following command:

blastdbcmd -db /home/aaron/nr_db/nr -entry_batch ids.txt -outfmt '%g %t' -target_only -out test.txt

Because I am comparing hits from the earlier database to the newer download of nr, some of the accessions records have been removed, and I get the error:

Error: XP_006633669.1: OID not found

That's fine, except I now have an original blast tabular file with say 1,000,000 lines, and I want to merge it with a list of descriptions in a column of say 999,000 lines because 1000 entries were no longer in the database. I would like a way of printing to -out from blastdbcmd "Entry not found" whenever it encounters a missing entry, but I don't see this as an option. Doing that would give me two files of the same number of columns, and I would know what entries were missing based on their description. I assume there is a simple way to script this at the command line but I can't figure it out. I tried this:

blastdbcmd -db /home/aaron/nr_db/nr -entry_batch ids.txt -outfmt '%g %t' | awk '{if($0=="") print "Sequence removed from db"; else print $0}'

but I know that is wrong. I don't know what blastdbcmd is "seeing" when it encounters a missing record. Could someone please provide some guidance? The obvious easy answer would be to just format the original nr.fasta into a blastable database, but I am concerned with that approach NCBI recommends against making the nr database that way versus downloading the pre-formatted version. I appreciate your time - Thanks -

blast • 1.4k views
ADD COMMENTlink modified 11 weeks ago by RamRS19k • written 2.8 years ago by auninsaw10
1
gravatar for genomax
2.8 years ago by
genomax59k
United States
genomax59k wrote:

When you run the batch search you get a clear error for the accession number that is missing.

Error: [blastdbcmd] Entry not found in BLAST database: 'XP_006633669.1'

You can capture these std error outputs in a separate file.

$ blastdbcmd -db /path_to/nr -entry_batch ids -outfmt '%g %t' -target_only -out test.txt 2> dbcmd_err

Then add the id's in this file to real output file.

$ awk -F "'" '{print$2"\tSequence removed from db"}' dbcmd_err
XP_006633669.1  Sequence removed from db
ADD COMMENTlink modified 11 weeks ago by RamRS19k • written 2.8 years ago by genomax59k

Thanks for the response. I need to clarify what I wrote above in that I made a mistake in what I wrote about output I am getting. If I run:

$ blastdbcmd -db /path_to/nr -entry_batch ids -outfmt '%g %t' -out test.txt

I get the following error:

Error: XP_006633669.1: OID not found

If I run with the -target_only flag:

$ blastdbcmd -db /path_to/nr -entry_batch ids -outfmt '%g %t' -target_only -out test.txt

I get this error:

Error: Entry not found in BLAST database

which is repeated 26 times without any associated accession number. I am using blast version 2.2.29+ if that makes a difference. If I append 2> dbcmd_err as suggested, the resulting file entries have no accession number associated with them, just 'Error: Entry not found in BLAST database' repeated 26 times. Any idea why the error message I am getting is not the same?

ADD REPLYlink modified 11 weeks ago by RamRS19k • written 2.8 years ago by auninsaw10

I used blast v. 2.3.0 and I get the following error for each accession not found in the database (last two entries are fake used for testing) with the command line posted above.

Error: [blastdbcmd] Entry not found in BLAST database: 'XP_006633669.1'
Error: [blastdbcmd] Entry not found in BLAST database: 'XZ_006633669.1'
Error: [blastdbcmd] Entry not found in BLAST database: 'XG_006633669.1'

Which includes the accession #. Can you upgrade to 2.3.0?

ADD REPLYlink modified 11 weeks ago by RamRS19k • written 2.8 years ago by genomax59k

Okay, upgrading to the latest version fixed that, and I get the same output as you do now. I can run the commands as you suggested and append the output to the end of the tabular blast file and have the same number rows in each column.

However, my goal was to be able to take the newly generated column of accessions and descriptions and merge it to the tabular file and have the rows line up in perfect original order. The way I have the columns structured now, the orders of data in the columns would not be the same because missing accessions with their descriptions would all be at the bottom of the new column that I am appending to the original tabular blast file. Without merging the tables in R or perl or something, I could have lines in the tabular blast file that are matched with the wrong descriptions.

This newest version of blast prints to screen exactly how I want the output, where it goes record by record and when it sees a missing accession, it prints the error. The earlier version I had before did not do that it just printed errors. I tried the following which directs the output to a file but also prints it to screen:

blastdbcmd -db /path_to_nr/nr -entry_batch ids -outfmt '%a %t' -target_only 2>&1 | tee -a output.txt

Now the file output.txt has the error message printed into the output where it occurs instead of having to append it on the end afterwards. That gives me the data in the order I want but it all prints to screen and to the file. Can I alter the command to just print to the file and turn off printing to the screen?

Thanks again for your help.

ADD REPLYlink modified 11 weeks ago by RamRS19k • written 2.8 years ago by auninsaw10
1

The following should write only to the file and suppress writing to screen.

$ blastdbcmd -db /path_to_nr/nr -entry_batch ids -outfmt '%a %t' -target_only 2>&1 | dd of=out.txt
ADD REPLYlink modified 11 weeks ago by RamRS19k • written 2.8 years ago by genomax59k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1047 users visited in the last hour