do blast search using gi numbers
2
0
Entering edit mode
3.7 years ago
bitpir ▴ 190

Hi, I'm trying to do an rpsblast using ncbi-blast-2.5.0+. I have a file containing gi numbers as my query and my command line is below.

rpsblast -query t -db results/Cog/Cog -out rps-blast.out -evalue 1e-2 -outfmt 6


Oddly the rpbslast keeps giving me this error:

Warning: [rpsblast] Error initializing remote BLAST database data loader: Protein BLAST database 'Cog/Cog nr' does not exist in the NCBI servers


My query file (t) looks like this. I've tried to remove search using just the numbers (292833481) but it still doesn't solve the problem. gi|292833481 gi|383341230 gi|289693981

However, when I try to the same search but using a fasta file as query, it runs fine and gives the results I need.

rpsblast -query GCF_000005845.2_ASM584v2_protein.faa -db results/Cog/Cog -out rps-blast.out -evalue 1e-2 -outfmt 6


Is there something that I did wrong here? What is the correct way to format query a list of gi's for blast? Thanks!

blast • 2.2k views
2
Entering edit mode

NCBI has stopped using gi numbers externally since September 2016. You should substitute the gi numbers with accession numbers.

0
Entering edit mode

This is probably THE correct answer to this question.

0
Entering edit mode

Thank you for your response! Unfortunately, I didn't see any improvement when I converted my query from gi number to accession (eg EFL06024.1). The error regarding "Protein BLAST database 'Cog/Cog nr' does not exist in the NCBI servers" still there....

0
Entering edit mode

How old is your rps-blast? From https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMDWeb&PAGE_TYPE=BlastNews:

Two BLAST+ features require communication with the NCBI website (and HTTPS). First, the –remote flag sends the search to the NCBI for processing. Second, BLAST can take a sequence ID as a query and retrieve the sequence from the NCBI. You should update your BLAST+ executables by November 9, 2016 to ensure that these features continue to work. More information about the HTTPS transition is available at https://www.ncbi.nlm.nih.gov/home/develop/https-guidance.shtml

0
Entering edit mode

NCBI is hiding GI numbers from inexperienced kids now, but you can still use them in blast queries and eutils. (see below)

0
Entering edit mode

I could reproduce your problem using a valid identifier, maybe it is a bug? Valid identifier worked online.

Fast solution is download the sequences of interest, and use a fasta file.

edit: are you using rpsblast or rpsblast+?

0
Entering edit mode

Hi h.mon, I'm using rpsblast+ from blast+ package 2.5.0 and 2.6.0(latest). Yes, I suppose downloading fasta files will be the quickest solution, although my queries are rather large (>700K). NCBI's CD-search accept gi/accession number as query (https://www.ncbi.nlm.nih.gov/Structure/bwrpsb/bwrpsb.cgi) but they only allow 4000 queries each time :/ I've emailed ncbi-help and will update accordingly. Thanks for your help!

1
Entering edit mode

Shouldn't you edit your post then? If I try rpsblast instead of rpsblast+, I get a lot of errors due to incorrect parsing of the arguments.

0
Entering edit mode
3.7 years ago
bitpir ▴ 190

Hi all, Thank you for the responses. Somehow I am able to use GI numbers as queries for rpsblast using ncbi-blast-2.6.0. This was also tested by Dr. Wayne Matten from NCBI User Services.Here's what I did:

1. Sort and output only unique GI numbers, else a C++ exception error will occur (see ncbi error C++ exception ). My query, looks something like this: gi|545375397 545375577 545375696

makeprofiledb -title SMART.v6.0 -in Smart.pn -out Smart -threshold 9.82 -scale 100.0 -dbtype rps -index true makeprofiledb -title Pfam.v.26.0 -in Pfam.pn -out Pfam -threshold 9.82 -scale 100.0 -dbtype rps -index true makeprofiledb -title COG.v.1.0 -in Cog.pn -out Cog -threshold 9.82 -scale 100.0 -dbtype rps -index true makeprofiledb -title KOG.v.1.0 -in Kog.pn -out Kog -threshold 9.82 -scale 100.0 -dbtype rps -index true makeprofiledb -title CDD.v.3.12 -in Cdd.pn -out Cdd -threshold 9.82 -scale 100.0 -dbtype rps -index true makeprofiledb -title CDD_NCBI.v.3.12 -in Cdd_NCBI.pn -out Cdd_NCBI -threshold 9.82 -scale 100.0 -dbtype rps -index true makeprofiledb -title PRK.v.6.00 -in Prk.pn -out Prk -threshold 9.82 -scale 100.0 -dbtype rps -index true

4. Before running the job, go to the directory where you save the database

5. Run rpsblast:

/ncbi-blast-2.6.0+/bin/rpsblast -query /data/test -db Pfam -out /data/test_result -evalue 0.0001 -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore" -max_target_seqs 20 -num_threads 16

6. As the job runs, I saw this error message:

Warning: [rpsblast] Error initializing remote BLAST database data loader: Protein BLAST database 'Pfam nr' does not exist in the NCBI servers

7. However, the search results are still being saved to my output file. The results look something like this:

gi|545375397|ref|XP_005651884.1| gnl|CDD|279733 28.856 402 238 13 5 376 64 447 1.44e-64 212 gi|545375577|ref|XP_005651955.1| gnl|CDD|306679 36.207 174 108 2 46 219 1 171 3.07e-37 126 gi|545375696|ref|XP_005652002.1| gnl|CDD|306830 43.590 39 22 0 96 134 1 39 1.97e-08 47.4 gi|545375696|ref|XP_005652002.1| gnl|CDD|306830 48.571 35 18 0 58 92 5 39 6.40e-08 46.2 gi|545375696|ref|XP_005652002.1| gnl|CDD|306830 47.368 38 20 0 185 222 2 39 8.56e-07 42.7 gi|545375696|ref|XP_005652002.1| gnl|CDD|306830 31.818 44 25 2 137 180 1 39 4.33e-06 40.8 gi|545375696|ref|XP_005652002.1| gnl|CDD|306830 36.111 36 21 1 9 44 5 38 8.10e-05 37.3

Woohoo! There you have it. I apologize if the answer is disorganized, not very certain how to use the formatting tools :/ Hope this is helpful!

0
Entering edit mode

not very certain how to use the formatting tools

Pretty simple. Highlight code that you want to mark as such and then use the "101" button in the edit window. That should correctly format the text. Note: I am not sure what you did with the text above but I am not able to correctly format it.

I am curious as to how long the gi numbers are going to keep working since NCBI is on record as saying that users need to move away from using them.

0
Entering edit mode
3.7 years ago
piet ★ 1.8k

I do not have the cog database installed on my machine, but some tests with blastp instead of rpsblast indicate, that it is neccessary to put every GI number in a separate line:

echo 'gi|292833481\ngi|383341230\n' | blastp -task blastp-fast -remote -db nr -outfmt 6


The above command will emit lots of error messages, but after 10 min it has finished successfully.

Another test case to prove the ability of the local blast client to lookup GI numbers and retrieve sequences from NCBI :

  echo 'gi|114050348\n' > q.txt
echo 'gi|57284222\n' > s.txt
blastn -query q.txt -subject s.txt -outfmt 6


We specify two sequences by their GI numbers, and then blast the shorter query sequence against the longer subject (like the deprecated bl2seq). The expected outcome is:

   AB234058.1      CP000046.1      99.888  889     1       0       1       889     1666415 1665527 0.0     1637


Nevertheless, I regard this automatic sequence retrieval by the local blast client as not really stable and mature.

inloraj, if you already have a list of GI numbers, than you can easily download the sequences with Eutils efetch, and then feed them into rpsblast.

  wget 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&rettype=fasta&retmode=txt&id=gi|292833481,gi|383341230,gi|292833481' -O queries.fas

0
Entering edit mode

This just confirms it is a bug in command-line rpsblast+. I could run blastp and blastn using accessions, but the same accessions which worked for blastp failed for rpsblast+.