Question: do blast search using gi numbers
bitpir10 wrote:

Hi, I'm trying to do an rpsblast using ncbi-blast-2.5.0+. I have a file containing gi numbers as my query and my command line is below.

rpsblast -query t -db results/Cog/Cog -out rps-blast.out -evalue 1e-2 -outfmt 6

Oddly, rpsblast keeps giving me this error:

Warning: [rpsblast] Error initializing remote BLAST database data loader: Protein BLAST database 'Cog/Cog nr' does not exist in the NCBI servers

My query file (t) looks like this:

gi|292833481 gi|383341230 gi|289693981

I've also tried searching with just the numbers (292833481), but that still doesn't solve the problem.

However, when I try the same search using a FASTA file as the query, it runs fine and gives the results I need.

rpsblast -query GCF_000005845.2_ASM584v2_protein.faa -db results/Cog/Cog -out rps-blast.out -evalue 1e-2 -outfmt 6

Is there something I did wrong here? What is the correct way to format a list of GIs as a query for blast? Thanks!

modified 10 days ago by piet1.4k • written 10 days ago by bitpir10

NCBI has stopped using gi numbers externally since September 2016. You should substitute the gi numbers with accession numbers.

modified 10 days ago • written 10 days ago by genomax31k

This is probably THE correct answer to this question.

written 10 days ago by h.mon7.5k

Thank you for your response! Unfortunately, I didn't see any improvement when I converted my query from GI numbers to accessions (e.g. EFL06024.1). The error "Protein BLAST database 'Cog/Cog nr' does not exist in the NCBI servers" is still there.

written 10 days ago by bitpir10

How old is your rps-blast? From https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMDWeb&PAGE_TYPE=BlastNews:

Two BLAST+ features require communication with the NCBI website (and HTTPS). First, the -remote flag sends the search to the NCBI for processing. Second, BLAST can take a sequence ID as a query and retrieve the sequence from the NCBI. You should update your BLAST+ executables by November 9, 2016 to ensure that these features continue to work. More information about the HTTPS transition is available at https://www.ncbi.nlm.nih.gov/home/develop/https-guidance.shtml

written 10 days ago by h.mon7.5k

NCBI is hiding GI numbers from inexperienced kids now, but you can still use them in blast queries and eutils. (see below)

written 10 days ago by piet1.4k

I could reproduce your problem using a valid identifier, so maybe it is a bug? The valid identifier worked online.

A fast solution is to download the sequences of interest and use a FASTA file.

edit: are you using rpsblast or rpsblast+?

modified 10 days ago • written 10 days ago by h.mon7.5k

Hi h.mon, I'm using rpsblast+ from the blast+ package, versions 2.5.0 and 2.6.0 (latest). Yes, I suppose downloading FASTA files will be the quickest solution, although my query set is rather large (>700K). NCBI's CD-search accepts GI/accession numbers as queries (https://www.ncbi.nlm.nih.gov/Structure/bwrpsb/bwrpsb.cgi), but it only allows 4000 queries at a time :/ I've emailed ncbi-help and will update accordingly. Thanks for your help!
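If the 4000-query limit is the only obstacle, one way to work around it is to split the ID list into files of at most 4000 lines and submit them one by one. This is only a sketch: `split_ids`, `ids.txt` and the `chunk_` prefix are hypothetical names, and it assumes the list already has one ID per line.

```shell
# Hypothetical helper: split an ID list into files of at most N lines each.
# $1 = input file (one ID per line), $2 = output prefix, $3 = max IDs per file
split_ids() {
    # Default to 4000, the per-submission limit mentioned for CD-search above
    split -l "${3:-4000}" "$1" "$2"
}

# Usage (file names are placeholders):
#   split_ids ids.txt chunk_
# This writes chunk_aa, chunk_ab, ... with up to 4000 IDs in each file.
```

With the default two-letter suffixes, `split` can produce 676 files, which is enough for a list of roughly 700K IDs at 4000 per file.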

written 10 days ago by bitpir10

Shouldn't you edit your post then? If I try rpsblast instead of rpsblast+, I get a lot of errors due to incorrect parsing of the arguments.

modified 8 days ago • written 8 days ago by h.mon7.5k
piet1.4k wrote:

I do not have the COG database installed on my machine, but some tests with blastp instead of rpsblast indicate that it is necessary to put every GI number on a separate line:

printf 'gi|292833481\ngi|383341230\n' | blastp -task blastp-fast -remote -db nr -outfmt 6

The above command will emit lots of error messages, but after 10 min it has finished successfully.
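If the query file has several IDs on one line, it can be rewritten with one ID per line before handing it to blast. A minimal sketch, where `one_id_per_line` is a hypothetical helper name and the IDs are the ones from the question:

```shell
# Hypothetical helper: rewrite a whitespace-separated ID list as one ID per line.
one_id_per_line() {
    # tr turns every run of whitespace into a single newline;
    # grep drops any empty line left over from leading whitespace
    tr -s '[:space:]' '\n' | grep -v '^$'
}

# Example: three IDs given on a single line
printf 'gi|292833481 gi|383341230 gi|289693981\n' | one_id_per_line
```

For a file, something like `one_id_per_line < t > t.lines` would do the same, with `t.lines` being a placeholder name.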

Another test case to prove the ability of the local blast client to look up GI numbers and retrieve sequences from NCBI:

  printf 'gi|114050348\n' > q.txt
  printf 'gi|57284222\n' > s.txt
  blastn -query q.txt -subject s.txt -outfmt 6

We specify two sequences by their GI numbers, and then blast the shorter query sequence against the longer subject (like the deprecated bl2seq). The expected outcome is:

   AB234058.1      CP000046.1      99.888  889     1       0       1       889     1666415 1665527 0.0     1637

Nevertheless, I regard this automatic sequence retrieval by the local blast client as not really stable and mature.

inloraj, if you already have a list of GI numbers, then you can easily download the sequences with Eutils efetch and feed them into rpsblast.

  wget 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&rettype=fasta&retmode=txt&id=gi|292833481,gi|383341230,gi|289693981' -O queries.fas
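For a long ID list, a single URL like the one above will not scale, so the IDs would have to be fetched in batches. The sketch below only builds one efetch URL per batch; `efetch_urls` is a hypothetical helper, and the batch size of 200 is an assumption rather than a documented NCBI limit. It uses `retmode=text`, the value given in the efetch documentation.

```shell
# Hypothetical helper: read IDs (one per line) on stdin and print one efetch
# URL per batch. $1 = IDs per URL (default 200 is an assumption, not an NCBI limit).
efetch_urls() {
    batch_size=${1:-200}
    base='https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi'
    # xargs groups the IDs into space-separated batches, one batch per line
    xargs -n "$batch_size" | while read -r batch; do
        # Join the IDs of one batch with commas, as in the wget example above
        ids=$(printf '%s' "$batch" | tr ' ' ',')
        printf '%s?db=protein&rettype=fasta&retmode=text&id=%s\n' "$base" "$ids"
    done
}

# Example: three IDs in batches of two give two URLs
printf 'gi|292833481\ngi|383341230\ngi|289693981\n' | efetch_urls 2

# Possible download loop (not run here; ids.txt is a placeholder, and the
# pause between requests is there to stay polite towards NCBI):
#   efetch_urls 200 < ids.txt | while read -r url; do
#       wget -q -O - "$url"; sleep 1
#   done > queries.fas
```

The resulting queries.fas can then be used with `rpsblast -query queries.fas`, exactly like the FASTA file that worked in the question.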
written 10 days ago by piet1.4k

This just confirms it is a bug in command-line rpsblast+. I could run blastp and blastn using accessions, but the same accessions which worked for blastp failed for rpsblast+.

written 8 days ago by h.mon7.5k