Question: Blastp - through R
0
gravatar for maahpishanu
3.5 years ago by
maahpishanu0 wrote:

Hi,

I'm supposed to do reciprocal best hit using blastp for a long list of proteins for which I'm given only the UniProtKB/TrEMBL identifier. I have to find the ortholog proteins for each of the given proteins. I also have to find orthologs in a given list of species.

Previously, when I had only one Protein, I would do it through online tool in NCBI, there I can give the identifier and also limit the species. However, with more proteins in hand, this is not feasible anymore.

I downloaded Blast, but I cannot download any databases (memory problem, and I need refseq_protein database).

I'm familiar with R and would like to do it in R, but sending online queries to the NCBI blastp.

I searched a lot and was not able to find hint.

I installed orthologr, biomaRt and some other packages, but they only used local databases.

I would appreciate it a lot if somebody could help me how I can send online query through R. I have an R code and somewhere in the code I need to call blastp to find orthologs of my proteins among the given list of species.

 

Thanks and looking forward.

Maah

blast tool R • 2.6k views
ADD COMMENTlink modified 3.5 years ago • written 3.5 years ago by maahpishanu0
0
gravatar for Michael Dondrup
3.5 years ago by
Bergen, Norway
Michael Dondrup46k wrote:

There is no advantage in using R to start a long running blast process, instead just start Blast+ from the commandline using the -remote option, that does what you describe above. R does not have a good blast parser unlike perl and python, you are bound to the tabular formats (6,7, or 10).

ftp://ftp.ncbi.nlm.nih.gov/blast/documents/blastdb.html has an overview of available blast databases. See the manual pages for more help.

For your case the commandline looks somewhat like:

blastp -remote -db refseq_protein -query my.fasta -outfmt 6 > blasttab.txt

or use the -entrez_query parameter instead of -query, if you want to run the search without making your own fasta file, see

http://www.ncbi.nlm.nih.gov/books/NBK3837/

ADD COMMENTlink modified 3.5 years ago • written 3.5 years ago by Michael Dondrup46k

Thanks a lot for your response.

Is it also possible to give protein sequence identifier instead of fasta file? In the web tool this is possible, like if you enter:

O48946

Is there anyway to specify the species as well?

ADD REPLYlink modified 3.5 years ago • written 3.5 years ago by maahpishanu0

i think it could work through the -entrez_query parameter but I haven't tried.

ADD REPLYlink written 3.5 years ago by Michael Dondrup46k

I figured this out, no worries for this.

ADD REPLYlink written 3.5 years ago by maahpishanu0

Would you mind please send me the link for python?

 

I thought you know a good manual on how to do blast in biopython and you could maybe share it with me, just to save time.

Otherwise I know how to search it :)

ADD REPLYlink modified 3.5 years ago • written 3.5 years ago by maahpishanu0

google:biopython

ADD REPLYlink written 3.5 years ago by Michael Dondrup46k
0
gravatar for maahpishanu
3.5 years ago by
maahpishanu0 wrote:

I found that I can send my query like the following through R:

 

system2('blastp',c('-remote', "-best_hit_overhang", "0.1", "-best_hit_score_edge", "0.1",'-db', "'refseq_protein'" ,'-entrez_query', "prf||O48946"),stdout=TRUE)

However, I get the following error:

 

[1] "Error: (301.23) [CONN_Read(blast4/HTTP; http://www.ncbi.nlm.nih.gov/Service/dispd.cgi?service=blast4&address=W7DELL902035.mpimp-golm.mpg.de&platform=i386-pc-win64)]  Unable to read data: Timeout[30.000000]"
[2] "Warning: Error initializing remote BLAST database data loader: Protein BLAST database ''refseq_protein'' does not exist in the NCBI servers"                                                                  
[3] "Command line argument error: Query is Empty!"                                                                                                                                                                 
attr(,"status")
[1] 1
Warning message:
running command '"blastp" -remote -best_hit_overhang 0.1 -best_hit_score_edge 0.1 -db 'refseq_protein' -entrez_query prf||O48946' had status 1 

Could anybody help me with this?

ADD COMMENTlink written 3.5 years ago by maahpishanu0

You can do it that way, calling via R, but here you can exactly see the problem: its getting too complicated for a beginner. You have additional quotes here around the db which should be removed, again there is no advantage in using system in R over typing this into your console first.

ADD REPLYlink written 3.5 years ago by Michael Dondrup46k

system2('blastp',c('-remote', "-best_hit_overhang", "0.1", "-best_hit_score_edge", "0.1",'-db', "'refseq_protein'" ,'-entrez_query', "prf||O48946"),stdout=TRUE)

should be:
system2('blastp',c('-remote', "-best_hit_overhang", "0.1", "-best_hit_score_edge", "0.1",'-db', 'refseq_protein' ,'-entrez_query', "prf||O48946"),stdout=TRUE)

Hint: just make it a custom to use only one type of quotes throughout, either single or double, in R there is no semantic difference between the different quote types, unlike e.g. in perl. quoting errors are sometimes hard to spot.

 

ADD REPLYlink modified 3.5 years ago • written 3.5 years ago by Michael Dondrup46k

Thanks a lot. You are right with respect to the quotation. I found this code in Biostar and I thought quotations should differ for the sake of blast query.

ADD REPLYlink written 3.4 years ago by maahpishanu0

Where did you find it? Then its possibly also wrong there.

ADD REPLYlink written 3.4 years ago by Michael Dondrup46k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 621 users visited in the last hour