Question: Downloading full BlastKOALA results?
EduardoFox10 (IBCCF / UFRJ) wrote:

I have just started using BlastKOALA (KEGG), which has been useful for annotating amino acid sequences. This is their website: https://www.kegg.jp/blastkoala/

When you get results, there are links for downloading. However, these links do not download the detailed results for each query, only the general notes already shown on the screen. To get the detailed results I need to click on each query result on the page manually, which becomes impractical with >500 entries. So I think what I need is a tool to download all contents linked from a web page. I have been trying wget, but it doesn't work: it says 'Requested Job Not Found' whatever I do.

Please, has anyone ever tried to achieve this? Thanks in advance.

lelle770 (Berlin) wrote:

I had a quick look at this on my BlastKOALA result.

When I click on one of my queries I get a detailed list of matches. The list has a URL like this:

https://www.kegg.jp/kegg-bin/blastkoala_result_gene_list?id=39732d974cf46cbc344f96d5d7e81bb69c18dcea&passwd=x3XXyz&type=blastkoala&code=user&target=g1%2Et1

If I run

wget "https://www.kegg.jp/kegg-bin/blastkoala_result_gene_list?id=39732d974cf46cbc344f96d5d7e81bb69c18dcea&passwd=x3XXyz&type=blastkoala&code=user&target=g1%2Et1" -O g1.t1_hits.html

I get a file called g1.t1_hits.html (because of the -O option).

If I change the last parameter of the URL (target=g1%2Et1) to a different protein name, I get the results for the corresponding protein.

Maybe you are missing the quotation marks in your wget command? Without them, the shell treats each & in the URL as "run in background", so wget only receives the URL up to the first & and the server cannot find your job.
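
If you don't want to type the protein names by hand, something along these lines might pull them out of a saved copy of the overview page. This is only a sketch: it assumes the overview page (saved here under the hypothetical name overview.html) embeds the same target=... links the browser shows, which is worth checking on your own result first:

# hypothetical: collect the URL-encoded target value of every per-query link
grep -o 'target=[^"&]*' overview.html | sed 's/^target=//' | sort -u > prot.txt

Conveniently, the extracted values stay URL-encoded (g1%2Et1 rather than g1.t1), so they can be substituted straight back into the result URL.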

ADD COMMENTlink written 12 weeks ago by lelle770

EduardoFox10 replied:

Thanks for testing the download! However, you will see that the downloaded page contains just what is already shown on the screen, which I could easily get by selecting all and pasting into a text editor. I'd like to download the detailed results for each queried protein, which you can only see by clicking on it directly. In other words, I'd like to download all the HTML pages linked from the page you just downloaded. Please, would you know how to set this up in wget? I cannot get all the links. Thanks!


lelle770 replied:

The way I would do this is by writing a bash script that calls wget for each protein ID. Something like this:

# fetch the detailed hit list for each protein ID listed in prot.txt
while read PROT; do
  echo "$PROT"
  wget "https://www.kegg.jp/kegg-bin/blastkoala_result_gene_list?id=39732d974cf46cbc344f96d5d7e81bb69c18dcea&passwd=XXxxXX&type=blastkoala&code=user&target=${PROT}" -O "${PROT}_koala.html"
done < prot.txt

where prot.txt is a file with one protein ID per line.
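
If you would rather build prot.txt from the FASTA file you submitted, the IDs can be taken from the headers. A minimal sketch, assuming your input is called proteins.fasta (a hypothetical name) and that BlastKOALA uses the first word of each header as the query name:

# hypothetical: take the first word of every FASTA header as the protein ID
grep '^>' proteins.fasta | sed 's/^>//; s/ .*//' > prot.txt

A literal dot is valid in a URL query string, so IDs like g1.t1 should work as-is; if the server insists on the %2E encoding seen in the result URLs, piping through sed 's/\./%2E/g' would reproduce it.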
