Question

How to download all sequences from a web link

2

5.1 years ago

Bioinfonext ▴ 460

Hi,

Could you please suggest how I can download all 82,697 sequences from this website using Linux commands:

http://fungene.cme.msu.edu/hmm_detail.spr?hmm_id=721&page=1

Thanks

bash linux • 4.4k views

ADD COMMENT • link 5.1 years ago by Bioinfonext ▴ 460

0

Not exactly sure what you are asking. If you select all sequences and click begin analysis then it takes you to a new page where there is a download button to get "protein" or "nucleotide" sequence downloads.

Edit: You can only download 10000 sequences at a time so you will need to chunk through this multiple times.

ADD REPLY • link 5.1 years ago by GenoMax 144k

0

Thanks a lot for all your help, SMK. Should I run the whole script again like this? Do I also need to delete the previous nucl_urls.txt file?

#!/bin/bash
API_KEY="REPLACE_WITH_YOUR_API_KEY"

for p in {1..83}; do
  curl -s "http://fungene.cme.msu.edu/hmm_detail.spr?hmm_id=721&page=${p}" \
    | grep "gbnucdata" \
    | sed -r "s|.+seqAccno=(.+)\&format=Genbank.+|http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide\&retmode=text\&rettype=fasta\&id=\1\&api_key=${API_KEY}|" \
    >> nucl_urls.txt
done
grep -w -v -f <(grep '^>' part0*.fasta | awk -F":" '{gsub(">", "", $2); gsub("\\.[0-9]+", "", $2); print $2}') nucl_urls.txt > previously_failed.txt
wget -O previously_failed.fa -i previously_failed.txt

ADD REPLY • link 5.1 years ago by Bioinfonext ▴ 460

0

Hi,

I used the above script, but it was only able to download 251 sequences and finished with an error:

Connecting to www.ncbi.nlm.nih.gov (www.ncbi.nlm.nih.gov)|130.14.29.110|:443... connected.
HTTP request sent, awaiting response... 400 Bad Request
2019-07-04 11:03:17 ERROR 400: Bad Request.

FINISHED --2019-07-04 11:03:17--
Total wall clock time: 3m 51s
Downloaded: 251 files, 234K in 0.03s (7.86 MB/s)

Thanks

ADD REPLY • link 5.1 years ago by Bioinfonext ▴ 460

0

I would check if the URLs are correct by pasting the problematic URLs to the browser and see what the browser returns.

ADD REPLY • link 5.1 years ago by AK ★ 2.2k

0

5.1 years ago

GenoMax 144k

Since I was unable to get @SMK's method to work, I reused part of his code to come up with a method that uses Entrez Direct (EDirect). This example uses only the first page from the original website.

$ for p in {1..1}; do curl -s "http://fungene.cme.msu.edu/hmm_detail.spr?hmm_id=721&page=${p}" | grep "gbnucdata" | awk -F "=|&" '{print $3,$5,$7}' | xargs -n 3 sh -c 'efetch -db nuccore -id "$0" -seq_start "$1" -seq_stop "$2" -format fasta'; done > sequences.fa

ADD COMMENT • link 5.1 years ago by GenoMax 144k

0

Thanks. To download everything, do we need an API key here, and do I also need to install E-utilities? I have already asked the admin to install E-utilities; I can load it easily in scripts once it is available.

ADD REPLY • link 5.1 years ago by Bioinfonext ▴ 460

0

You would need an API_KEY for all large downloads. If @SMK's method is working for you, I suggest you stick with it. You managed to download about half of the sequences yesterday, correct?

ADD REPLY • link 5.1 years ago by GenoMax 144k

0

Yes, @SMK's script is working fine after splitting nucl_urls.txt, but I did not submit it to the HPC scheduler; I am just running it by calling bash download.sh.

For future bulk downloads I want to learn E-utilities.

ADD REPLY • link 5.1 years ago by Bioinfonext ▴ 460

0

It is like this:

--2019-07-04 11:03:14--  http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&retmode=text&rettype=fasta&id=MPQZ01000001&seq_start=109958&seq_stop=110776&strand=1&api_key=redacted
Connecting to www.ncbi.nlm.nih.gov (www.ncbi.nlm.nih.gov)|130.14.29.110|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&retmode=text&rettype=fasta&id=MPQZ01000001&seq_start=109958&seq_stop=110776&strand=1&api_key=redacted [following]
--2019-07-04 11:03:14--  https://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&retmode=text&rettype=fasta&id=MPQZ01000001&seq_start=109958&seq_stop=110776&strand=1&api_key=redacted
Connecting to www.ncbi.nlm.nih.gov (www.ncbi.nlm.nih.gov)|130.14.29.110|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/plain]
Saving to: ‘previously_failed.fa’

    [ <=>                                   ] 953         --.-K/s   in 0s      

2019-07-04 11:03:15 (9.84 MB/s) - ‘previously_failed.fa’ saved [953]

--2019-07-04 11:03:15--  http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&retmode=text&rettype=fasta&id=LK031773&seq_start=53141&seq_stop=54004&strand=1&api_key=redacted
Connecting to www.ncbi.nlm.nih.gov (www.ncbi.nlm.nih.gov)|130.14.29.110|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&retmode=text&rettype=fasta&id=LK031773&seq_start=53141&seq_stop=54004&strand=1&api_key=redacted [following]
--2019-07-04 11:03:15--  https://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&retmode=text&rettype=fasta&id=LK031773&seq_start=53141&seq_stop=54004&strand=1&api_key=redacted
Connecting to www.ncbi.nlm.nih.gov (www.ncbi.nlm.nih.gov)|130.14.29.110|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/plain]
Saving to: ‘previously_failed.fa’

    [ <=>                                   ] 971         --.-K/s   in 0s      

2019-07-04 11:03:15 (4.81 MB/s) - ‘previously_failed.fa’ saved [971]

--2019-07-04 11:03:15--  http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&retmode=text&rettype=fasta&id=AOIL01000062&seq_start=38464&seq_stop=39354&strand=1&api_key=redacted
Connecting to www.ncbi.nlm.nih.gov (www.ncbi.nlm.nih.gov)|130.14.29.110|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&retmode=text&rettype=fasta&id=AOIL01000062&seq_start=38464&seq_stop=39354&strand=1&api_key=redacted
--2019-07-04 11:03:16--  https://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&retmode=text&rettype=fasta&id=AOIL01000062&seq_start=38464&seq_stop=39354&strand=1&api_key=redacted
Connecting to www.ncbi.nlm.nih.gov (www.ncbi.nlm.nih.gov)|130.14.29.110|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/plain]
Saving to: ‘previously_failed.fa’

    [ <=>                                   ] 1,006       --.-K/s   in 0s      

2019-07-04 11:03:16 (16.7 MB/s) - ‘previously_failed.fa’ saved [1006]

--2019-07-04 11:03:16--  http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleo
Connecting to www.ncbi.nlm.nih.gov (www.ncbi.nlm.nih.gov)|130.14.29.110|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleo [following]
--2019-07-04 11:03:17--  https://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleo
Connecting to www.ncbi.nlm.nih.gov (www.ncbi.nlm.nih.gov)|130.14.29.110|:443... connected.
HTTP request sent, awaiting response... 400 Bad Request
2019-07-04 11:03:17 ERROR 400: Bad Request.

FINISHED --2019-07-04 11:03:17--
Total wall clock time: 3m 51s
Downloaded: 251 files, 234K in 0.03s (7.86 MB/s)

$ grep -c '^>' previously_failed.fa 
251

Edit: API user keys redacted @GenoMax.

ADD REPLY • link updated 5.1 years ago by GenoMax 144k • written 5.1 years ago by Bioinfonext ▴ 460

0

It seems the last one is broken?

http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleo

Can you trace back and fix the URLs? And remember to edit the post to remove your actual API KEY so that other people won't see it from here (c0ebfXXXXXX)...

ADD REPLY • link 5.1 years ago by AK ★ 2.2k

0

yes, I am using the same API key which I used in the last script from my NCBI account.

ADD REPLY • link 5.1 years ago by Bioinfonext ▴ 460

0

Should I rerun the script?

ADD REPLY • link 5.1 years ago by Bioinfonext ▴ 460

0

But previously_failed.txt contains only 251 gene IDs in total. Could it be that the same gene ID occurs multiple times? Could we cross-check this against nucl_urls.txt?

$ wc -l previously_failed.txt

251 previously_failed.txt

ADD REPLY • link 5.1 years ago by Bioinfonext ▴ 460

0

It seems the last one is broken?

Please check the content of the URLs! As I said before, your log shows that some of the URLs are broken:

http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleo

If it was broken like that then you need to find out why and fix the URLs before re-running the script.
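
One way to trace such lines, as a minimal sketch: the function below lists every line of a URL file that does not look like a complete region URL. The function name find_broken_urls is made up for illustration, and the pattern assumes the parameter order seen in this thread (id, seq_start, seq_stop, strand, optional api_key); adjust it if your URLs differ.

```shell
# Print "line number:content" for every line that is not a complete region URL.
# The expected parameter order is an assumption based on the URLs in this thread.
find_broken_urls() {
  grep -n -v -E '^https?://[^?]+\?.*&id=[^&]+&seq_start=[0-9]+&seq_stop=[0-9]+&strand=[0-9]+(&api_key=[^&]+)?$' "$1"
}

# Usage: find_broken_urls previously_failed.txt
```

A truncated last line, like the one in the log, often means the file was read while it was still being written or was cut off, so regenerating the URL file from scratch is worth trying first.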

ADD REPLY • link 5.1 years ago by AK ★ 2.2k

0

Hi SMK and genomax

Thanks a lot. I successfully downloaded 82692 sequences using the command below. In this case I had moved the part*.fasta files to another folder, so all 82697 links ended up in previously_failed.txt, and in the end I got 82692 sequences in previously_failed.fa.

grep -w -v -f <(grep '^>' part0*.fasta | awk -F":" '{gsub(">", "", $2); gsub("\\.[0-9]+", "", $2); print $2}') nucl_urls.txt > previously_failed.txt
wget -O previously_failed.fa -i previously_failed.txt

Now I will try to grab the remaining 5 sequences by comparing all the part*.fasta files with previously_failed.fa based on gene ID.

Thanks, Bioinfonext

ADD REPLY • link 5.1 years ago by Bioinfonext ▴ 460

0

The end of previously_failed.txt looks like this (note the broken URL on the last line):

http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&retmode=text&rettype=fasta&id=LK031773&seq_start=53141&seq_stop=54004&strand=1&api_key=redacted
http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&retmode=text&rettype=fasta&id=AOIL01000062&seq_start=38464&seq_stop=39354&strand=1&api_key=redacted
http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleo

ADD REPLY • link 5.1 years ago by Bioinfonext ▴ 460

1

Bioinfonext : While it is tempting to post every error you get, these are things you need to work on and fix yourself. This is part of the learning process.

ADD REPLY • link 5.1 years ago by GenoMax 144k

0

Thanks for all the help, and sorry, I posted another issue here by mistake!

ADD REPLY • link 5.0 years ago by Bioinfonext ▴ 460

GenoMax · Accepted Answer · 2019-07-01

3

5.1 years ago

AK ★ 2.2k

(edited) To download the nucleotide sequences into 4 parts simultaneously:

API_KEY="REPLACE_WITH_YOUR_NCBI_EUTILS_API_KEY"

for p in {1..83}; do
  curl -s "http://fungene.cme.msu.edu/hmm_detail.spr?hmm_id=721&page=${p}" \
    | grep "gbnucdata" \
    | sed -r "s|.+seqAccno=(.+)\&format=Genbank.+|http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide\&retmode=text\&rettype=fasta\&id=\1\&api_key=${API_KEY}|" \
    >> nucl_urls.txt
done

split -d -l 21000 nucl_urls.txt part
ls part0* | parallel -j 4 wget -O {}.fasta -i {}

(original answer to download protein sequences) You can download the sequences in this way (here for example all the protein sequences):

# First get all the accession numbers of protein sequences in all the pages
for p in {1..83}; do
  curl -s "http://fungene.cme.msu.edu/hmm_detail.spr?hmm_id=721&page=${p}" \
    | grep "gpprotdata.jsp?seqAccno" \
    | sed -r 's|.+>(.+)<\/a><\/td>|\1|' \
    >> prot_acc.txt
done

for acc in $(cat prot_acc.txt); do
  wget -O ${acc}.fasta "http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&retmode=text&rettype=fasta&id=${acc}"
done

ADD COMMENT • link 5.1 years ago by AK ★ 2.2k

0

thanks SMK.

I need to download nucleotide sequences according to gene ID so I tried to make some changes but it is showing some error:

    #!/bin/bash

# the accession numbers of protein sequences in all the pages
for p in {1..83}; do
  curl -s "http://fungene.cme.msu.edu/hmm_detail.spr?hmm_id=721&page=${p}" \
    | grep "gpnucldata.jsp?seqgi” \
    | sed -r 's|.+>(.+)<\/a><\/td>|\1|' \
    >> gi_nucl.txt
done

# Then download the nucleotide sequences, for example:
for gi in $(cat gi_nucl.txt); do
i  wget -O ${gi}.fasta "http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&retmode=text&rettype=fasta&id=${gi}"
done

error:

 $bash download.sh 
download.sh: line 16: unexpected EOF while looking for matching `"'
download.sh: line 18: syntax error: unexpected end of file

ADD REPLY • link 5.1 years ago by Bioinfonext ▴ 460

0

The first thing I saw is the line | grep "gpnucldata.jsp?seqgi” \. Try changing the closing character (”) to a straight double quote ("). Also, I'd suggest adding set -euo pipefail on the line after #!/bin/bash, so that the script stops immediately when it hits an error. And there is a stray i at the start of the line i wget -O?
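
Curly quotes are hard to spot by eye. As a small sketch (the function name find_non_ascii is made up for illustration), the check below reports every line of a script that contains bytes outside printable ASCII and whitespace, which is where pasted typographic quotes show up:

```shell
# Report lines containing bytes outside printable ASCII/whitespace,
# such as curly quotes pasted from a web page or word processor.
# LC_ALL=C makes grep compare single bytes instead of multibyte characters.
find_non_ascii() {
  LC_ALL=C grep -n '[^[:print:][:space:]]' "$1"
}

# Usage: find_non_ascii download.sh
```

Each reported line starts with its line number, so the offending quote can be replaced in an editor.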

ADD REPLY • link 5.1 years ago by AK ★ 2.2k

0

$ bash download.sh

download.sh: line 13: unexpected EOF while looking for matching `"'

  #!/bin/bash
set -euo pipefail
# the accession numbers of nucleotide sequences in all the pages
for p in {1..83}; do
  curl -s "http://fungene.cme.msu.edu/hmm_detail.spr?hmm_id=721&page=${p}" \
    | grep "gpnucldata.jsp?seqgi”
    | sed -r 's|.+>(.+)<\/a><\/td>|\1|' \
    >> gi_nucl.txt
done

# Then download the nucleotide sequences, for example:
for gi in $(cat gi_nucl.txt); do
wget -O ${gi}.fasta "http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&retmode=text&rettype=fasta&id=${gi}"
done

ADD REPLY • link 5.1 years ago by Bioinfonext ▴ 460

0

For nucleotide, you have to consider seq_start and seq_stop. For example, entry AOJK01000067 will be downloaded from http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&retmode=text&rettype=fasta&seq_start=53467&seq_stop=54312&strand=1&id=AOJK01000067.

Try this one:

#!/bin/bash
set -euo pipefail

for p in {1..83}; do
  curl -s "http://fungene.cme.msu.edu/hmm_detail.spr?hmm_id=721&page=${p}" \
    | grep "gbnucdata" \
    | sed -r 's|.+seqAccno=(.+)&format=Genbank.+|http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide\&retmode=text\&rettype=fasta\&id=\1|' \
    >> nucl_urls.txt
done

wget -O hmm_id_721_nucl.fa -i nucl_urls.txt

Where nucl_urls.txt contains:

$ head nucl_urls.txt
http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&retmode=text&rettype=fasta&id=AM774418&seq_start=280304&seq_stop=281149&strand=1
http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&retmode=text&rettype=fasta&id=AOJK01000067&seq_start=53467&seq_stop=54312&strand=1
http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&retmode=text&rettype=fasta&id=AF016485&seq_start=135862&seq_stop=137037&strand=1
http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&retmode=text&rettype=fasta&id=AOIT01000069&seq_start=1403&seq_stop=2242&strand=1
http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&retmode=text&rettype=fasta&id=LOPV01000533&seq_start=1144&seq_stop=1992&strand=1
http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&retmode=text&rettype=fasta&id=AOLJ01000027&seq_start=50569&seq_stop=51417&strand=1
http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&retmode=text&rettype=fasta&id=AOLH01000006&seq_start=26854&seq_stop=27702&strand=1
http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&retmode=text&rettype=fasta&id=BASG01000071&seq_start=6373&seq_stop=7167&strand=1
http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&retmode=text&rettype=fasta&id=AUXP01000135&seq_start=5869&seq_stop=6663&strand=1
http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&retmode=text&rettype=fasta&id=MDCP01000151&seq_start=2694&seq_stop=3488&strand=1

When you copy the codes, make sure the end double quotation mark is in the correct format. It should work.
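
For a single record, the URL pattern in the listing above can be sketched as a small helper. The function name efetch_region_url and its argument order are illustrative only; the sketch uses https because the wget logs in this thread show the http address being redirected there.

```shell
# Build an efetch URL for one region of a nucleotide record.
# Arguments: accession, start, stop, optional strand (defaults to 1 as in the listing).
efetch_region_url() {
  local acc="$1" start="$2" stop="$3" strand="${4:-1}"
  printf 'https://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&retmode=text&rettype=fasta&id=%s&seq_start=%s&seq_stop=%s&strand=%s\n' \
    "$acc" "$start" "$stop" "$strand"
}

# Usage: wget -O AOJK01000067.fa "$(efetch_region_url AOJK01000067 53467 54312)"
```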

ADD REPLY • link 5.1 years ago by AK ★ 2.2k

0

Hi SMK,

This script has been running for the last 5 hours. How much time do you think it will take to download all the sequences?

Thanks Bioinfonext

ADD REPLY • link 5.1 years ago by Bioinfonext ▴ 460

0

As long as it takes to download 80K+ genes.

It may have proved simpler to get the nr blast database and extract the sequences you need from it using blastdbcmd.

ADD REPLY • link 5.1 years ago by GenoMax 144k

0

Indeed. Bioinfonext, you can check how many sequences are downloaded by: grep -c '^>' hmm_id_721_nucl.fa.

ADD REPLY • link 5.1 years ago by AK ★ 2.2k

0

Thanks genomax, I have already downloaded the nr nucleotide database. Could you suggest how blastdbcmd would work, and what the steps are?

How can I extract the gene IDs from the nucl_urls.txt file?

ADD REPLY • link 5.1 years ago by Bioinfonext ▴ 460

0

This would work only for nr database (proteins). For nucleic acid you don't have this option since you need to consider a start/stop (as shown by @SMK, so that is the only solution).

$ blastdbcmd -db /path_to/nr_v5 -entry AAC82905 -outfmt %f
>O52025.1 RecName: Full=Arsenite methyltransferase [Halobacterium salinarum NRC-1] >AAC82905.1 unknown [Halobacterium salinarum NRC-1]
MELWTHPTPAAPRLATSTRTRWRRTSRCSQPWATTPGTNSSDASRTPTTASASATSKPQSASARARSVRRSPDCTPRAWS
RGARKDRGATTNRPRRPKFCSKRSTTCEATMSNDNETMVADRDPEETREMVRERYAGIATSGQDCCGDVGLDVSGDGGCC
SDETEASGSERLGYDADDVASVADGADLGLGCGNPKAFAAMAPGETVLDLGSGAGFDCFLAAQEVGPDGHVIGVDMTPEM
ISKARENVAKNDAENVEFRLGEIGHLPVADESVNVVISNCVVNLAPEKQRVFDDTYRVLRPGGRVAISDVVQTAPFPDDV
QMDPDSLTGCVAGASTVDDLKAMLDEAGFEAVEIAPKDESTEFISDWDADRDLGEYLVSATIEARKPARDD
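
On the side question of pulling the IDs out of nucl_urls.txt, a minimal sketch, assuming each line is an efetch URL with an id= parameter as in this thread (the function name extract_ids is made up for illustration):

```shell
# Print the value of the id= parameter from each efetch URL, one per line.
# Lines without an id= parameter are skipped.
extract_ids() {
  sed -n -E 's/.*[?&]id=([^&]+).*/\1/p' "$1"
}

# Usage: extract_ids nucl_urls.txt | sort -u > nucl_acc.txt
```

Note that these are nucleotide accessions, so the resulting list does not carry the start/stop coordinates that the region downloads depend on.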

ADD REPLY • link 5.1 years ago by GenoMax 144k

0

Hi,

So far, this many sequences have been retrieved:

grep -c '^>' hmm_id_721_nucl.fa

31021

But now it just keeps showing messages like this: Reusing existing connection to www.ncbi.nlm.nih.gov:443. HTTP request sent, awaiting response... 200 OK

and no further gene sequences are retrieved.

I ran it as a bash script on the HPC. Do I need to submit this script to the HPC scheduler instead?

tmode=text&rettype=fasta&id=JXLL01000019&seq_start=24218&seq_stop=24844&strand=1
Reusing existing connection to www.ncbi.nlm.nih.gov:443.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/plain]
Saving to: ‘hmm_id_721_nucl.fa’

hmm_id_721_nucl.fa         [ <=>                        ]     732  --.-KB/s    in 0s      

2019-07-02 18:38:35 (13.6 MB/s) - ‘hmm_id_721_nucl.fa’ saved [732]

URL transformed to HTTPS due to an HSTS policy
--2019-07-02 18:38:35--  https://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&retmode=text&rettype=fasta&id=FOBO01000014&seq_start=86078&seq_stop=87025&strand=1
Reusing existing connection to www.ncbi.nlm.nih.gov:443.
HTTP request sent, awaiting response... No data received.
Retrying.

Thanks Bioinfonext

ADD REPLY • link 5.1 years ago by Bioinfonext ▴ 460

0

You are most likely running afoul of number of connections/queries allowed by NCBI per IP address. You may want to split nucl_urls.txt into smaller chunks past the point where downloads have been successful and then run those pieces sequentially allowing one download to complete before starting next.
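
A rough sketch of that idea follows. The function names, the rest_ chunk prefix, and the one-second pause are all illustrative choices, not NCBI figures; the sketch also assumes one FASTA record per URL, fetched in file order, so that the record count tells you how many URLs are already done.

```shell
# Print the URLs that come after the first N lines (N = records already downloaded).
remaining_urls() {
  local done_count="$1" urls_file="$2"
  tail -n "+$((done_count + 1))" "$urls_file"
}

# Fetch chunk files one after another; each wget finishes before the next starts.
# --wait=1 pauses one second between requests (an arbitrary, conservative value).
download_chunks_sequentially() {
  local chunk
  for chunk in "$@"; do
    wget --wait=1 -O "${chunk}.fasta" -i "$chunk"
  done
}

# Usage, with the count reported by grep -c '^>' on the partial download:
#   remaining_urls 31021 nucl_urls.txt | split -d -l 10000 - rest_
#   download_chunks_sequentially rest_*
```

If the last record of the partial file was cut off mid-download, start one URL earlier and drop the incomplete record before merging.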

ADD REPLY • link 5.1 years ago by GenoMax 144k

0

Thanks genomax,

I split nucl_urls.txt into four parts:

nucl_urls1.txt, nucl_urls2.txt, nucl_urls3.txt, nucl_urls4.txt

but now I am not sure how to change the script. Or should I just change the page numbers in the script, for example first download pages 1-25, then pages 26-50, and so on, like this:

#!/bin/bash
set -euo pipefail

for p in {1..25}; do
  curl -s "http://fungene.cme.msu.edu/hmm_detail.spr?hmm_id=721&page=${p}" \
    | grep "gbnucdata" \
    | sed -r 's|.+seqAccno=(.+)&format=Genbank.+|http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide\&retmode=text\&rettype=fasta\&id=\1|' \
    >> nucl_urls.txt
done

wget -O hmm_id_721_nucl.fa -i nucl_urls.txt

Thanks for all help and your valuable time!

ADD REPLY • link 5.1 years ago by Bioinfonext ▴ 460

0

Hey Bioinfonext,

Got an idea: (1) Apply for an API_KEY (2) Split and download each chunk simultaneously. Have a read at https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities/. It seems:

your key will increase the limit to 10 requests/second for all activity from that key

Thus the approach can be extended as follows (remember to remove nucl_urls.txt if it already exists, since the loop appends to it, and note that the sed line now has &api_key=${API_KEY} added):

API_KEY="REPLACE_WITH_YOUR_API_KEY"

for p in {1..83}; do
  curl -s "http://fungene.cme.msu.edu/hmm_detail.spr?hmm_id=721&page=${p}" \
    | grep "gbnucdata" \
    | sed -r "s|.+seqAccno=(.+)\&format=Genbank.+|http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide\&retmode=text\&rettype=fasta\&id=\1\&api_key=${API_KEY}|" \
    >> nucl_urls.txt
done

split -d -l 21000 nucl_urls.txt part
ls part0* | parallel -j 4 wget -O {}.fasta -i {}

Which will download 4 parts at a time. Also make sure that there are 82,697 entries in nucl_urls.txt:

$ wc -l nucl_urls.txt
82697 nucl_urls.txt

ADD REPLY • link 5.1 years ago by AK ★ 2.2k

0

Hi, it is showing an error:

$ bash download.sh 
download.sh: line 12: parallel: command not found

It generated a few files:

  $ ls
1            nucl_urls.txt  part01  part03  part05  part07
download.sh  part00         part02  part04  part06  part08

$ wc -l nucl_urls.txt
82697 nucl_urls.txt

and the script is:

#!/bin/bash
API_KEY="REPLACE_WITH_YOUR_API_KEY"

for p in {1..83}; do
  curl -s "http://fungene.cme.msu.edu/hmm_detail.spr?hmm_id=721&page=${p}" \
    | grep "gbnucdata" \
    | sed -r "s|.+seqAccno=(.+)\&format=Genbank.+|http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide\&retmode=text\&rettype=fasta\&id=\1\&api_key=${API_KEY}|" \
    >> nucl_urls.txt
done

split -d -l 10000 nucl_urls.txt part
ls part0* | parallel -j 4 wget -O {}.fasta -i {}

Thanks

ADD REPLY • link 5.1 years ago by Bioinfonext ▴ 460

0

Hey Bioinfonext,

You have to download and install the program called parallel, at https://savannah.gnu.org/projects/parallel/

And I hope in the script that you executed, the API_KEY is your own version.

ADD REPLY • link 5.1 years ago by AK ★ 2.2k

0

Thanks. I am working on an HPC server, and I loaded parallel in the script by adding the command below:

 module load apps/parallel/20151222/gcc-4.8.5

Let's see if it works!

ADD REPLY • link 5.1 years ago by Bioinfonext ▴ 460

0

Do you think I also need to change this code?

API_KEY="REPLACE_WITH_YOUR_API_KEY"

How can I find the API_KEY for the loaded parallel module?

ADD REPLY • link 5.1 years ago by Bioinfonext ▴ 460

0

Hey Bioinfonext,

Read that blog...... it's for NCBI E-utilities API.

Got an idea: (1) Apply for an API_KEY (2) Split and download each chunk simultaneously. Have a read at https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities/.

ADD REPLY • link 5.1 years ago by AK ★ 2.2k

0

Thanks, I got the API key, but I am just running the script on the HPC by calling it with bash:

bash download.sh

Do you think I should submit this script to the HPC using sbatch? If I run it with bash, will it harm the HPC?

thanks

ADD REPLY • link 5.1 years ago by Bioinfonext ▴ 460

0

Just remember to request the same number of threads that you wrote in the script (should be 4 in the example, used by parallel).

ADD REPLY • link 5.1 years ago by AK ★ 2.2k

0

I just added this part to script:

#!/bin/bash
#SBATCH --job-name=DOWNLAOD
#SBATCH --output=/users/3052771/sharedscratch/arsenic_amplicon
#SBATCH --ntasks=20
#SBATCH --time=80:00:00

Should I use 4 instead of 20 for ntasks?

Do I also need to set the current working directory here? I am not sure how to do that; our HPC was just upgraded to SLURM, so I am not familiar with it yet!

ADD REPLY • link 5.1 years ago by Bioinfonext ▴ 460

0

You can explicitly save the output files to directory you want otherwise they would be saved in the directory you run this script from.

Here is some food for thought:

You could basically submit a separate job via sbatch to download each link in the URL file. That way a certain number of jobs (depending on the job slot limit on your account) will start and the rest will pend. As one job completes, the next in line gets pulled in. You may need to save the output files separately and then cat them into one big file later. You may want to weigh this option depending on if/how you are charged for use of compute resources.

ADD REPLY • link 5.1 years ago by GenoMax 144k

0

Hi genomax,

could you please help me to modify script according to slurm HPC?

I will be thankful for your time and help.

Regards Bioinfonext

ADD REPLY • link 5.1 years ago by Bioinfonext ▴ 460

0

In theory you could do something like this.

Note: I can't get this to work (I am getting HTTP errors with the links that @SMK's script generates). You will need to adjust the SLURM options as needed.

$ num=0;for i in `cat nucl_urls.txt`; do echo sbatch -t 1-0 -p partition --wrap=\"wget -O ${num}.fa ${i}\"; num=$((num+1)); done
sbatch -t 1-0 -p partition --wrap="wget -O 0.fa http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&retmode=text&rettype=fasta&id=AM774418&seq_start=280304&seq_stop=281149&strand=1"
sbatch -t 1-0 -p partition --wrap="wget -O 1.fa http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&retmode=text&rettype=fasta&id=AOJK01000067&seq_start=53467&seq_stop=54312&strand=1"
sbatch -t 1-0 -p partition --wrap="wget -O 2.fa http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&retmode=text&rettype=fasta&id=AF016485&seq_start=135862&seq_stop=137037&strand=1"

Edit: You will need to remove echo and \ before " to actually submit the jobs.
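
One thing to watch in commands like these: the URLs contain &, and the string passed to --wrap is executed by a shell, so an unquoted & is likely to cut the wget command short. The sketch below is a dry run that only prints the sbatch commands, with each URL wrapped in single quotes. The function name print_sbatch_cmds is made up, and "partition" is a placeholder for a site-specific partition name.

```shell
# Print (do not submit) one sbatch command per URL, with the URL single-quoted
# inside --wrap so that "&" is not interpreted by the shell running the job.
print_sbatch_cmds() {
  local urls_file="$1" num=0 url q="'"
  while IFS= read -r url; do
    [ -n "$url" ] || continue   # skip empty lines
    printf 'sbatch -t 1-0 -p partition --wrap="wget -O %s.fa %s%s%s"\n' \
      "$num" "$q" "$url" "$q"
    num=$((num + 1))
  done < "$urls_file"
}

# Usage: print_sbatch_cmds nucl_urls.txt | head    # inspect first
#        print_sbatch_cmds nucl_urls.txt | bash    # then actually submit
```

Submitting one job per URL means tens of thousands of jobs for this data set, so check your scheduler's limits before piping the output to bash.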

ADD REPLY • link 5.1 years ago by GenoMax 144k

0

1st script to get nucl_urls.txt:

#!/bin/bash
API_KEY="REPLACE_WITH_YOUR_API_KEY"

for p in {1..83}; do
  curl -s "http://fungene.cme.msu.edu/hmm_detail.spr?hmm_id=721&page=${p}" \
    | grep "gbnucdata" \
    | sed -r "s|.+seqAccno=(.+)\&format=Genbank.+|http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide\&retmode=text\&rettype=fasta\&id=\1\&api_key=${API_KEY}|" \
    >> nucl_urls.txt
done

and then a second script to do the partitioning and download. Do we need API_KEY="REPLACE_WITH_YOUR_API_KEY" here? I am not sure whether I can insert it here if needed. Do we also need to pass nucl_urls.txt as an input variable?

#!/bin/bash
#SBATCH --job-name=DOWNLAOD
#SBATCH --output=/users/3052771/sharedscratch/arsenic_amplicon
#SBATCH --ntasks=20
#SBATCH --time=80:00:00

num=0;for i in `cat nucl_urls.txt`; do echo sbatch -t 1-0 -p htsf --wrap=\"wget -O ${num}.fa ${i}\"; num=$((num+1)); done
sbatch -t 1-0 -p partition --wrap="wget -O 0.fa http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&retmode=text&rettype=fasta&id=AM774418&seq_start=280304&seq_stop=281149&strand=1"
sbatch -t 1-0 -p partition --wrap="wget -O 1.fa http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&retmode=text&rettype=fasta&id=AOJK01000067&seq_start=53467&seq_stop=54312&strand=1"
sbatch -t 1-0 -p partition --wrap="wget -O 2.fa http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&retmode=text&rettype=fasta&id=AF016485&seq_start=135862&seq_stop=137037&strand=1"

ADD REPLY • link updated 5.1 years ago by GenoMax 144k • written 5.1 years ago by Bioinfonext ▴ 460

0

Bioinfonext : Method I demonstrated above was a way to submit individual SLURM jobs directly on the command line. You either submit jobs this way or create a SLURM script like you originally posted above.

You can't combine the two methods.

You would need to use your API_KEY no matter which option you use.

ADD REPLY • link 5.1 years ago by GenoMax 144k

0

You have to export the API key as a variable (you could add it to your bash profile).

export NCBI_API_KEY=your_key

ADD REPLY • link 5.1 years ago by GenoMax 144k

0

Just in case it's confusing for Bioinfonext, in the example script provided, his/her API key is saved as a variable called API_KEY and appended to the end of the URL (...&rettype=fasta&id=\1&api_key=${API_KEY}).

ADD REPLY • link 5.1 years ago by AK ★ 2.2k

0

You are doing the correct thing.

If API key is exported from one's profile it makes the process seamless without having to append the key for each query.
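
The exported NCBI_API_KEY variable is picked up by the EDirect tools such as efetch. For plain URLs fetched with wget the key still has to be part of each URL, so the two approaches can be bridged with a small helper. This is only a sketch, and the function name with_api_key is made up for illustration:

```shell
# Append api_key=... to a URL when NCBI_API_KEY is set; otherwise print it unchanged.
# Assumes the URL already has a query string, as the efetch URLs in this thread do.
with_api_key() {
  local url="$1"
  if [ -n "${NCBI_API_KEY:-}" ]; then
    printf '%s&api_key=%s\n' "$url" "$NCBI_API_KEY"
  else
    printf '%s\n' "$url"
  fi
}

# Usage: wget -O out.fa "$(with_api_key "$url")"
```

Keeping the key in the environment rather than in nucl_urls.txt also makes it less likely to end up in pasted logs.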

ADD REPLY • link 5.1 years ago by GenoMax 144k

0

Hi SMK,

I understood it, and the script seems to be working fine, but it would be great if it could be modified for submission on a SLURM HPC.

Thanks Bioinfonext

ADD REPLY • link 5.1 years ago by Bioinfonext ▴ 460

0

Hi SMK,

Thanks for your all help and time!

It finished downloading but the total number of sequences is not 82697;

FINISHED --2019-07-03 17:39:10--
Total wall clock time: 45m 55s
Downloaded: 2696 files, 2.2M in 1.5s (1.48 MB/s)
$ ls
1              part01        part03.fasta  part06        part08.fasta
download.sh    part01.fasta  part04        part06.fasta
nucl_urls.txt  part02        part04.fasta  part07
part00         part02.fasta  part05        part07.fasta
part00.fasta   part03        part05.fasta  part08


total count:

$ grep -c '^>' part00.fasta 
8746
$ grep -c '^>' part01.fasta 
8751
$ grep -c '^>' part02.fasta 
8753
$ grep -c '^>' part03.fasta 
8750
$ grep -c '^>' part04.fasta 
10000
$ grep -c '^>' part05.fasta 
10000
$ grep -c '^>' part06.fasta 
10000
$ grep -c '^>' part07.fasta 
9998
$ grep -c '^>' part08.fasta 
2696

That adds up to 77694. Could it be that the other sequences are not available in the NCBI database?

thanks

ADD REPLY • link 5.1 years ago by Bioinfonext ▴ 460

0

They should be there. You can try again for the failed ones by:

grep -w -v -f <(grep '^>' part0*.fasta | awk -F":" '{gsub(">", "", $2); gsub("\\.[0-9]+", "", $2); print $2}') nucl_urls.txt > previously_failed.txt
wget -O previously_failed.fa -i previously_failed.txt
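
The one-liner above matches on accession only, so an accession with several regions counts as done as soon as one of them has been downloaded. A stricter variant, as a sketch under assumptions: it compares accession plus start and stop, and it assumes FASTA headers of the form ">AM774418.1:280304-281149 description", which is what region downloads on the plus strand appear to produce. The function name missing_urls is made up for illustration.

```shell
# Print the URLs whose accession:start-stop has no matching FASTA header yet.
# Arguments: the URL file, followed by one or more FASTA files.
missing_urls() {
  local urls_file="$1"; shift
  awk '
    # Return the value of one query parameter from a URL, or "" if absent.
    function param(url, name,    re, s) {
      re = "[?&]" name "=[^&]*"
      if (match(url, re)) {
        s = substr(url, RSTART, RLENGTH)
        sub(/^[?&][^=]*=/, "", s)
        return s
      }
      return ""
    }
    # FASTA header: remember ACC:start-stop with the version suffix removed.
    /^>/ {
      key = substr($1, 2)
      sub(/\.[0-9]+:/, ":", key)
      seen[key] = 1
      next
    }
    # URL line: print it unless its region was already seen.
    /^https?:/ {
      key = param($0, "id") ":" param($0, "seq_start") "-" param($0, "seq_stop")
      if (!(key in seen)) print
    }
  ' "$@" "$urls_file"   # FASTA files are read first, the URL file last
}

# Usage: missing_urls nucl_urls.txt part0*.fasta > previously_failed.txt
```

Check a few headers in your own files first; if they use a different layout, the key built from the header needs adjusting.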

ADD REPLY • link 5.1 years ago by AK ★ 2.2k

0

Using parallel on an HPC cluster is not good advice. There is no need to use parallel on the cluster, since the user can ask for an equivalent number of job slots, which is the functionality that parallel provides on a regular machine.

ADD REPLY • link 5.1 years ago by GenoMax 144k