How to set a sleep time in GNU parallel in an esearch/efetch script
3.7 years ago
MAPK ★ 2.1k

I am requesting data from NCBI, and it looks like it only allows three requests per second, so I wanted to parallelize the requests to three query IDs from ${IDLIST} per second. I would like to know how I can set a sleep time of 2 seconds in this code. I know that in a for-loop we can just do sleep 2, but what is the syntax to do this with parallel?

For example, if I run it for just three IDs, as below (head -3 "${IDLIST}"), the download request works:

  parallel -j1 \
  "IFS=$'\n';"'for hit in \
   $(esearch -db sra -query {} | efetch --format runinfo | grep SRR); do \
     echo "{},${hit}"; done' \
  ::: "$(head -3 "${IDLIST}")" \
  | sort -t, -k9,9rn >> out.csv

But it does not work for the full list:

parallel -j1 \
  "IFS=$'\n';"'for hit in \
   $(esearch -db sra -query {} | efetch --format runinfo | grep SRR); do \
     echo "{},${hit}"; done' \
  :::: "${IDLIST}" \
  | sort -t, -k9,9rn >> out.csv

Is there a way to limit this code to three requests per second?

These are some of the IDs in IDLIST:

A-ADC-AD000037-BR-NCR-09AD14648
A-ADC-AD000044-BR-NCR-09AD14647
A-ADC-AD000068-BR-NCR-08AD8038
A-ADC-AD000075-BR-NCR-08AD9964
A-ADC-AD000092-BR-NCR-09AD13601
A-ADC-AD000096-BR-NCR-08AD9891
A-ADC-AD000097-BR-NCR-08AD9961
A-ADC-AD000104-BR-NCR-09AD14644
sra ncbi programming shell • 1.3k views

it only allows three requests per second, so I wanted to parallelize requests for three query ids ${IDLIST} per second.

You are only going to make this worse. NCBI counts the queries per IP address. Have you signed up for an NCBI_API_KEY? If not, you should do that first. Ultimately, NCBI counts the number of requests per domain at a higher level (if I recall correctly).


NCBI may have some of this information available in the form of reports. Look around in ftp://ftp.ncbi.nlm.nih.gov/sra/reports/Metadata/. If you have a really large number of queries, you can download the files and parse the information locally.


@genomax I couldn't find anything older than "NCBI_SRA_Metadata_20181202.tar.gz". I need this from 201802. I just created the api_key and exported the variable export api_key="key", but that still won't solve the problem. Where do I add this key? Thank you for your help.


Add the key to your .bashrc file so that it is exported automatically, or export it each time in the terminal from which you are going to run the searches. The variable must be named NCBI_API_KEY.
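As a minimal sketch of that export (the key value below is a placeholder, not a real key): the question exported a variable named api_key, but as far as I know the EDirect tools look for NCBI_API_KEY, so the name matters.

```shell
# Placeholder value (an assumption for illustration); replace it with the
# key shown in your NCBI account settings.
export NCBI_API_KEY="your_api_key_here"

# Check that child processes (such as esearch and efetch) inherit the variable.
bash -c 'echo "NCBI_API_KEY is ${NCBI_API_KEY:+set}"'

# To make this permanent, put the export line above into ~/.bashrc.
```

With a key, NCBI documents a higher limit of 10 requests per second per key instead of 3, so throttling is still needed, just less of it.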

3.7 years ago
ole.tange ★ 4.4k

Something like this:

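# File with one query ID per line
IDLIST=IDLIST

# Print "query,runinfo-line" for every SRR hit of one query
mysearch() {
    local query="$1"
    local IFS=$'\n'   # split the efetch output on newlines only
    local hit
    for hit in $(esearch -db sra -query "$query" |
                     efetch --format runinfo |
                     grep SRR); do
        echo "$query,${hit}"
    done
}
# Make the function visible to the shells started by parallel
export -f mysearch

parallel -j0 --delay 0.34 mysearch :::: "$IDLIST" |
    sort -t, -k9,9rn >> out.csv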

The magic is --delay 0.34, which makes sure that a new job is started at most once every 0.34 seconds, i.e. fewer than three job starts per second.
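For comparison, the plain for-loop style mentioned in the question can be sketched without parallel. This is only an illustration: run_throttled and demo_search are hypothetical names, and demo_search stands in for mysearch so the sketch runs without network access. Because each job finishes before the pause begins, the effective rate is lower than with --delay, which only spaces out job starts.

```shell
# run_throttled DELAY CMD
# Run CMD once for every non-empty line on stdin, sleeping DELAY seconds
# after each call. (Hypothetical helper, not part of GNU parallel or EDirect.)
run_throttled() {
    local delay="$1" cmd="$2" id
    while IFS= read -r id; do
        [ -n "$id" ] || continue   # skip blank lines
        "$cmd" "$id"
        sleep "$delay"             # fractional seconds work with GNU sleep
    done
}

# Stand-in for mysearch (an assumption for this demo): it only echoes the ID.
demo_search() {
    echo "$1,placeholder"
}

result="$(printf '%s\n' \
    A-ADC-AD000037-BR-NCR-09AD14648 \
    A-ADC-AD000044-BR-NCR-09AD14647 |
    run_throttled 0.34 demo_search)"
echo "$result"
```

With the real function, the call would look like run_throttled 0.34 mysearch < "$IDLIST".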
