Question: How to set sleep in GNU parallel in an esearch/efetch script
MAPK wrote, 7 weeks ago:

I am requesting data from NCBI, and it looks like it only allows three requests per second, so I wanted to parallelize the requests so that three query IDs from ${IDLIST} are processed per second. I would like to know how to set a sleep time of 2 seconds in this code. I know that in a for-loop we can just do sleep 2, but what is the syntax for doing this with parallel?

For example, if I run it for just three IDs, as below (head -3 "${IDLIST}"), the download request works:

  parallel -j1 \
  "IFS=$'\n';"'for hit in \
   $(esearch -db sra -query {} | efetch --format runinfo | grep SRR); do \
     echo "{},${hit}"; done' \
  ::: "$(head -3 "${IDLIST}")" \
  | sort -t, -k9,9rn >> out.csv

But it won't work for the full list:

parallel -j1 \
  "IFS=$'\n';"'for hit in \
   $(esearch -db sra -query {} | efetch --format runinfo | grep SRR); do \
     echo "{},${hit}"; done' \
  :::: "${IDLIST}" \
  | sort -t, -k9,9rn >> out.csv

Is there a way to limit this code to three requests per second?

These are some of the IDs in IDLIST:

A-ADC-AD000037-BR-NCR-09AD14648
A-ADC-AD000044-BR-NCR-09AD14647
A-ADC-AD000068-BR-NCR-08AD8038
A-ADC-AD000075-BR-NCR-08AD9964
A-ADC-AD000092-BR-NCR-09AD13601
A-ADC-AD000096-BR-NCR-08AD9891
A-ADC-AD000097-BR-NCR-08AD9961
A-ADC-AD000104-BR-NCR-09AD14644

it only allows three requests per second, so I wanted to parallelize requests for three query ids ${IDLIST} per second.

You are only going to make this worse. NCBI counts queries per IP address. Have you signed up for an NCBI_API_KEY? If not, you should do that first. Ultimately, NCBI counts the number of requests per domain at a higher level (if I recall correctly).

written 7 weeks ago by genomax

NCBI may have some of this information available in the form of reports. Look around in ftp://ftp.ncbi.nlm.nih.gov/sra/reports/Metadata/. If you have a really large number of queries, you can download the files and parse the information locally.

written 7 weeks ago by genomax

@genomax I couldn't find anything older than "NCBI_SRA_Metadata_20181202.tar.gz". I need this from 201802. I just created the API key and exported the variable with export api_key="key", but that still doesn't solve the problem. Where do I add this key? Thank you for your help.

written 7 weeks ago by MAPK

Add the key to your .bashrc file so that it is exported automatically, or export it each time in the terminal from which you are going to run the searches. Export it as the variable NCBI_API_KEY.
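
As a minimal sketch of the export described above: the key value below is a placeholder, not a real key, and check_api_key is a hypothetical helper added only to confirm that the variable is visible in the current shell. Note that the variable name is NCBI_API_KEY, not api_key.

```shell
# Placeholder value (assumption): replace with the key from your NCBI account.
# Put this line in ~/.bashrc to have it exported in every new shell.
export NCBI_API_KEY="YOUR_API_KEY_HERE"

# Hypothetical helper: report whether the variable is set in this shell.
check_api_key() {
    if [ -n "${NCBI_API_KEY:-}" ]; then
        echo "NCBI_API_KEY is set"
    else
        echo "NCBI_API_KEY is not set"
        return 1
    fi
}

check_api_key
```

After editing .bashrc, open a new terminal (or run source ~/.bashrc) before starting the searches, so that esearch and efetch inherit the variable.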

written 7 weeks ago by genomax

ole.tange (Denmark) wrote, 6 weeks ago:

Something like this:

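# File containing one query ID per line
IDLIST=IDLIST

mysearch() {
    local query="$1"
    # Split the command output on newlines only, so each runinfo row stays intact
    local IFS=$'\n'
    for hit in $(esearch -db sra -query "$query" |
                     efetch --format runinfo |
                     grep SRR); do
        echo "$query,${hit}"
    done
}
# Export the function so the shells started by parallel can call it
export -f mysearch

# Start a new job at most once every 0.34 seconds
parallel -j0 --delay 0.34 mysearch :::: "$IDLIST" |
    sort -t, -k9,9rn >> out.csv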

The magic is --delay 0.34, which makes sure a new job is started at most once every 0.34 seconds, i.e. no more than three job starts per second.
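
The 0.34 is just 1/3 of a second rounded up. A rough sketch of that arithmetic, in case the limit changes (NCBI documents 10 requests per second with an API key): delay_for_rate is a hypothetical helper, not part of GNU parallel, and its optional second argument is an assumption about how many requests each job makes (for example 2, if esearch and efetch each count as one request).

```shell
# Hypothetical helper: smallest delay (in seconds, millisecond precision)
# between job starts that keeps the request rate at or below RATE per second.
# Usage: delay_for_rate RATE [REQUESTS_PER_JOB]
delay_for_rate() {
    local rate="$1"          # allowed requests per second
    local per_job="${2:-1}"  # assumed number of requests made by one job
    # Integer ceiling of (1000 * per_job / rate) milliseconds
    local ms=$(( (1000 * per_job + rate - 1) / rate ))
    printf '%d.%03d\n' $(( ms / 1000 )) $(( ms % 1000 ))
}

delay_for_rate 3      # prints 0.334
delay_for_rate 10     # prints 0.100
delay_for_rate 3 2    # prints 0.667

# Possible use with the command above (not run here):
#   parallel -j0 --delay "$(delay_for_rate 3)" mysearch :::: "$IDLIST"
```

Rounding up rather than to the nearest value matters here: a delay of 0.33 would allow slightly more than three starts per second.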
