Tutorial: Fast download of FASTQ files from the European Nucleotide Archive (ENA)
Written 12 months ago by ATpoint (Germany):

As questions on how to retrieve published sequencing data quickly and efficiently are posted here on Biostars quite frequently, this little tutorial demonstrates how to bulk-download fastq files from the European Nucleotide Archive (ENA). Typically, people ask how to get a certain SRA file from NCBI and how to convert it to fastq. The common answer is prefetch followed by fastq-dump, but the latter in particular is rather slow, so total processing can take some time, especially if CPU (and disk) resources are limited. Luckily, most published (and unrestricted) sequencing data are mirrored at the ENA directly in fastq format, and there is a simple and efficient way to retrieve them. In this tutorial, we will download, as an example, an entire dataset of ChIP-seq and ATAC-seq data, requiring minimal preprocessing work. We will use the Aspera client for download rates of several tens of Mb/s up to a few hundred Mb/s (depending on the connection, I/O capacity and distance to the download location). This example code should work on Linux and Mac.


--- last modified: 11.7.19 ::: Explicitly mentioned that the ascp executable has to be in PATH.


Step-1: Get the Aspera client

Go to https://downloads.asperasoft.com/en/downloads/8?list and get the most recent installer for your system. For Linux, it is a tarball (use tar zxvf to unpack) containing an installer shell script that you then have to run, and for Mac, a standard disk image.

After installation, you will find these executables/files in their default locations:

Linux:

$HOME/.aspera/connect/bin/ascp --- the executable

$HOME/.aspera/connect/etc/asperaweb_id_dsa.openssh --- openssh file that we'll need later

Mac:

$HOME/Applications/Aspera\ Connect.app/Contents/Resources/ascp --- the executable

$HOME/Applications/Aspera\ Connect.app/Contents/Resources/asperaweb_id_dsa.openssh --- openssh file that we'll need later

In any case, make sure you add the folder with the ascp executable to your PATH. If PATH is a new word to you, please google it ;-)
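For the PATH step, a minimal sketch could look like this. It assumes the default Linux install location listed above; the variable name ASPERA_BIN is only chosen for illustration, and on Mac or with a custom install the folder will differ.

```shell
# Assumed default Linux install location; adjust for Mac or custom installs.
ASPERA_BIN="$HOME/.aspera/connect/bin"

# Prepend the folder (not the executable itself) to PATH for the current session.
export PATH="$ASPERA_BIN:$PATH"

# Check whether the shell can now find ascp.
if command -v ascp >/dev/null 2>&1; then
  echo "ascp found at: $(command -v ascp)"
else
  echo "ascp not found; check that $ASPERA_BIN exists and contains ascp"
fi
```

To make this permanent, the export line would typically go into your shell startup file (e.g. ~/.bashrc).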


Step-2: Choose your dataset

Once you know which data you want to download, check whether they are available at the ENA, which is true for most unrestricted data. For this tutorial, we will download the entire dataset from the ChIPmentation paper of 2015. When you check the paper for the NCBI accession, you'll find GSE70482. Following this link, you find the BioProject accession number PRJNA288801. Go to the ENA, enter PRJNA288801 in the search field and you will find a summary page with all data available for download. Scrolling down a bit, you see a table with accession numbers and all kinds of metadata. As we typically do not need most of these metadata, we use the field Select columns to pick only the essential metadata we need for the download, which are Study Accession, FASTQ files (FTP) and Experiment title. After selecting these and unselecting everything else, press TEXT and save the file as accessions.txt in your project folder.

Edit (01/19): Also see sra-explorer ("find SRA and FastQ download URLs in a couple of clicks") by Phil Ewels, a nice interface to browse data on NCBI and ENA.


[Screenshot: the "Select columns" dialog at the ENA]

[Screenshot: content of accessions.txt]


Step-3: Download the data

As you'll see in accessions.txt, the download paths point to the ENA FTP server, which is rather slow. We want to download with the Aspera client instead (up to 200Mb/s at my workplace). Therefore, we awk around a bit to change the download paths to the era-fasp server. As you'll see in case of paired-end data, the paths to the two mate fastq files in accessions.txt are separated by a semicolon, which we take into account. The snippet reads the paths from the third column of accessions.txt and writes download.txt.

Linux:

awk 'FS="\t", OFS="\t" { gsub("ftp.sra.ebi.ac.uk", "era-fasp@fasp.sra.ebi.ac.uk:"); print }' accessions.txt | cut -f3 | awk -F ";" 'OFS="\n" {print $1, $2}' | awk NF | awk 'NR > 1, OFS="\n" {print "ascp -QT -l 300m -P33001 -i $HOME/.aspera/connect/etc/asperaweb_id_dsa.openssh" " " $1 " ."}' > download.txt

Mac:

awk 'FS="\t", OFS="\t" { gsub("ftp.sra.ebi.ac.uk", "era-fasp@fasp.sra.ebi.ac.uk:"); print }' accessions.txt | cut -f3 | awk -F ";" 'OFS="\n" {print $1, $2}' | awk NF | awk 'NR > 1, OFS="\n" {print "ascp -QT -l 300m -P33001 -i $HOME/Applications/Aspera\ Connect.app/Contents/Resources/asperaweb_id_dsa.openssh" " " $1 " ."}' > download.txt

The output is a simple list of download commands using ascp.

[Screenshot: content of download.txt]
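If you prefer something more readable than the one-liner, the same logic could be sketched as a small shell function. This is only a sketch under assumptions: the function name make_download_list is made up for illustration, and it assumes that the header line of accessions.txt names the FTP column fastq_ftp (it looks the column up by name instead of relying on its position). The path to the openssh file is passed as the second argument.

```shell
# Sketch: build ascp commands from an ENA report (function name is hypothetical).
# Usage: make_download_list accessions.txt /path/to/asperaweb_id_dsa.openssh > download.txt
make_download_list() {
  awk -F '\t' -v key="$2" '
    # Header line: find the column named fastq_ftp (assumed ENA field name).
    NR == 1 {
      for (i = 1; i <= NF; i++) if ($i == "fastq_ftp") col = i
      if (!col) { print "no fastq_ftp column found" > "/dev/stderr"; exit 1 }
      next
    }
    {
      # Paired-end runs list both mates separated by a semicolon.
      n = split($col, files, ";")
      for (j = 1; j <= n; j++) {
        if (files[j] == "") continue
        # Swap the FTP host for the Aspera user@host: prefix.
        sub(/^ftp\.sra\.ebi\.ac\.uk/, "era-fasp@fasp.sra.ebi.ac.uk:", files[j])
        print "ascp -QT -l 300m -P33001 -i " key " " files[j] " ."
      }
    }
  ' "$1"
}
```

Passing the full path of the openssh file (e.g. "$HOME/.aspera/connect/etc/asperaweb_id_dsa.openssh", which the shell expands when you call the function) means the resulting commands do not depend on $HOME being expanded later.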

That's it. Now, we only have to run the download commands.

Edit (23.07.18): The download paths are always like era-fasp@fasp.sra.ebi.ac.uk:/vol1(...). I point that out because of a recent post (328182) where OP accidentally forgot the ":" after the .ac.uk and used fasp@ instead of era-fasp@.
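To catch this kind of typo before starting a long download, one could check the list first. A minimal sketch (the function name check_download_list is hypothetical): it prints every line that lacks the expected era-fasp@fasp.sra.ebi.ac.uk:/ prefix and returns a non-zero status if it finds any.

```shell
# Print any line of a download list that lacks the expected prefix
# (function name is hypothetical).
# Usage: check_download_list download.txt
check_download_list() {
  # grep -v prints the non-matching lines and succeeds only if there are any.
  if grep -v -F 'era-fasp@fasp.sra.ebi.ac.uk:/' "$1"; then
    echo "the lines above do not contain era-fasp@fasp.sra.ebi.ac.uk:/" >&2
    return 1
  fi
  echo "all lines look fine"
}
```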

Let's download:

## Either by a simple loop (eval makes the shell expand $HOME inside each command):
while read -r LIST; do
eval "$LIST"; done < download.txt

## or by using GNU parallel to have things parallelized:
cat download.txt | parallel "{}"
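For larger datasets it can help to keep track of commands that failed, so that they can be run again later. The following is a sketch, not part of the original workflow: the function name run_downloads and the file name failed.txt are made up for illustration.

```shell
# Run every command in a list and collect the ones that fail
# (function name is hypothetical).
# Usage: run_downloads download.txt failed.txt
run_downloads() {
  : > "$2"                                 # start with an empty failure list
  while IFS= read -r cmd; do
    [ -z "$cmd" ] && continue              # skip empty lines
    # bash -c expands $HOME inside the command; </dev/null keeps the
    # command from consuming the rest of the list on stdin.
    if ! bash -c "$cmd" </dev/null; then
      printf '%s\n' "$cmd" >> "$2"
    fi
  done < "$1"
  [ ! -s "$2" ]                            # succeed only if nothing failed
}
```

Since failed.txt has the same format as download.txt, it could simply be fed into the function again to retry.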

Once the download is complete, one can play around using the accessions.txt to rename the files with e.g. information from the Experiment title field (column 2), or other metadata you may retrieve from ENA.
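As one possible way to do such a renaming, here is a sketch that only prints mv commands so you can inspect them before running anything. It assumes the column layout used in this tutorial (Experiment title in column 2, FTP paths in column 3); the function name print_rename_commands and the naming scheme (title, underscore, original file name) are made up for illustration.

```shell
# Print mv commands that prefix each fastq file with its experiment title
# (function name and naming scheme are hypothetical).
# Usage: print_rename_commands accessions.txt
print_rename_commands() {
  awk -F '\t' 'NR > 1 {
    title = $2
    gsub(/[^A-Za-z0-9._-]+/, "_", title)   # make the title safe for file names
    sub(/_+$/, "", title)                  # drop trailing underscores
    n = split($3, files, ";")              # one or two files per run
    for (j = 1; j <= n; j++) {
      if (files[j] == "") continue
      m = split(files[j], parts, "/")      # keep only the file name
      print "mv " parts[m] " " title "_" parts[m]
    }
  }' "$1"
}
```

If the printed commands look right, they can be written to a file and executed with bash.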


Edit 28.2.19: For the sake of completeness, I also add a suggestion on how to get the same data from NCBI using prefetch and parallel-fastq-dump, a wrapper for fastq-dump by Renan Valieris for parallelized fastq conversion from sra files. Say one has a file IDs.txt which contains the SRA file IDs like:

SRRXXXXXX1
SRRXXXXXX2
(...)
SRRXXXXXXn

one can use this simple function to download SRA files via prefetch (please see the NCBI documentation on how to use Aspera with prefetch to avoid slow FTP downloads), followed by fastq conversion with parallel-fastq-dump.

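function LoadDump {
  # Download the SRA file into the current directory (-X raises the maximum download size)
  prefetch -O ./ -X 999999999 "$1"

  # Convert to gzipped fastq if the download succeeded, then remove the SRA file
  if [[ -e "${1}.sra" ]]; then
    parallel-fastq-dump -s "${1}.sra" -t 8 -O ./ --tmpdir ./ --split-3 --gzip && rm "${1}.sra"
  else
    echo "[ERROR] $1 apparently not successfully loaded" >&2 && return 1
  fi
}; export -f LoadDump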

cat IDs.txt | parallel -j 2 "LoadDump {}"

This would use 8 threads for fastq conversion and run two SRA files at a time via GNU parallel, hence requiring 16 threads. As always, scale up or down based on the available resources and potential I/O bottlenecks on your system.
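The arithmetic behind that choice can be sketched as follows. The helper name jobs_for is hypothetical; it simply divides the available threads by the threads per conversion and never goes below one job. The core count is read with nproc on Linux, falling back to sysctl on Mac.

```shell
# Suggest how many SRA files to process at once (function name is hypothetical).
# Usage: jobs_for TOTAL_THREADS THREADS_PER_JOB
jobs_for() {
  if [ "$1" -lt "$2" ]; then
    echo 1                       # always run at least one job
  else
    echo $(( $1 / $2 ))          # integer division
  fi
}

# Number of cores: nproc on Linux, sysctl on Mac.
cores=$(nproc 2>/dev/null || sysctl -n hw.ncpu)
echo "with 8 threads per conversion, run $(jobs_for "$cores" 8) file(s) at a time"
```

The result would then be the value passed to parallel -j. It only accounts for CPU threads, not for disk or network limits.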

Written 12 months ago by ATpoint (modified 8 days ago)

Good work! I once had the pleasure to use fastq-dump on whole-genome data. I was cursing in multiple languages! :D

Written 12 months ago by Eric Lim (modified 12 months ago)

I was recently downloading the data from the CCLE experiment and it was taking ages (also crashing more than once) with sra-toolkit and fastq-dump. I used your approach, modifying it slightly, and it worked wonders! Thanks a lot!

My modification below:

esearch -db sra -query PRJNA523380 | efetch --format runinfo |  grep ${tissue_of_interest} | grep ${experiment} | cut -f1 -d',' | xargs | sed 's/ / OR /g' | xclip -selection c

where ${tissue_of_interest} and ${experiment} were variables I set up specifically for my needs (i.e. CERVIX, RNA-seq). I copied this into sra-explorer. The project has too many files to search sra-explorer directly with its ID.

Written 10 days ago by kzkedzierska (modified 10 days ago)

Glad to hear it is used productively :)

Written 10 days ago by ATpoint

Good work dude. Will use this next time I need to get data from ENA.

Written 12 months ago by Kevin Blighe

I've recently started to download FASTQ files via Aspera, but I am using ena-file-downloader.jar. That is the GitHub link, and it is also accessible from the "Bulk Download Files" button on the same website. It has a GUI, you can choose between FTP and Aspera, and you can specify Aspera parameters. Do you think that the speed would differ between the application and the terminal?

Written 4 months ago by Batu

They probably both use the same Aspera server so speed is probably similar. Question would be if this tool you mention allows parallel downloads of several files.

Written 4 months ago by ATpoint

Thank you very much for this useful tutorial. I followed the tutorial steps to download the GSE111653 dataset (BioProject accession number PRJNA437670). First, I downloaded the tarball of the Aspera client and then I ran tar zxvf /scratch/user/ye/ibm-aspera-connect-3.9.5.172984-linux-g2.12-64.tar.gz on Linux.

Then, after downloading PRJNA437670.txt file from ENA, I ran the below command: $ awk 'FS="\t", OFS="\t" { gsub("ftp.sra.ebi.ac.uk", "era-fasp@fasp.sra.ebi.ac.uk:"); print }' /scratch/user/ye/PRJNA437670.txt | cut -f3 | awk -F ";" 'OFS="\n" {print $1, $2}' | awk NF | awk 'NR > 1, OFS="\n" {print "ascp -QT -l 300m -P33001 -i $HOME/.aspera/connect/etc/asperaweb_id_dsa.openssh" " " $1 " ."}' > download.txt

So, now, I have only 4 files in my /scratch/user/ye/ directory as follows:

download.txt ibm-aspera-connect-3.9.5.172984-linux-g2.12-64.sh ibm-aspera-connect-3.9.5.172984-linux-g2.12-64.tar.gz PRJNA437670.txt

I then ran the below command to download the data: $ cat /scratch/user/ye/download.txt | parallel "{}"

However, I got the following error:

Academic tradition requires you to cite works you base your article on. When using programs that use GNU Parallel to process data for publication please cite: O. Tange (2011): GNU Parallel - The Command-Line Power Tool, ;login: The USENIX Magazine, February 2011:42-47.

This helps funding further development; AND IT WON'T COST YOU A CENT. If you pay 10000 EUR you should feel free to use GNU Parallel without citing.

To silence the citation notice: run 'parallel --citation'.
Can't exec "/bin/sh": Argument list too long at /local/software/biobuilds/2017.11/bin/parallel line 3981.
. .
Can't exec "/bin/sh": Argument list too long at /local/software/biobuilds/2017.11/bin/parallel line 3981.
/bin/bash: ascp: command not found
/bin/bash: ascp: command not found
. .
/bin/bash: ascp: command not found
Use of uninitialized value $opt::termseq in split at /local/software/biobuilds/2017.11/bin/parallel line 3608, <stdin> line 128.

Also, I tried:

$ while read LIST; do $LIST; done < /scratch/user/ye/download.txt

And I got many -bash: ascp: command not found messages

Could you please help me see what I did wrong and how to fix it? Thank you very much.

Written 10 days ago by F. Golestan

Did you run the installer script for Aspera? If not, do so.

chmod +x ibm-aspera-connect-3.9.5.172984-linux-g2.12-64.sh && ./ibm-aspera-connect-3.9.5.172984-linux-g2.12-64.sh

About the parallel command I cannot tell from here. Maybe simply use a loop to download the files, or install parallel again (e.g. via conda).

Written 10 days ago by ATpoint (modified 10 days ago)

Thanks for your reply. As you suggested, I ran the installer script for Aspera as below:

chmod +x /scratch/user/ye/ibm-aspera-connect-3.9.5.172984-linux-g2.12-64.sh && /scratch/user/ye/ibm-aspera-connect-3.9.5.172984-linux-g2.12-64.sh

Installing IBM Aspera Connect
Deploying IBM Aspera Connect (/home/user/.aspera/connect) for the current user only.
Install complete.

Then, I ran:

awk 'FS="\t", OFS="\t" { gsub("ftp.sra.ebi.ac.uk", "era-fasp@fasp.sra.ebi.ac.uk:"); print }' /scratch/user/ye/PRJNA437670.txt | cut -f3 | awk -F ";" 'OFS="\n" {print $1, $2}' | awk NF | awk 'NR > 1, OFS="\n" {print "ascp -QT -l 300m -P33001 -i $HOME/.aspera/connect/etc/asperaweb_id_dsa.openssh" " " $1 " ."}' > download.txt

Then

$ while read LIST; do $LIST; done < /scratch/user/download.txt

But, again:

-bash: ascp: command not found

and again

$ cat /scratch/user/download.txt | parallel "{}"

-bash: parallel: command not found

I would highly appreciate your help.

Written 8 days ago by F. Golestan (modified 8 days ago by Kevin Blighe)

You can likely access the Aspera binary explicitly via

/home/user/.aspera/connect/bin/ascp

Maybe add this to your PATH variable, or modify the awk command to include this full path.

For parallel to work, you will need to install GNU parallel.

Written 8 days ago by Kevin Blighe (modified 8 days ago)

Yep, it has to be in PATH, or you call it explicitly as Kevin Blighe says. Same goes for parallel. If not done already, install it or use the loop. In either case ascp has to be in PATH or called explicitly.

Written 8 days ago by ATpoint

Thanks for your guide. I did as below, But, again -bash: ascp: command not found:

$ tar zxvf /scratch/user/ye/ibm-aspera-connect-3.9.5.172984-linux-g2.12-64.tar.gz

$ export PATH="/home/user/.aspera/connect/bin/ascp":$PATH

$ chmod +x /scratch/user/ye/ibm-aspera-connect-3.9.5.172984-linux-g2.12-64.sh && /scratch/user/ye/ibm-aspera-connect-3.9.5.172984-linux-g2.12-64.sh

Installing IBM Aspera Connect
Deploying IBM Aspera Connect (/home/user/.aspera/connect) for the current user only.
Install complete.


$ awk 'FS="\t", OFS="\t" { gsub("ftp.sra.ebi.ac.uk", "era-fasp@fasp.sra.ebi.ac.uk:"); print }' /scratch/user/ye/PRJNA437670.txt | cut -f3 | awk -F ";" 'OFS="\n" {print $1, $2}' | awk NF | awk 'NR > 1, OFS="\n" {print "ascp -QT -l 300m -P33001 -i $HOME/.aspera/connect/etc/asperaweb_id_dsa.openssh" " " $1 " ."}' > download.txt

$ while read LIST; do $LIST; done < /home/fsgh1d18/download.txt
-bash: ascp: command not found

I can not see what I am doing wrong or missing. I really need your help to fix this problem. Many thanks.

Written 2 days ago by F. Golestan (modified 2 days ago by finswimmer)

I also wanted to add that I am using my university cluster which has a linux system. Many thanks.

Written 2 days ago by F. Golestan

Remove ascp from export PATH="/home/user/.aspera/connect/bin/ascp":$PATH. It should only be export PATH=/home/user/.aspera/connect/bin/:$PATH, and the quotation marks are not needed either.

The concept of PATH is that when you enter a command (tool name), the system scans all folders in PATH for the called tool/executable. Therefore, you only have to add the folder that contains ascp, not the full path to the executable.

Written 2 days ago by ATpoint (modified 2 days ago)

Thank you so much for your great help. I removed ascp from export command as you suggested. I also used -i /home/fsgh1d18/.aspera/connect/etc/asperaweb_id_dsa.openssh" " " $1 " ."}' > download.txt in awk command. Now it is perfectly working. Many thanks.

Written 1 day ago by F. Golestan

Glad to help, if there are other issues feel free to ask :)

Written 6 hours ago by ATpoint