Tutorial: Fast download of FASTQ files from the European Nucleotide Archive (ENA)
ATpoint (Germany) wrote, 16 months ago:

As questions on how to retrieve published sequencing data quickly and efficiently are posted here on Biostars quite frequently, this little tutorial demonstrates how to perform a bulk download of fastq files from the European Nucleotide Archive (ENA). Typically, people ask how to get a certain SRA file from NCBI and how to convert it to fastq. The common answer is prefetch followed by fastq-dump, but especially the latter is rather slow, so total file processing might take some time, especially if CPU (and disk) resources are limited. Luckily, most published (and unrestricted) sequencing data are mirrored at the ENA directly in fastq format, and there is a simple and efficient way to retrieve them. In this tutorial, we will, as an example, download an entire dataset of ChIP-seq and ATAC-seq data, requiring minimal preprocessing work. We will use the Aspera client for download rates of several tens of Mb/s up to a few hundred Mb/s (depending on the connection, I/O capacity and distance to the download location). This example code should work on Linux and Mac.


--- last modified: 1.8.19 ::: Added link to https://github.com/wwood/ena-fast-download


Step-1: Get the Aspera client

Go to https://downloads.asperasoft.com/en/downloads/8?list and get the most recent installer for your system. For Linux, it is a tarball (use tar zxvf to unpack) containing an installer shell script, and for Mac, a standard disk image.

After installation, these executables/files will be in their default locations:

Linux:

$HOME/.aspera/connect/bin/ascp --- the executable

$HOME/.aspera/connect/etc/asperaweb_id_dsa.openssh --- openssh file that we'll need later

Mac:

$HOME/Applications/Aspera\ Connect.app/Contents/Resources/ascp --- the executable

$HOME/Applications/Aspera\ Connect.app/Contents/Resources/asperaweb_id_dsa.openssh --- openssh file that we'll need later

In any case, make sure you add the folder with the ascp executable to your PATH. If PATH is a new word to you, please google it ;-)
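As a minimal sketch of the PATH step, assuming the default Linux install location listed above (on Mac, swap in the Resources folder of the Aspera Connect app), something like this could go into your shell session or your ~/.bashrc. The variable name ASPERA_BIN is just an illustration.

```shell
# Assumed default Linux install location; adjust if your installer put it elsewhere
ASPERA_BIN="$HOME/.aspera/connect/bin"

# Prepend the folder to PATH for the current session
export PATH="$ASPERA_BIN:$PATH"

# Report whether the shell can now find ascp
if command -v ascp >/dev/null 2>&1; then
  echo "ascp found at: $(command -v ascp)"
else
  echo "ascp not found, please check the install location"
fi
```

If the second message appears, the installer probably used a different folder than the one assumed here.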


Step-2: Choose your dataset

Short way: If you only have an accession number: https://github.com/wwood/ena-fast-download from benjwoodcroft, see his answer to this thread below.

Alternatively, query ENA / NCBI manually to find datasets: Once you know which data you want to download, check if they are backed up on the ENA, which is true for most unrestricted data. For this tutorial, we will download the entire dataset from the ChIPmentation paper of 2015. When you check the paper for the NCBI accession, you'll find GSE70482. Following this link, you find the BioProject accession number PRJNA288801. So you go to the ENA, enter PRJNA288801 in the search field and find a summary page with all available data for download. Scrolling down a bit, you see a table with accession numbers and all kinds of metadata. As we typically do not need most of these metadata, we use the field Select columns to select the essential metadata we need for the download, which are Study Accession, FASTQ files (FTP) and Experiment title. After selecting these and unselecting everything else, you press TEXT and save the file as accessions.txt in your project folder.

Edit 01/19: Also see "sra-explorer: find SRA and FastQ download URLs in a couple of clicks" from Phil Ewels for a nice interface to browse data on NCBI and ENA.


[Figure: Select columns on the ENA]

[Figure: accessions.txt]


Step-3: Download the data

As you'll see in accessions.txt, the download paths direct you to the ENA ftp-server, which is rather slow. We want to download with the Aspera client (up to 200Mb/s at my workplace). Therefore, we awk around a bit to change the download paths to the era-fasp server. As you'll see in case of paired-end data, the paths to the two mate fastq files in accessions.txt are separated by semicolon, which we take into account. The output of this snippet is download.txt.

Linux:

awk 'FS="\t", OFS="\t" { gsub("ftp.sra.ebi.ac.uk", "era-fasp@fasp.sra.ebi.ac.uk:"); print }' accessions.txt | cut -f3 | awk -F ";" 'OFS="\n" {print $1, $2}' | awk NF | awk 'NR > 1, OFS="\n" {print "ascp -QT -l 300m -P33001 -i $HOME/.aspera/connect/etc/asperaweb_id_dsa.openssh" " " $1 " ."}' > download.txt

Mac:

awk 'FS="\t", OFS="\t" { gsub("ftp.sra.ebi.ac.uk", "era-fasp@fasp.sra.ebi.ac.uk:"); print }' accessions.txt | cut -f3 | awk -F ";" 'OFS="\n" {print $1, $2}' | awk NF | awk 'NR > 1, OFS="\n" {print "ascp -QT -l 300m -P33001 -i \"$HOME/Applications/Aspera Connect.app/Contents/Resources/asperaweb_id_dsa.openssh\"" " " $1 " ."}' > download.txt

The output is a simple list of download commands using ascp.

[Figure: output download.txt]
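To make the path rewriting easier to follow, here is a small step-by-step sketch of the same idea on a toy table. The accession SRR0000001, the project ID PRJNA000000, the header names and the file names example_accessions.txt / example_paths.txt are made up for illustration, and the column order (FTP paths in column 3) is assumed to match accessions.txt from above.

```shell
# Toy table: header plus one hypothetical paired-end run (two paths joined by ";")
printf 'study_accession\texperiment_title\tfastq_ftp\n' > example_accessions.txt
printf 'PRJNA000000\tExample ChIP-seq\tftp.sra.ebi.ac.uk/vol1/fastq/SRR000/SRR0000001/SRR0000001_1.fastq.gz;ftp.sra.ebi.ac.uk/vol1/fastq/SRR000/SRR0000001/SRR0000001_2.fastq.gz\n' >> example_accessions.txt

# 1) skip the header, 2) keep the FTP column, 3) one path per line,
# 4) drop empty lines, 5) swap the FTP host for the era-fasp one (note the trailing ":")
tail -n +2 example_accessions.txt \
  | cut -f3 \
  | tr ';' '\n' \
  | awk 'NF' \
  | sed 's|^ftp\.sra\.ebi\.ac\.uk|era-fasp@fasp.sra.ebi.ac.uk:|' \
  > example_paths.txt

cat example_paths.txt
```

Each line of example_paths.txt is then a source path that can be handed to ascp as in the commands above.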

That's it. Now, we only have to run the download commands.

Edit (23.07.18): The download paths are always like era-fasp@fasp.sra.ebi.ac.uk:/vol1(...). I point that out because of a recent post (328182) where OP accidentally forgot the ":" after the .ac.uk and used fasp@ instead of era-fasp@.
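A quick sanity check for the two mistakes mentioned in this edit could look like the sketch below. The function name check_fasp_path is hypothetical, and the pattern only encodes the path layout described here (era-fasp@ user, ":" after .ac.uk, then /vol1).

```shell
# Return 0 if the path follows the era-fasp@fasp.sra.ebi.ac.uk:/vol1... layout, 1 otherwise
check_fasp_path() {
  case "$1" in
    era-fasp@fasp.sra.ebi.ac.uk:/vol1*) return 0 ;;
    *) return 1 ;;
  esac
}

# Try it on one well-formed path and one with both typos (wrong user, missing ":")
for P in "era-fasp@fasp.sra.ebi.ac.uk:/vol1/fastq/example" "fasp@fasp.sra.ebi.ac.uk/vol1/fastq/example"; do
  if check_fasp_path "$P"; then
    echo "OK:  $P"
  else
    echo "BAD: $P"
  fi
done
```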

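Let's download:

## Either by a simple loop (eval lets the shell expand $HOME inside each stored command;
## stdin is redirected so that a command cannot swallow the rest of the list):
while read -r CMD; do
  eval "$CMD" < /dev/null
done < download.txt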

## or by using GNU parallel to have things parallelized:
cat download.txt | parallel "{}"

Once the download is complete, one can play around using the accessions.txt to rename the files with e.g. information from the Experiment title field (column 2), or other metadata you may retrieve from ENA.
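As one possible way to do such a renaming, the sketch below writes mv commands to a file for inspection instead of renaming anything directly. The toy table, its header names, the accession SRR0000001 and the file names example_titles.txt / rename_commands.txt are made up for illustration; the column order (title in column 2, FTP paths in column 3) is assumed to match accessions.txt.

```shell
# Toy table with one hypothetical single-end run
printf 'study_accession\texperiment_title\tfastq_ftp\n' > example_titles.txt
printf 'PRJNA000000\tExample ATAC-seq rep1\tftp.sra.ebi.ac.uk/vol1/fastq/SRR000/SRR0000001/SRR0000001.fastq.gz\n' >> example_titles.txt

awk 'BEGIN { FS = "\t" }
NR > 1 {
  title = $2
  gsub(/[^A-Za-z0-9._-]+/, "_", title)     # make the title safe for file names
  n = split($3, paths, ";")                # paired-end runs list two paths
  for (i = 1; i <= n; i++) {
    if (paths[i] == "") continue
    m = split(paths[i], parts, "/")        # last path component is the file name
    print "mv " parts[m] " " title "_" parts[m]
  }
}' example_titles.txt > rename_commands.txt

cat rename_commands.txt
```

After checking that rename_commands.txt looks sensible, it can be run with bash rename_commands.txt in the folder holding the downloaded files.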


Edit 28.2.19: For the sake of completeness, I also add a suggestion on how to get the same data from NCBI using prefetch and parallel-fastq-dump, a wrapper for fastq-dump from Renan Valieris for parallelized fastq conversion from sra files. Say one has a file IDs.txt which contains the SRA file IDs like:

SRRXXXXXX1
SRRXXXXXX2
(...)
SRRXXXXXXn

one can use this simple function to download SRA files via prefetch (please see the NCBI documentation on how to use Aspera with prefetch to avoid slow FTP downloads), followed by fastq conversion with parallel-fastq-dump.

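function LoadDump {
  # Download the .sra file for run ID $1 into the current directory
  prefetch -O ./ -X 999999999 "$1"

  if [[ -e "${1}.sra" ]]; then
    # Convert to gzipped fastq with 8 threads; delete the .sra only if that worked
    parallel-fastq-dump -s "${1}.sra" -t 8 -O ./ --tmpdir ./ --split-3 --gzip && rm "${1}.sra"
  else
    echo "[ERROR] $1 apparently not successfully loaded" >&2 && return 1
  fi
}; export -f LoadDump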

cat IDs.txt | parallel -j 2 "LoadDump {}"

This would use 8 threads for fastq conversion and run two SRA files at a time via GNU parallel, hence requiring 16 threads. As always, scale up or down based on the available resources and potential I/O bottlenecks on your system.
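The thread arithmetic (jobs times threads per job should not exceed the available cores) can be sketched like this. The variable names are hypothetical, and getconf _NPROCESSORS_ONLN is used here only as one way to count cores on both Linux and Mac.

```shell
THREADS_PER_JOB=8                        # threads given to each fastq conversion
CORES=$(getconf _NPROCESSORS_ONLN)       # cores currently online on this machine

# Number of SRA files to process at once, but never fewer than one
JOBS=$(( CORES / THREADS_PER_JOB ))
if [ "$JOBS" -lt 1 ]; then
  JOBS=1
fi

echo "Cores: $CORES, threads per job: $THREADS_PER_JOB, parallel jobs: $JOBS"
```

The resulting number could then be passed to parallel via -j "$JOBS"; on a machine with fewer cores than THREADS_PER_JOB, lowering the thread count is probably the better choice.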

modified 10 weeks ago by hermidalc • written 16 months ago by ATpoint

I was recently downloading the data from the CCLE experiment and it was taking ages (also crashing more than once) with sra-toolkit and fastq-dump. I used your approach, modifying it slightly, and it worked wonders! Thanks a lot!

My modification below:

esearch -db sra -query PRJNA523380 | efetch --format runinfo |  grep ${tissue_of_interest} | grep ${experiment} | cut -f1 -d',' | xargs | sed 's/ / OR /g' | xclip -selection c

where ${tissue_of_interest} and ${experiment} were variables I set up specifically for my needs (i.e. CERVIX, RNA-seq). I copied this into sra-explorer. The project has too many files to search sra-explorer directly with its ID.

modified 4 months ago • written 4 months ago by kzkedzierska

Glad to hear it is used productively :)

written 4 months ago by ATpoint

Good work! I once had the pleasure to use fastq-dump on whole-genome data. I was cursing in multiple languages! :D

modified 16 months ago • written 16 months ago by Eric Lim

Good work dude. Will use this next time I need to get data from ENA.

written 16 months ago by Kevin Blighe

Thank you Sir, glad to help!

written 16 months ago by ATpoint

Thanks for the tutorial

written 3 months ago by eraheris

Has anyone noticed why, for the same run, FASTQ files downloaded from ENA are not the same size as those produced from SRA? Is there an obvious reason why that I might've missed?

My SRA procedure:

prefetch SRR8112647
vdb-validate SRR8112647.sra
parallel-fastq-dump --threads 12 --sra-id SRR8112647.sra --gzip --skip-technical --readids --read-filter pass --dumpbase --split-3 --clip

The fastq-dump options above are definitely recommended to get the kind of FASTQs from .sra files that are ready for mapping (see e.g. https://edwards.sdsu.edu/research/fastq-dump/).

If I gunzip those files and compare them to the gunzipped ENA ones, they aren't the same. Which source is the correct one?

modified 10 weeks ago • written 10 weeks ago by hermidalc

From ENA:

pigz -c -d -p 8 SRR8112647_1.fastq.gz | wc -l
273524936

From SRA with your command:

pigz -c -d -p 8 SRR8112647_pass_1.fastq.gz | wc -l
273524936

I don't see any difference. Can you be more specific what is "different" between them? Just use any of them. Don't overthink things. Just download the ENA one and start your alignment (or whatever you plan to do).

modified 10 weeks ago • written 10 weeks ago by ATpoint

You counted lines, that's not going to tell you everything. The sizes are very different, and I thought maybe because ENA used higher compression but in my previous post the uncompressed are also very different in size.

4613823652 Sep  6 18:25 SRR8112647_1.fastq.gz
5088093136 Sep  6 18:18 SRR8112647_pass_1.fastq.gz

And if you check md5 instead of lines:

pigz -c -d -p 8 SRR8112647_pass_1.fastq.gz | md5sum
5b4365ef3897dffefe3e739572d2583c  -
pigz -c -d -p 8 SRR8112647_1.fastq.gz | md5sum
1e420a8f7e068b739ea38dbbac402bf2  -

I don't think it's overthinking to wonder why they aren't identical, since ENA is getting these data from SRA anyway, so aren't they doing the exact steps we just did and simply saving others time? Then they should've gotten the exact same result. Maybe they are doing some other processing of the files, would like to know what it might be.

modified 10 weeks ago • written 10 weeks ago by hermidalc

Found the answer, which I suspected when I was writing back, that they modified the FASTQ entry headers:

$ pigz -c -d -p 8 SRR8112647_1.fastq.gz | head -20
@SRR8112647.1 1/1
GNGACACAGAGGTCCAGCCCCAGAACTTGTAAGGATTTTGTTTGAACACTGAGCAGATGCCTCCTCCCTGCCAACCATCACACTAGTTAGGGCTGGCCATGAATTCTATGCCAGAGTCACTCCTNCAGTCTGCTAGGGGTGAGCCTTCTT
+
A#AAAAAAAAAAAFJF<AA7AAFJJFFFF7FFJAFAAJFJFFFF-A-F7-FF-<F7FF<<AF<AAJFFAFF-A---<7-<7AA7FF-AJAJ7AAAF7F77<FJJF-FFFAJ-AAJ<-<<777-<#A<777-7--A<A-7-A---7F-7--


$ pigz -c -d -p 8 SRR8112647_pass_1.fastq.gz | head -20
@SRR8112647.1.1 1 length=150
GNGACACAGAGGTCCAGCCCCAGAACTTGTAAGGATTTTGTTTGAACACTGAGCAGATGCCTCCTCCCTGCCAACCATCACACTAGTTAGGGCTGGCCATGAATTCTATGCCAGAGTCACTCCTNCAGTCTGCTAGGGGTGAGCCTTCTT
+SRR8112647.1.1 1 length=150
A#AAAAAAAAAAAFJF<AA7AAFJJFFFF7FFJAFAAJFJFFFF-A-F7-FF-<F7FF<<AF<AAJFFAFF-A---<7-<7AA7FF-AJAJ7AAAF7F77<FJJF-FFFAJ-AAJ<-<<777-<#A<777-7--A<A-7-A---7F-7--
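
To confirm that only the headers differ, one could checksum just the sequence and quality lines of both files. The sketch below does this on two tiny made-up records that mimic the two header styles; the function name seq_qual_cksum and the example file names are hypothetical, and cksum is used only because it is available on both Linux and Mac.

```shell
# Two toy one-read files: ENA-style header vs. fastq-dump-style header, same read
printf '@SRRX.1 1/1\nACGT\n+\nIIII\n' | gzip -c > ena_example.fastq.gz
printf '@SRRX.1.1 1 length=4\nACGT\n+SRRX.1.1 1 length=4\nIIII\n' | gzip -c > sra_example.fastq.gz

# Checksum only line 2 (sequence) and line 4 (quality) of every 4-line record
seq_qual_cksum() {
  gzip -cd "$1" | awk 'NR % 4 == 2 || NR % 4 == 0' | cksum
}

if [ "$(seq_qual_cksum ena_example.fastq.gz)" = "$(seq_qual_cksum sra_example.fastq.gz)" ]; then
  echo "Sequences and qualities are identical, only the headers differ"
else
  echo "Sequences or qualities differ"
fi
```

On the real files, pigz -c -d could replace gzip -cd inside the function for speed.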

modified 10 weeks ago by ATpoint • written 10 weeks ago by hermidalc

benjwoodcroft wrote, 3 months ago:

Hi,

This is a very helpful post - thanks a lot for writing it. I wrote a simple Python script based on this which automates things so you only need to provide a run identifier as an argument and it works out the rest - hopefully someone will find it useful.

https://github.com/wwood/ena-fast-download

written 3 months ago by benjwoodcroft

Cool, very useful if you want to batch-query many accession numbers. You might want to add an option like --linux and --osx to output the correct path to the default aspera openssh file as default paths are a bit different in both operating systems. If you want to search NCBI also check out sra-explorer : find SRA and FastQ download URLs in a couple of clicks from Phil Ewels which has an option to print ENA links directly.

written 3 months ago by ATpoint

Good idea - I added an --ssh_key option along those lines. I've not had a chance to test it on OSX though - if that is straightforward for you would you mind giving it a crack please?

written 3 months ago by benjwoodcroft

Works like a charm! Just a small thing: maybe add a kind of exists(--ssh_key) and isinPATH(ascp) check (no clue what the commands in Python are :-D ) that stops the run right away if the key does not exist or ascp is not found. Everything else works really nicely!
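
Just to illustrate the kind of check I mean, here is a rough sketch in shell (not how ena-fast-download does or should implement it). The function name preflight is made up, and the key path in the usage line is the default Linux location from the tutorial above.

```shell
# Stop early if ascp is not in PATH or the key file does not exist
preflight() {
  key="$1"
  if ! command -v ascp >/dev/null 2>&1; then
    echo "[ERROR] ascp not found in PATH" >&2
    return 1
  fi
  if [ ! -f "$key" ]; then
    echo "[ERROR] key file not found: $key" >&2
    return 1
  fi
  return 0
}

# Usage: only start downloading if both checks pass
if preflight "$HOME/.aspera/connect/etc/asperaweb_id_dsa.openssh"; then
  echo "All good, ready to download"
else
  echo "Pre-flight check failed, not starting any download"
fi
```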

modified 3 months ago • written 3 months ago by ATpoint