Question

How to download RNA-seq dataset (fastq.gz files) from GEO and SRA databases?

0

4.8 years ago

Farah ▴ 80

Hello,

I need to download some RNA-seq fastq.gz files from both the GEO and SRA databases. How can I download these datasets?

Thank you very much.

Best wishes

fastq SRA RNA-Seq GEO • 6.6k views

updated 13 months ago by Ram 43k • written 4.8 years ago by Farah ▴ 80

score 2 · Answer 1 · 2019-07-04

4.8 years ago

ATpoint 81k

Read and follow "Fast download of FASTQ files from the European Nucleotide Archive (ENA)". It covers two alternative strategies: using the SRA Toolkit, or downloading the fastq files directly from the European Nucleotide Archive. I agree with Benn, though, that the search function would have directed you to many previous threads on this ;-)
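As a rough illustration of the direct-download route, the sketch below pulls the FASTQ links out of an ENA file report and prints them as plain URLs. It assumes the report is tab-separated with a header row, that it contains a column named fastq_ftp, and that entries in that column are separated by ";" and carry no scheme prefix. The function name ena_fastq_urls is made up for this example.

```shell
#!/usr/bin/env bash
# Sketch: list FASTQ URLs from an ENA file report (TSV with a header row).
# Assumptions: a column named "fastq_ftp"; multiple files separated by ";".
# ena_fastq_urls is a hypothetical helper name, not part of any toolkit.
ena_fastq_urls() {
  awk -F '\t' '
    NR == 1 {                       # locate the fastq_ftp column by name
      for (i = 1; i <= NF; i++) if ($i == "fastq_ftp") col = i
      next
    }
    col && $col != "" {             # skip rows without FASTQ links
      n = split($col, parts, ";")
      for (j = 1; j <= n; j++) if (parts[j] != "") print "ftp://" parts[j]
    }
  ' "$1"
}
```

For example, `ena_fastq_urls PRJNA437670.txt > urls.txt` followed by `wget -nc -i urls.txt` would fetch the files over plain FTP. That is typically slower than Aspera but needs no extra client.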

0

Thank you very much for the useful tutorial link. I followed the tutorial steps to download the GSE111653 dataset (BioProject accession PRJNA437670). First, I downloaded the Aspera Connect tarball and then, on Linux, ran:

tar zxvf /scratch/user/ye/ibm-aspera-connect-3.9.5.172984-linux-g2.12-64.tar.gz

Then, after downloading the PRJNA437670.txt file from ENA, I ran the command below:

$ awk 'FS="\t", OFS="\t" { gsub("ftp.sra.ebi.ac.uk", "era-fasp@fasp.sra.ebi.ac.uk:"); print }' /scratch/user/ye/PRJNA437670.txt |
    cut -f3 |
    awk -F ";" 'OFS="\n" {print $1, $2}' |
    awk NF |
    awk 'NR > 1, OFS="\n" {print "ascp -QT -l 300m -P33001 -i $HOME/.aspera/connect/etc/asperaweb_id_dsa.openssh" " " $1 " ."}' > download.txt

So now I have only four files in my /scratch/user/ye/ directory:

download.txt
ibm-aspera-connect-3.9.5.172984-linux-g2.12-64.sh
ibm-aspera-connect-3.9.5.172984-linux-g2.12-64.tar.gz
PRJNA437670.txt

I then ran the command below to download the data:

$ cat /scratch/user/ye/download.txt | parallel "{}"

However, I got the following error:

Academic tradition requires you to cite works you base your article on. When using programs that use GNU Parallel to process data for publication please cite: O. Tange (2011): GNU Parallel - The Command-Line Power Tool, ;login: The USENIX Magazine, February 2011:42-47.

This helps funding further development; AND IT WON'T COST YOU A CENT. If you pay 10000 EUR you should feel free to use GNU Parallel without citing.

To silence the citation notice: run 'parallel --citation'.

Can't exec "/bin/sh": Argument list too long at /local/software/biobuilds/2017.11/bin/parallel line 3981.
.
.
Can't exec "/bin/sh": Argument list too long at /local/software/biobuilds/2017.11/bin/parallel line 3981.
/bin/bash: ascp: command not found
/bin/bash: ascp: command not found
.
.
/bin/bash: ascp: command not found
Use of uninitialized value $opt::termseq in split at /local/software/biobuilds/2017.11/bin/parallel line 3608, <stdin> line 128.

Also, I tried:

$ while read LIST; do $LIST; done < /scratch/user/ye/download.txt

I got many "-bash: ascp: command not found" messages.

Could you please help me work out what I did wrong and how to fix it? Thank you very much.

4.8 years ago by Farah ▴ 80
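The "ascp: command not found" lines mean the shell cannot find the Aspera ascp binary on PATH. Extracting the tarball only unpacks the installer script (the .sh file in the listing above), and the post does not mention running it. A small check along the following lines could confirm this before launching the downloads. It assumes Aspera Connect installs under $HOME/.aspera/connect, as the key path in the ascp command suggests. The bin subdirectory and the helper name find_ascp are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Sketch: locate ascp before running the generated download commands.
# Assumption: Aspera Connect lives in $HOME/.aspera/connect (taken from the
# key path used in the thread); the bin/ subdirectory is assumed, and
# find_ascp is a hypothetical helper name.
find_ascp() {
  local connect_dir="${1:-$HOME/.aspera/connect}"
  if command -v ascp >/dev/null 2>&1; then
    command -v ascp                  # already on PATH
  elif [ -x "$connect_dir/bin/ascp" ]; then
    echo "$connect_dir/bin/ascp"     # installed, but not on PATH
  else
    echo "ascp not found; run the Aspera Connect installer script first" >&2
    return 1
  fi
}
```

If the function prints a path outside PATH, something like `export PATH="$(dirname "$(find_ascp)"):$PATH"` in the same shell session would make the commands in download.txt resolvable.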

score 1 · Answer 2 · 2019-07-04

4.8 years ago

Benn 8.3k

Please read previous posts first: How to download raw sequence data from GEO/SRA
