Tutorial: How to download raw sequence data from GEO/SRA
Obi Griffith (Washington University, St Louis, USA) wrote 2.8 years ago:

Suppose you want to download some raw sequence data in fastq format from GEO/SRA and run it through an appropriate aligner (BWA, TopHat, STAR, etc.) and then a variant caller (Strelka, etc.) or another analysis pipeline. How do you get started? First things first: you need the sequence data.

I will use the data released along with the following publication as an example:
Daemen A*, Griffith OL* et al. 2013. Modeling precision treatment of breast cancer. Genome Biology. 14:R110.

Data were deposited at GEO/SRA and are accessible through the GEO data set super-series GSE48216, which comprises sub-series for RNA-seq (GSE48213) and Exome-seq (GSE48215). From there you can link to the relevant SRA projects: SRP026537 for RNA-seq and SRP026538 for Exome-seq.

You can download the raw data using the SRA toolkit. Please read:

For example, to get fastq files for the T47D exome cell line data you could do something like the following:
Find the appropriate GEO record for T47D from the GEO data set sub-series page for GSE48215 listed above.

Under 'Relations' is a link to the corresponding SRA page:

Note: You can also find this SRX record page directly from the SRA project page for SRP026538 listed above.

Determine the SRR number and then download the data at the command-line with:

prefetch -v SRR925811

Note where the .sra file is downloaded (by default, /home/[USER]/ncbi/public/sra/) and then convert it to fastq with something like the following:

fastq-dump --outdir /opt/fastq/ --split-files /home/[USER]/ncbi/public/sra/SRR925811.sra
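
The two steps above can be wrapped in a small shell function. This is only a sketch: the names sra_path and fetch_fastq are made up for illustration, and the cache location is assumed to be the default mentioned above ($HOME/ncbi/public/sra), which may differ on your installation.

```shell
# Hypothetical helper: print where prefetch is assumed to store a run
# (the default location mentioned above; adjust if yours differs).
sra_path() {
    printf '%s/ncbi/public/sra/%s.sra\n' "$HOME" "$1"
}

# Hypothetical helper: download one run, then convert it to fastq.
# Usage: fetch_fastq SRR925811 /opt/fastq
fetch_fastq() {
    run=$1
    outdir=$2
    # Stop here if the download fails, so fastq-dump is not run on a missing file.
    prefetch -v "$run" || return 1
    fastq-dump --outdir "$outdir" --split-files "$(sra_path "$run")"
}
```

Calling fetch_fastq SRR925811 /opt/fastq/ would then run the same two commands as above for the T47D exome run.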

This should produce two fastq files (one for R1 and one for R2). That will give you the raw exome sequence data for the T47D cell line. A very similar process should work for any RNA-seq samples that you want.

If you want to start with sam/bam files you can use sam-dump instead of fastq-dump. But note that these will still just contain the unaligned raw sequence data. You will still need to run through an aligner and variant caller.

If you just want to write the first X raw (fastq) reads from a particular run to standard output, you can use a command like the following. This is useful for taking a quick look at some reads, obtaining some reads for testing, or checking whether the SRA toolkit is working for you at all.

fastq-dump -X 5 -Z SRR925811
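
As a quick sanity check, the output of that command can be piped into a record counter. The sketch below assumes standard four-line FASTQ records; count_fastq_records is a hypothetical helper name, not part of the SRA toolkit.

```shell
# Hypothetical helper: count FASTQ records arriving on stdin,
# assuming each record occupies exactly four lines.
count_fastq_records() {
    awk 'END { print int(NR / 4) }'
}
```

For example: fastq-dump -X 5 -Z SRR925811 | count_fastq_records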


ADD COMMENTlink modified 4 months ago by al-ash0 • written 2.8 years ago by Obi Griffith15k

Whoa!  How did I not know about prefetch - that is super handy.  Thanks for this tutorial.

ADD REPLYlink written 19 months ago by Josh Herr5.4k
Istvan Albert (University Park, USA) wrote 2.8 years ago:

A few more tips are in the thread "How To Download All Sra Samples At Once?".

I recently needed the same functionality and came up with a one-liner that gets all the data from a BioProject. It requires Entrez Direct (see "Ncbi Releases Entrez Direct, The Entrez Utilities On The Unix Command Line") and the SRA toolkit (although the former package could easily be replaced with simple wget commands).

As a demo, the command below downloads only the first 5 datasets and only 10 spots from each; removing those limits will get all 216 files and hundreds of millions of spots:

esearch -db sra -query PRJNA40075  | efetch --format runinfo | cut -d ',' -f 1 | grep SRR | head -5 | xargs fastq-dump -X 10 --split-files
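
The one-liner can be easier to debug when split into named stages. A minimal sketch, assuming (as the one-liner does) that the run accession is the first column of the runinfo CSV; runs_from_runinfo and list_runs are hypothetical names:

```shell
# Hypothetical helper: read a runinfo CSV on stdin and print run accessions,
# assuming they sit in the first column.
runs_from_runinfo() {
    cut -d ',' -f 1 | grep SRR
}

# Hypothetical helper: list the runs of a BioProject, one per line.
# Usage: list_runs PRJNA40075 | head -5
list_runs() {
    esearch -db sra -query "$1" | efetch --format runinfo | runs_from_runinfo
}
```

Running list_runs PRJNA40075 on its own shows exactly which accessions would be handed to fastq-dump before anything is downloaded.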

More adventurous people might want to pipe the output into GNU Parallel (see "Gnu Parallel - Parallelize Serial Command Line Programs Without Changing Them") instead of xargs, thus fully saturating their bandwidth.
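
If GNU Parallel is not installed, a similar effect can be approximated with the -P option of xargs (available in GNU and BSD xargs, though not in POSIX). Note that this swaps in xargs -P rather than GNU Parallel itself. A sketch; dump_in_parallel and the DUMP_CMD override are names invented here:

```shell
# Hypothetical helper: read run accessions on stdin and convert up to
# $1 of them at the same time; any further arguments are passed on to
# fastq-dump. DUMP_CMD is an illustrative override so the function can
# be tried without the SRA toolkit (e.g. DUMP_CMD=echo).
dump_in_parallel() {
    njobs=${1:-2}
    if [ $# -gt 0 ]; then
        shift
    fi
    # -n 1: one accession per invocation; -P: number of parallel jobs
    xargs -n 1 -P "$njobs" "${DUMP_CMD:-fastq-dump}" --split-files "$@"
}
```

For example, the tail of the one-liner could become: ... | grep SRR | head -5 | dump_in_parallel 2 -X 10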

ADD COMMENTlink modified 2.8 years ago • written 2.8 years ago by Istvan Albert ♦♦ 71k

Nice!  Thanks for adding this.

ADD REPLYlink written 2.8 years ago by Obi Griffith15k

Where can I get efetch? I get an error when using this command. Please help me with this. Regards, Bandana

ADD REPLYlink written 8 months ago by bandanaschapagain10

I am very new to this field. I got esearch, but when I run

esearch -db sra -query PRJNA281410 | efetch --format runinfo | cut -d ',' -f 1 | grep SRR | head -5 | xargs ~/bin/sratoolkit.2.7.0-centos_linux64/bin/fastq-dump -X 10 --split-files

it gives me an error like:

param empty while validating argument list - expected accession

PRJNA281410 is the project ID I am using. Please help me with this.

ADD REPLYlink written 8 months ago by bandanaschapagain10
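
The message suggests that fastq-dump was started without any accession. One way this can happen (an assumption, not confirmed in this thread) is that the pipeline in front of xargs produced no lines, for example because efetch failed or no line matched grep SRR; GNU xargs still runs the command once on empty input unless -r is given. A sketch of a guard; dump_if_any and DUMP_CMD are hypothetical names:

```shell
# Hypothetical helper: read accessions on stdin and call fastq-dump
# only if at least one was received. DUMP_CMD is an illustrative
# override for trying the function without the SRA toolkit.
dump_if_any() {
    runs=$(cat)
    if [ -z "$runs" ]; then
        echo "no run accessions received; check the esearch/efetch output" >&2
        return 1
    fi
    # $runs is deliberately unquoted so each accession becomes one argument
    ${DUMP_CMD:-fastq-dump} -X 10 --split-files $runs
}
```

Replacing the xargs stage with dump_if_any makes an empty accession list fail with a clear message instead of the fastq-dump error above.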
Tulip Nandu (United States) wrote 2.4 years ago:

One option is to download the fastq files directly and then follow the pipeline. For some exome sequencing data, fastq-dump does not work efficiently or gives errors, because the deposited data are in BAM format and require a genome file. From the website below we can directly download the fastq files for all sequencing experiments in GEO.


ADD COMMENTlink written 2.4 years ago by Tulip Nandu50
Shicheng Guo wrote 16 months ago:

How can I change the default download folder for prefetch? Usually the home directory is not big enough for a large number of SRA files.


Please make sure your sra-tools is always the latest version, or else some strange errors may occur!

Eventually, I found that the cache directory can be reset with the following steps:

1. Go to the bin directory, such as: /home/sratoolkit.2.5.5-centos_linux64/bin
2. Run ./vdb-config -i
3. Change the folder under Workspace Name to a location on a large hard disk.

Re-run fastq-dump, and the cache files will go to the newly set directory.

Remember: large files will accumulate in /home/shg047/oasis/ncbi/public/sra/sra, so be sure to remove them periodically.

ADD COMMENTlink modified 6 months ago • written 16 months ago by Shicheng Guo4.4k

I also encountered this problem.

ADD REPLYlink written 16 months ago by zengxi.hada70
macmath wrote 15 months ago:

When I performed this

fastq-dump --outdir ~/Documents/USER/Re-analysisGPS2/fastq/ --split-files /home/USER/ncbi/public/sra/SRR1291261.sra

I did not get two fastq files but only one, for example SRR1291261_1.fastq.

What could be the reason? Is it because the data are not paired-end?

ADD COMMENTlink modified 15 months ago • written 15 months ago by macmath110

You got a single file because this is a single-end data set (see here as well: http://www.ebi.ac.uk/ena/data/view/SRR1291261 ).
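
After conversion, the number of per-mate files that fastq-dump --split-files wrote can be checked with a few lines of shell. A sketch, assuming the RUN_1.fastq / RUN_2.fastq naming seen in this thread; count_mate_files is a hypothetical helper:

```shell
# Hypothetical helper: count the RUN_*.fastq files in an output directory.
# One file suggests single-end data, two suggest paired-end data.
count_mate_files() {
    outdir=$1
    run=$2
    n=0
    for f in "$outdir/${run}"_*.fastq; do
        # An unmatched glob stays literal, so test that the file exists.
        if [ -e "$f" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}
```

For example: count_mate_files ~/Documents/USER/Re-analysisGPS2/fastq SRR1291261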

ADD REPLYlink modified 15 months ago • written 15 months ago by genomax29k
Payal wrote 5 months ago:

Addition: (Further Downstream RNA-Seq Analysis using Galaxy)

In case anybody wants to get files from SRA directly into Galaxy for further analysis, please check out this other video too: "Video: RNA-Seq Alignment and Visualization using Galaxy and IGB". It explains how to get data from SRA directly into Galaxy using tools in Galaxy itself.

I found this one useful, so I thought I would share it.

ADD COMMENTlink written 5 months ago by Payal10
Obi Griffith (Washington University, St Louis, USA) wrote 16 months ago:

I'm guessing you need to use the vdb-config option: http://www.ncbi.nlm.nih.gov/Traces/sra/?view=toolkit_doc&f=vdb-config

You may need root privileges. It may also be possible to create a symlink at the current path that points to a new path.
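
The symlink idea could look roughly like the sketch below. It is untested against the SRA toolkit itself; relocate_dir is a hypothetical helper, and both paths should be absolute so the link does not dangle.

```shell
# Hypothetical helper: move directory $1 to $2 and leave a symlink at $1
# that points to the new location. Use absolute paths for both arguments.
relocate_dir() {
    src=$1
    dest=$2
    if [ -L "$src" ]; then
        echo "$src is already a symlink" >&2
        return 1
    fi
    if [ -e "$dest" ]; then
        echo "$dest already exists; refusing to overwrite" >&2
        return 1
    fi
    mkdir -p "$(dirname "$src")" "$(dirname "$dest")" || return 1
    if [ -d "$src" ]; then
        mv "$src" "$dest" || return 1   # keep any files already downloaded
    else
        mkdir "$dest" || return 1       # nothing to move yet
    fi
    ln -s "$dest" "$src"
}
```

For example, relocate_dir "$HOME/ncbi" /bigdisk/ncbi, where /bigdisk is a placeholder for a volume with enough space.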

ADD COMMENTlink written 16 months ago by Obi Griffith15k
Biojl wrote 14 months ago:

Hi, I modified the code a bit so that I can download several experiments listed in a download_list.txt (sample+'\t'+SRXcode). Unfortunately it only works for the first item in the list. It must be something in the loop, because the commands work fine separately. Any ideas? Thanks.

# -*- coding: utf-8 -*-

while read p1 p2; do
#Get SRR
srr="$(esearch -db sra -query $p2 | efetch -format runinfo | cut -d ',' -f 1 | grep SRR)"
prefetch $srr && vdb-validate $srr && fastq-dump --split-files $srr
done <download_list.txt
ADD COMMENTlink modified 14 months ago • written 14 months ago by Biojl1.5k
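
A common cause of a while-read loop stopping after one iteration is that a command inside the loop reads from the same standard input as read and swallows the remaining lines. Whether that is what happens here is an assumption, not something confirmed in this thread. The sketch below guards against it by reading the list on a separate file descriptor and giving the inner command an empty stdin; for_each_entry and fetch_one are hypothetical names.

```shell
# Hypothetical helper: call a command once per "sample<TAB>SRX" line of a file.
# Usage: for_each_entry download_list.txt fetch_one
for_each_entry() {
    list=$1
    shift
    # Read the list on file descriptor 3 so stdin stays untouched;
    # the || clause also handles a last line without a trailing newline.
    while read -r sample srx <&3 || [ -n "$sample" ]; do
        [ -n "$srx" ] || continue          # skip blank or one-column lines
        # Give the command an empty stdin so it cannot consume the list.
        "$@" "$sample" "$srx" </dev/null
    done 3<"$list"
}

# Hypothetical per-entry command, built from the loop body posted above.
fetch_one() {
    srr=$(esearch -db sra -query "$2" | efetch -format runinfo | cut -d ',' -f 1 | grep SRR)
    # $srr is unquoted on purpose: one experiment may contain several runs
    prefetch $srr && vdb-validate $srr && fastq-dump --split-files $srr
}
```
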

When building any type of script, make sure to debug it first by putting an echo in front of each command. That way you will see what actually takes place. Also, you need to post this as a new question; it is not an answer to the original tutorial.

ADD REPLYlink written 14 months ago by Istvan Albert ♦♦ 71k

I did echo $srr and it's correct, but the loop only works once. Also, the prefetch/validate/fastq-dump chain works by itself when provided srr instead of sra in the download_list. I don't think this is novel enough to start a new thread; it's an extension of your answer 19 months ago.

ADD REPLYlink written 14 months ago by Biojl1.5k

What I mean is to echo everything, not just that one variable, and verify that the commands appear as they should. The results of echoing the commands could then be executed via bash.

bash fancy_script_with_loops.sh > simple_commands.sh
bash  simple_commands.sh

What I am saying is that there is no such thing as a command that works fine separately but not in the script. Most likely, what the script generates is not what you think it is. Hence the solution is to make the script write out the commands instead of executing them, so you can see what it really tries to do.
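
One way to follow this advice is a small wrapper that either runs a command or prints it. A minimal sketch; the run function and the DRY_RUN variable are names invented here, not part of any toolkit:

```shell
# Hypothetical wrapper: with DRY_RUN=1 print the command, otherwise run it.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        # Note: quoting of the original arguments is not preserved in the echo.
        echo "$*"
    else
        "$@"
    fi
}
```

With every command in the script prefixed by run, calling DRY_RUN=1 bash fancy_script_with_loops.sh > simple_commands.sh writes the commands out for inspection instead of executing them.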

ADD REPLYlink modified 14 months ago • written 14 months ago by Istvan Albert ♦♦ 71k
al-ash (European Union) wrote 4 months ago:

Note that it might be necessary to convert the *.sra to *.fastq using some special parameters supplied to fastq-dump to make the output suitable for Trinity assembly (to take care of the read names), i.e. for paired reads:

SRA_TOOLKIT/fastq-dump --defline-seq '@$sn[_$rn]/$ri' --split-files file.sra

see https://github.com/trinityrnaseq/trinityrnaseq/wiki/Trinity-FAQ#ques_sra_fq_conversion
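
To check the result, the read names in each output file can be tested for the /1 or /2 suffix that the defline above is meant to produce. A sketch, assuming four-line FASTQ records; check_mate_suffix is a hypothetical helper:

```shell
# Hypothetical helper: succeed only if the first field of every header
# line in a FASTQ stream on stdin ends in /1 or /2.
check_mate_suffix() {
    awk 'NR % 4 == 1 && $1 !~ /\/[12]$/ { bad++ } END { exit (bad > 0) }'
}
```

For example: check_mate_suffix < file_1.fastq && echo "names look fine"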

ADD COMMENTlink written 4 months ago by al-ash0