Suppose you want to download some raw sequence data in fastq format from GEO/SRA and run it through an appropriate aligner (BWA, TopHat, STAR, etc.) and then a variant caller (Strelka, etc.) or other analysis pipeline. How do you get started? First things first: you need the sequence data.
I will use the data released along with the following publication as an example: Daemen A, Griffith OL et al. 2013. Modeling precision treatment of breast cancer. Genome Biology. 14:R110.
Data were deposited at GEO/SRA and are accessible through the GEO data set super-series GSE48216, which is composed of a sub-series for RNA-seq at GSE48213 and one for Exome-seq at GSE48215. From there you can link to the relevant SRA projects for RNA-seq at SRP026537 and Exome-seq at SRP026538.
You can download the raw data using the SRA toolkit. Please read:
- http://www.ncbi.nlm.nih.gov/books/NBK47540/
- http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc
For example, to get fastq files for the T47D exome cell line data you could do something like the following:
Find the appropriate GEO record for T47D from the GEO data set sub-series page for GSE48215 listed above. http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1173000
Under 'Relations' is a link to the corresponding SRA page: http://www.ncbi.nlm.nih.gov/sra?term=SRX317818
Note: You can also find this SRX record page directly from the SRA project page for SRP026538 listed above.
Determine the SRR number and then download the data at the command-line with:
prefetch -v SRR925811
Note where the sra file is downloaded (by default to /home/[USER]/ncbi/public/sra/) and then convert to fastq with something like the following:
fastq-dump --outdir /opt/fastq/ --split-files /home/[USER]/ncbi/public/sra/SRR925811.sra
This should produce two fastq files (one for R1 and one for R2). That will give you the raw exome sequence data for the T47D cell line. A very similar process should work for any RNAseq samples that you want.
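The two steps above can be wrapped in a small script when several runs are needed. The following is a minimal sketch and not part of the original tutorial: the names `run_cmd`, `fetch_run`, `SRA_DIR`, `OUT_DIR` and `DRY_RUN` are hypothetical, and the download location is assumed to be the default noted above. By default it only prints the commands so they can be reviewed before anything is downloaded.

```shell
#!/usr/bin/env bash
# Sketch: prefetch a run, then convert it to paired fastq files.
# SRA_DIR, OUT_DIR and DRY_RUN are illustrative names, not toolkit settings.
SRA_DIR="${SRA_DIR:-$HOME/ncbi/public/sra}"   # assumed default prefetch location
OUT_DIR="${OUT_DIR:-/opt/fastq}"              # where the fastq files should go
DRY_RUN="${DRY_RUN:-1}"                       # 1 = only print the commands

run_cmd() {
    # Print the command in dry-run mode, otherwise execute it.
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

fetch_run() {
    # Usage: fetch_run SRR_ACCESSION
    local srr
    srr="$1"
    run_cmd prefetch -v "$srr" || return 1
    run_cmd fastq-dump --outdir "$OUT_DIR" --split-files "$SRA_DIR/$srr.sra"
}

# Example: review the commands for one run, then repeat with DRY_RUN=0.
# fetch_run SRR925811
```

If your copy of the toolkit stores prefetched files somewhere else, set `SRA_DIR` to that location first.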
If you want to start with sam/bam files you can use sam-dump instead of fastq-dump. But note that these will still just contain the unaligned raw sequence data. You will still need to run through an aligner and variant caller. http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc&f=sam-dump
If you just want to download X number of raw (fastq) reads to standard output from a particular run you can use a command like the following. This can be useful to just take a quick look at some reads, or obtain some reads for testing purposes or just check whether the SRA toolkit is even working for you.
fastq-dump -X 5 -Z SRR925811
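To confirm that a quick test like this returned what was asked for, the records written to standard output can be counted. This is a small sketch with a hypothetical helper name, `count_fastq_records`; it assumes plain four-line fastq records.

```shell
count_fastq_records() {
    # Reads fastq from standard input and prints the number of records,
    # assuming exactly four lines per record.
    awk 'END { print int(NR / 4) }'
}

# Example (needs the SRA toolkit and network access):
# fastq-dump -X 5 -Z SRR925811 | count_fastq_records
```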
Whoa! How did I not know about prefetch? That is super handy. Thanks for this tutorial.
I'm guessing you need to use the vdb-config option: http://www.ncbi.nlm.nih.gov/Traces/sra/?view=toolkit_doc&f=vdb-config
You may need root privileges. It may also be possible to create a symlink at the current path that points to a new path.
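The symlink idea could look something like the following. This is only a sketch under the assumption that the goal is to keep the downloaded files on a different disk while the toolkit keeps writing to its usual path; the function name `relocate_with_symlink` and the example paths are hypothetical.

```shell
relocate_with_symlink() {
    # Usage: relocate_with_symlink CURRENT_DIR NEW_DIR
    # Moves CURRENT_DIR to NEW_DIR and leaves a symlink at the old path.
    local current
    local new
    current="$1"
    new="$2"
    if [ -e "$new" ]; then
        echo "refusing to overwrite $new" >&2
        return 1
    fi
    mv "$current" "$new" && ln -s "$new" "$current"
}

# Example (placeholder paths):
# relocate_with_symlink "$HOME/ncbi" /path/to/bigger/disk/ncbi
```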
Hi, how do we put in dbGaP credentials? I tried:
but I'm getting this error.
thanks.
How do I download SRA data, for example SRP134715? Is there a problem with it?
Download the fastq files directly from EBI-ENA here.
Without details on what problems you have it is impossible to help.
Hi
I am new to SRA toolkit and I am having some issue. Can you please help me?
I have downloaded the SRA toolkit on my laptop and I am in the bin directory. I have configured the toolkit to download the files to a server:
./Volumes/..../...../..../..../ncbi/public
I have used
and the corresponding SRA file is in the public/sra folder on the server (as expected).
Now here the problem:
I have used:
and I got this error: badly formed UTF-8 character
I also tried without specifying the outdir
and I got:
What can be the problem/s?
Thanks
I tried to use the SRA toolkit a few months ago. I successfully installed it and used prefetch to get the data, as per their tutorial, but when I wanted to use fastq-dump, I got an error saying "command not found", which was super-puzzling. So that's where I ended 😕️ Still don't know what happened, but I may give it another shot.
What if I want to use fastq-dump to get the final 5 reads of this data?
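One possible approach to getting the final reads, as a sketch: fastq-dump accepts a minimum spot id (-N) as well as a maximum (-X), so the last few spots can be requested by id if the total number of spots in the run is known. The total is an assumption here and has to be looked up first, for example on the run's SRA page. The helper name `last_spot_range` is hypothetical, and the total of 1000 in the example is made up.

```shell
last_spot_range() {
    # Usage: last_spot_range TOTAL_SPOTS N
    # Prints "FIRST LAST", the spot ids covering the final N spots.
    local total
    local n
    local first
    total="$1"
    n="$2"
    first=$(( total - n + 1 ))
    if [ "$first" -lt 1 ]; then
        first=1    # fewer than N spots in the run: start at the first one
    fi
    echo "$first $total"
}

# Example with a made-up total of 1000 spots:
# set -- $(last_spot_range 1000 5)
# fastq-dump -N "$1" -X "$2" -Z SRR925811
```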