8.1 years ago

Suppose you want to download some raw sequence data in fastq format from GEO/SRA and run it through an appropriate aligner (BWA, TopHat, STAR, etc.) and then a variant caller (Strelka, etc.) or other analysis pipeline. How do you get started? First things first: you need the sequence data.

I will use the data released along with the following publication as an example: Daemen A, Griffith OL et al. 2013. Modeling precision treatment of breast cancer. Genome Biology. 14:R110.

Data were deposited at GEO/SRA and are accessible through the GEO data set super-series for GSE48216 which is comprised of a sub-series for RNA-seq at GSE48213 and Exome-seq at GSE48215. From there you can link to the relevant SRA projects for RNA-seq at SRP026537 and Exome-seq at SRP026538.

For example, to get fastq files for the T47D exome cell line data you could do something like the following:

Find the appropriate GEO record for T47D from the GEO data set sub-series page for GSE48215 listed above. http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1173000

Under 'Relations' is a link to the corresponding SRA page: http://www.ncbi.nlm.nih.gov/sra?term=SRX317818

Note: You can also find this SRX record page directly from the SRA project page for SRP026538 listed above.

Determine the SRR number and then download the data at the command-line with:

prefetch -v SRR925811


Note where the sra file is downloaded (by default to /home/[USER]/ncbi/public/sra/.) and then convert to fastq with something like the following.

fastq-dump --outdir /opt/fastq/ --split-files /home/[USER]/ncbi/public/sra/SRR925811.sra


This should produce two fastq files (one for R1 and one for R2). That will give you the raw exome sequence data for the T47D cell line. A very similar process should work for any RNAseq samples that you want.
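A minimal sketch for checking what the conversion actually wrote. It assumes the <run>_1.fastq / <run>_2.fastq naming that --split-files produces in the examples in this thread; the function name check_split_outputs is made up for illustration, and the "paired"/"single" label is only a guess based on which files exist.

```shell
# Report which split FASTQ files exist for a run and guess the layout.
# Assumption: outputs are named <run>_1.fastq and <run>_2.fastq.
check_split_outputs() (
    outdir=$1
    run=$2
    r1="$outdir/${run}_1.fastq"
    r2="$outdir/${run}_2.fastq"
    if [ -s "$r1" ] && [ -s "$r2" ]; then
        echo "paired"
    elif [ -s "$r1" ]; then
        echo "single"
    else
        # Neither file is present (or both are empty).
        echo "missing"
        exit 1
    fi
)
```

For the example above this would be called as: check_split_outputs /opt/fastq SRR925811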

If you want to start with sam/bam files you can use sam-dump instead of fastq-dump. But note that these will still just contain the unaligned raw sequence data. You will still need to run through an aligner and variant caller. http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc&f=sam-dump

If you just want to write the first X raw (fastq) reads from a particular run to standard output, you can use a command like the following. This is useful for taking a quick look at some reads, obtaining a few reads for testing, or checking whether the SRA toolkit is working for you at all.

fastq-dump -X 5 -Z SRR925811
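As a quick sanity check on such a stream, the records can be counted. This is only a sketch: it assumes each FASTQ record occupies exactly four lines (unwrapped sequence and quality), and count_fastq_reads is a hypothetical helper name.

```shell
# Count FASTQ records on standard input.
# Assumption: four lines per record, every line newline-terminated.
count_fastq_reads() (
    lines=$(wc -l | tr -d ' ')
    if [ $((lines % 4)) -ne 0 ]; then
        echo "incomplete FASTQ record" >&2
        exit 1
    fi
    echo $((lines / 4))
)
```

Usage: fastq-dump -X 5 -Z SRR925811 | count_fastq_reads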

1
Entering edit mode

Whoa! How did I not know about prefetch - that is super handy. Thanks for this tutorial.

0
Entering edit mode

I'm guessing you need to use the vdb-config option: http://www.ncbi.nlm.nih.gov/Traces/sra/?view=toolkit_doc&f=vdb-config

You may need root privileges. It may also be possible to create a symlink at the current path that points to a new path.

0
Entering edit mode

Hi, how do we supply dbGaP credentials? I tried:

prefetch -v SRR617345


but I'm getting this error.

err: query unauthorized while resolving tree within virtual file system module - failed to resolve accession 'SRR617345' - Access denied - please request permission to access phs000468/PCR in dbGaP ( 403 )


thanks.

0
Entering edit mode

Without details on what problems you have it is impossible to help.

0
Entering edit mode

Hi

I am new to the SRA toolkit and I am having some issues. Can you please help me?

I have downloaded the SRA toolkit on my laptop and I am in the bin directory. I have configured the toolkit to download the files to a server at ./Volumes/..../...../..../..../ncbi/public.

I have used

./prefetch SRR6294675


and the corresponding SRA file is in the public/sra folder on the server (as expected).

Now here is the problem:

I have used:

/fastq-dump -–outdir ./Volumes/..../...../..../..../ncbi/public/opt/fastq –-split-files  ./Volumes/..../...../..../..../ncbi/public/sra/SRR6294675.sra


and I got this error: badly formed UTF-8 character

I also tried without specifying the outdir

./fastq-dump –-split-files ./Volumes/..../...../..../..../ncbi/public/sra/SRR6294675.sra


and I got:

error unexpected while resolving query within virtual file system module - No accession to process ( 500 )
Failed to call external services.


What can be the problem/s?

Thanks
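One thing worth checking, judging only from the commands as quoted: the dashes in front of outdir and split-files look like typographic en dashes rather than plain ASCII hyphens, which often happens when a command is copied from a web page or word processor. Whether that explains both errors is an assumption. A small sketch for counting non-ASCII bytes in a command before running it (count_non_ascii_bytes is a made-up helper name):

```shell
# Print the number of non-ASCII bytes in a string.
# A non-zero count suggests typographic dashes, curly quotes, or similar
# characters that a command-line tool will not treat as "-" or quotes.
count_non_ascii_bytes() {
    # Delete every ASCII byte (octal 000-177) and count what is left.
    printf '%s' "$1" | LC_ALL=C tr -d '\000-\177' | wc -c | tr -d ' '
}
```

Paste the command in single quotes as the only argument; a result of 0 means it contains only ASCII characters.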

0
Entering edit mode

I tried to use the SRA toolkit a few months ago. I successfully installed it and used prefetch to get the data, as per their tutorial, but when I wanted to use fastq-dump, I got an error saying "command not found", which was super puzzling. So that's where I ended 😕️ I still don't know what happened, but I may give it another shot.
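A "command not found" message generally means the shell could not find the executable on PATH, for example when the toolkit's bin directory was never added to PATH and the command is typed without a leading ./ or full path. Whether that was the cause here is a guess. A minimal sketch for checking up front which tools are visible (require_tools is a hypothetical helper name):

```shell
# Report any of the named commands that cannot be found on PATH.
# Returns non-zero if at least one is missing.
require_tools() (
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" > /dev/null 2>&1; then
            echo "not on PATH: $tool"
            missing=1
        fi
    done
    exit "$missing"
)
```

Usage: require_tools prefetch fastq-dump esearch efetch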

0
Entering edit mode

What if I want to use fastq-dump to get the final 5 reads of this data??
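One rough way, sketched under the assumption of four lines per FASTQ record, is to stream the run and keep only the tail. This dumps the whole run just to keep a few reads, so it is slow for large runs; fastq-dump also has -N/--minSpotId alongside -X/--maxSpotId for selecting a spot range, which would be cheaper if the total spot count is already known. The helper name last_fastq_reads is made up.

```shell
# Keep only the last N FASTQ records from standard input.
# Assumption: four lines per record.
last_fastq_reads() {
    tail -n "$(( $1 * 4 ))"
}
```

Usage: fastq-dump -Z SRR925811 | last_fastq_reads 5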

49
Entering edit mode
8.1 years ago

A few more tips here

I recently needed the same functionality and came up with a one-liner that gets all the data from a BioProject. It requires Entrez Direct (see "NCBI Releases Entrez Direct, The Entrez Utilities On The Unix Command Line") and the SRA toolkit, although the former package could easily be replaced with simple wget commands.

The command below downloads only the first 5 datasets and only 10 spots from each as a demo; removing those limits gets all 216 files and hundreds of millions of spots:

esearch -db sra -query PRJNA40075  | efetch --format runinfo | cut -d ',' -f 1 | grep SRR | head -5 | xargs fastq-dump -X 10 --split-files


More adventurous people might want to pipe the output into GNU parallel instead of xargs, thus fully saturating their bandwidth.
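The one-liner picks the run accession by column position and filters with grep. A slightly more defensive variant, sketched here, selects a column by its header name instead. It assumes the runinfo output is plain CSV with a header row (possibly repeated) and no quoted fields containing commas; runinfo_column is a hypothetical helper name.

```shell
# Print one named column from runinfo CSV read on standard input.
# Assumptions: first line is the header; no commas inside quoted fields.
runinfo_column() {
    awk -F ',' -v name="$1" '
        # Locate the requested column in the header row.
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == name) col = i; next }
        # Skip blank lines and repeated header rows.
        col && $col != "" && $col != name { print $col }
    '
}
```

Usage: esearch -db sra -query PRJNA40075 | efetch --format runinfo | runinfo_column Run | head -5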

2
Entering edit mode

This line works really well, thank you!

A note on parallel: you can just add the -n 1 -P $nCores arguments to xargs. For example, I used:

esearch -db sra -query PRJNA515945 | efetch --format runinfo | cut -d ',' -f 1 | grep SRR | xargs -n 1 -P 12 fastq-dump --split-files --gzip --skip-technical

0
Entering edit mode

Nice! Thanks for adding this.

0
Entering edit mode

Where can I get efetch? I get an error using this command. Please help me with this.

Regards, Bandana

0
Entering edit mode

I am very new to this field. I got esearch, but when I run

esearch -db sra -query PRJNA281410 | efetch --format runinfo | cut -d ',' -f 1 | grep SRR | head -5 | xargs ~/bin/sratoolkit.2.7.0-centos_linux64/bin/fastq-dump -X 10 --split-files

it gives me an error like:

param empty while validating argument list - expected accession

PRJNA281410 is the project ID I am using. Please help me with this.

0
Entering edit mode

I am trying to get fastq files from project "SRP074107" but am getting this error. I would highly appreciate your help with this. Please check the attached screenshot for more details.

Screenshot: https://ibb.co/nfyfXb

1
Entering edit mode

You need to install esearch, which you can do using bioconda.

0
Entering edit mode

There are 6500+ samples in that study. You want to get them all?

0
Entering edit mode

Many thanks for your response. Yes, I need all these samples.

0
Entering edit mode

I am trying to analyze data in project SRP114962 and am facing quite a few issues with downloading it in the correct format. For example, https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR5969329 is supposed to be 76bp single-end sequencing, but I see 152bp (with 76bp for each read). At the same time, SRR5959411 is supposed to be PE (paired-end), but I see and am able to download only single-end data. Is there a way to download the data in the exact format in which it was deposited (like the original FQS)?
0
Entering edit mode

There are 6629 samples in this project. Are you sure you want to download the data for all of them? You may be able to click on one of the samples and then use the Bulk Download Files button that you will find on the page to get all data. This data appears to be single-end. SRR5959411 is also single-end.

0
Entering edit mode

Use fastq-dump to get your SRA data; I don't trust mirrors that much, actually...

fastq-dump --split-files -X 1000 SRR5959411

gives me one file, whereas this gives two files:

fastq-dump --split-files -X 1000 SRR5969329

The sizes are right, as expected: paired and 76bp.

seqkit stat *.fastq

like so:

SRR5959411_1.fastq FASTQ DNA 1,000 76,000 76 76 76
SRR5969329_1.fastq FASTQ DNA 1,000 76,000 76 76 76
SRR5969329_2.fastq FASTQ DNA 1,000 76,000 76 76 76

It is true that the annotations are wrong:

esearch -db sra -query SRR5959411 | efetch --format runinfo | cut -d , -f 16

says:

LibraryLayout
PAIRED

but it does not seem so.

7
Entering edit mode
6.7 years ago
Shicheng Guo ★ 9.2k

How do you change the default folder that prefetch downloads to? Usually the home directory is not big enough for a large number of SRA files:

/home/[USER]/ncbi/public/sra/

Please make sure your sra-tools is always the latest version, or else some strange errors may come up!

Eventually, I found that we can reset the cache directory with the following steps:

1. Go to the bin directory, such as: /home/sratoolkit.2.5.5-centos_linux64/bin
2. Run ./vdb-config -i
3. Change the folder of Workspace Name to a big hard disk.

Re-run fastq-dump, and the cache files will go to the newly set directory.

Remember: there will be large files in /home/shg047/oasis/ncbi/public/sra/sra. Be sure to remove them periodically.

0
Entering edit mode

I also encountered this problem.
0
Entering edit mode

Daniel Standage's solution to this problem is described at https://standage.github.io/that-darn-cache-configuring-the-sra-toolkit.html

This worked for me:

mkdir -p ~/.ncbi
echo '/repository/user/main/public/root = "/scratch/standage/sra-cache"' > ~/.ncbi/user-settings.mkfg

This is a rather old post I am responding to, but I am posting the solution here simply for record-keeping of viable solutions...

3
Entering edit mode
7.7 years ago
Tulip Nandu ▴ 90

One option is to download the fastq files directly and then follow the pipeline. On some exome sequencing data, fastq-dump doesn't work as efficiently or gives errors because the deposited data is in bam format and requires a genome file. From the site below we can directly download the fastq files for all sequencing experiments in GEO.

ftp://ftp.sra.ebi.ac.uk/vol1/fastq/

3
Entering edit mode
6.5 years ago
Biojl ★ 1.7k

Hi, I modified the code a bit so I can download several experiments from a download_list.txt (sample+'\t'+SRXcode). Unfortunately it only works for the first item in the list. It's something in the loop, because the commands work fine separately. Any ideas? Thanks.

# -*- coding: utf-8 -*-
while read p1 p2; do
#Get SRR
srr="$(esearch -db sra -query $p2 | efetch -format runinfo | cut -d ',' -f 1 | grep SRR)"
prefetch $srr && vdb-validate $srr && fastq-dump --split-files $srr
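A common reason for a while-read loop to stop after its first iteration is that a command inside the loop reads from standard input and swallows the rest of the list. Whether that is what happens with the loop above is an assumption, but the effect is easy to reproduce. In this sketch greedy_tool is a made-up stand-in that simply drains stdin; redirecting its input from /dev/null is the usual workaround.

```shell
# Stand-in for a tool that reads standard input (hypothetical: it just
# drains stdin, as a stdin-reading command inside the loop would).
greedy_tool() { cat > /dev/null; }

# The inner command shares the loop's stdin and swallows the remaining
# lines, so only the first line of the list is processed.
loop_shared_stdin() {
    while read -r sample srx; do
        greedy_tool
        echo "$srx"
    done
}

# Same loop, but the inner command reads from /dev/null instead,
# leaving the list intact for the next "read".
loop_isolated_stdin() {
    while read -r sample srx; do
        greedy_tool < /dev/null
        echo "$srx"
    done
}
```

Feeding a three-line, tab-separated list to loop_shared_stdin prints one SRX code, while loop_isolated_stdin prints all three.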

0
Entering edit mode

When building any type of script, make sure to debug it first by putting an echo in front of each command. That way you will see what actually takes place. Also, you need to post this as a new question; it is not an answer to the original tutorial.
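A minimal sketch of that echo-first approach: route every command through a small wrapper that either prints it or runs it. The names run and DRY_RUN are made up for illustration.

```shell
# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}
```

In a script, write "run prefetch $srr" instead of "prefetch $srr", then call the script with DRY_RUN=1 to see exactly which commands it would execute.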

0
Entering edit mode

I did echo $srr and it's correct, but the loop only works once. Also, prefetch/validate/fastq-dump works by itself when provided SRR instead of SRA in the download_list. I don't think this is novel enough to start a new thread; it's an extension of your answer from 19 months ago.

1
Entering edit mode

What I mean is to echo everything, not just that one variable, and verify that the commands appear as they should. The results of echoing the commands could then be executed via bash:

bash fancy_script_with_loops.sh > simple_commands.sh
bash simple_commands.sh

What I am saying is that there is no such thing as commands that work fine separately but not in the script. Most likely, what the script generates is not the same thing that you think it is. Hence the solution is to make the script not execute the commands but write them all out, so you can really see what it tries to do.

3
Entering edit mode
5.8 years ago
Payal ▴ 140

Addition: (Further Downstream RNA-Seq Analysis using GALAXY)

In case anybody wants to get files from SRA directly into GALAXY for further analysis, please check out this other video too: "RNA-Seq Alignment and Visualization using Galaxy and IGB". It explains how to get data from SRA directly into GALAXY using tools in GALAXY itself. I found this one useful, so I thought I would share it.

2
Entering edit mode
6.6 years ago
macmath ▴ 160

When I ran this:

fastq-dump --outdir ~/Documents/USER/Re-analysisGPS2/fastq/ --split-files /home/USER/ncbi/public/sra/SRR1291261.sra

I did not receive 2 fastq files but only one, for example SRR1291261_1.fastq. Could you suggest what the reason might be? Is it because it's not paired-end?

3
Entering edit mode

You got a single file because this is a single-end data set (see here as well: http://www.ebi.ac.uk/ena/data/view/SRR1291261 ).
2
Entering edit mode
5.6 years ago
al-ash ▴ 190

Note that it might be necessary to convert the *.sra to *.fastq using some special parameters supplied to fastq-dump to make it suitable for Trinity assembly (to take care of the read names), i.e. for paired reads:

SRA_TOOLKIT/fastq-dump --defline-seq '@$sn[_$rn]/$ri' --split-files file.sra

1
Entering edit mode
3.8 years ago
Renesh ★ 2.1k

How to use NCBI SRA toolkit effectively: Read this post https://reneshbedre.github.io/blog/fqutil.html

0
Entering edit mode

Speaking of efficiency, you might consider including the use of Aspera with prefetch and parallel-fastq-dump in your blog.

1
Entering edit mode

Thanks for the suggestions. The latest release of the NCBI SRA toolkit contains fasterq-dump, which provides a multithreading option for faster download.

0
Entering edit mode

True, but no gzip compression for split files :D at least not in the initial release