Question

How to automate the conversion of a large number of .sra files to .fastq?

5.7 years ago

Jeff M • 0

Hello,

Sorry if this is a rather basic question, but I'm completely new to the field of bioinformatics and have essentially no coding experience. I've found some similar questions asked previously, but the solutions provided either don't seem to work (possibly because of SRA Toolkit updates?) or are written to be run in a Unix environment (Bash, I think), while I'm trying to work in Windows. I would appreciate any advice anyone could provide on my issue.

I'm trying to download a rather large RNA-seq dataset (GSE62772) for reanalysis: I want to download the FASTQ files, align them with kallisto, and analyze them for differential expression. I know how to download and convert individual runs using fastq-dump, but I can't quite figure out how to run this for a large number of samples in an automated manner. A previous answer:

fastq-dump --split-3 --gzip $(<SraAccList.txt)

doesn't seem to work for me - giving an error that it wasn't able to recognize the input. I was able to use the accession list to download .sra files using:

prefetch --option-file SraAccList.txt

However, at this point I have no idea how to convert these to fastq besides individually. I've seen some answers on here e.g.

cat SRR_list.txt | xargs -n 1 bash get_SRR_data.sh

but from what I can tell this is meant to be run in a Unix environment, whereas so far I've been running everything through the Windows command prompt. Is there any way to run this process in a Windows environment? I've also seen other resources suggesting downloading and converting the files through R, but I think I'd end up with the same issue, where I would need to run kallisto manually on 166 files, which doesn't seem reasonable. Given that the accession numbers are all sequential, it should be possible to run a for loop, but I'm not familiar enough with any language (having only worked in MATLAB before) to know the best way of doing this.

Does anyone have any suggestions on the best (and least involved) way of converting and analyzing a large number of files? This isn't something I intend to do regularly, so I've been trying to find quick solutions (ones that don't involve learning a new language or environment), but I'm starting to think that might not be possible. Any help would be greatly appreciated!

RNA-Seq • 1.5k views

If you're going to do any real bioinformatics on Windows, you need to acquaint yourself with the Windows Subsystem for Linux (WSL). It will make your life much easier.
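
For instance, the `$(<SraAccList.txt)` syntax quoted in the question is Bash command substitution, which the Windows command prompt does not interpret. Inside WSL (or any other Bash shell), the conversion loop the question asks about could be sketched roughly as follows. This is a minimal sketch, assuming the SRA Toolkit is installed in that shell, that `SraAccList.txt` holds one run accession per line, and that it is run from the directory where `prefetch` stored the runs. The function name `convert_all` and the `FASTQ_DUMP` override variable are hypothetical names introduced here, not part of the toolkit.

```shell
#!/bin/sh
# Convert every run listed in an accession file (one accession per line).
# FASTQ_DUMP is a hypothetical override used only in this sketch; by default
# the real fastq-dump executable from the SRA Toolkit is called.
convert_all() {
    list_file=$1
    tool=${FASTQ_DUMP:-fastq-dump}
    # Strip Windows carriage returns, then read the list line by line.
    tr -d '\r' < "$list_file" | while IFS= read -r acc || [ -n "$acc" ]; do
        # Skip blank lines.
        [ -z "$acc" ] && continue
        # Same flags as in the question; stdin is redirected so the tool
        # cannot consume the rest of the accession list.
        "$tool" --split-3 --gzip "$acc" < /dev/null || echo "failed: $acc" >&2
    done
}

# Usage:
#   convert_all SraAccList.txt
```

Runs that fail are reported on standard error and the loop carries on, so one bad accession does not stop the remaining conversions.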

5.7 years ago by jared.andrews07 ★ 19k

score 3 · Answer 1 · 2020-03-01

5.7 years ago

Kevin Blighe 90k

The FASTQ files are available directly from here: https://www.ebi.ac.uk/ena/data/view/PRJNA265099

I found it by searching, at ENA, for the BioProject ID listed on the GEO accession record.
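
If clicking through the page is inconvenient, the same links can probably be pulled as a table and turned into a download list. The sketch below assumes the ENA Portal API `filereport` endpoint returns a tab-separated table with a header row and the columns `run_accession` and `fastq_ftp`, where `fastq_ftp` holds semicolon-separated paths without the `ftp://` prefix. Please check this against the current ENA documentation before relying on it. The function name `fastq_urls` is hypothetical.

```shell
#!/bin/sh
# Read an ENA file report (TSV with a header row; column 1 assumed to be
# run_accession, column 2 assumed to be fastq_ftp) on standard input and
# print one download URL per line.
fastq_urls() {
    awk -F '\t' 'NR > 1 && $2 != "" {
        # Paired-end runs list two files separated by ";".
        n = split($2, parts, ";")
        for (i = 1; i <= n; i++) print "ftp://" parts[i]
    }'
}

# Usage (needs network access and curl; the URL is an assumption, see above):
#   curl -s "https://www.ebi.ac.uk/ena/portal/api/filereport?accession=PRJNA265099&result=read_run&fields=run_accession,fastq_ftp&format=tsv" \
#       | fastq_urls > urls.txt
#   xargs -n 1 curl -O < urls.txt
```

Downloading the FASTQ files this way skips the .sra conversion step entirely.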

Kevin

The "Bulk download" Java tool button on the page @Kevin linked can be used to download the files in bulk.

Another option would be to use sra-explorer from Phil Ewels to get download links for all files in bulk (sra-explorer: find SRA and FastQ download URLs in a couple of clicks). Search for PRJNA265099.

5.7 years ago by GenoMax 154k