Question

Looking for ways to download ChIP-seq datasets programmatically in a pipeline

1

8.3 years ago

rioualen ▴ 710

Hello,

I am developing customized pipelines for ChIP-seq analysis using Snakemake. I want to share them, so I created model workflows that people can execute immediately after downloading the code. They handle file conversion, mapping, peak-calling, etc., and use public data from the GEO database. However, users currently have to download these data themselves. I would like to include an automatic download of the data (sra or fastq files), ideally by using GSM/GSE or SRR identifiers.

So far I've found several ways:

* SRA toolkit's fastq-dump function.

fastq-dump --outdir <outdir> <srr_ids>

However, this is insanely slow (as stated here).

* SRAdb R package

getSRAfile( in_acc = "<srr_ids>", sra_con = sra_con, destDir = <dir>, fileType = 'sra' )

This requires using this command first:

geometadbfile <- getSRAdbFile(destdir = <dir>, destfile = "SRAmetadb.sqlite.gz")

which downloads a 16 GB SQLite file locally. That could be fine if I were the only one running it, but I don't want to force users of my pipeline to do so...

* Biopython's Bio.Geo module

Not sure how this one works... http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc123

Entrez.esearch doesn't help me find the FTP URL or anything similar.

I think there should be a simpler way to download the data?

Any idea will be greatly appreciated!

GEO ChIP-Seq next-gen • 3.6k views

ADD COMMENT • link updated 8.3 years ago by matted 7.8k • written 8.3 years ago by rioualen ▴ 710

0

This is not related to the main question, since my experience is limited and wouldn't be very useful. But I am interested in testing these pipelines of yours if they are publicly available!

ADD REPLY • link 8.3 years ago by Sam ▴ 100

0

Hi, thanks for your interest. My code is available on GitHub. Please note that it's under development, and there's still a lot to do! I'm also developing a virtual machine, in order to simplify the distribution.

ADD REPLY • link 8.3 years ago by rioualen ▴ 710

0

I sympathise with your frustration. Getting data and metadata from GEO programmatically doesn't seem to be straightforward.

ADD REPLY • link 8.3 years ago by dariober 14k

2

Just to be clear, GEO and SRA are two totally separate databases and GEO does not host sequencing data at all.

ADD REPLY • link 8.3 years ago by Sean Davis 26k

0

True; the main use of GEO would be finding common experimental data sets (would have to use elink to grab the SRA information).

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 8.3 years ago by Chris Fields ★ 2.2k

1

8.3 years ago

Chris Fields ★ 2.2k

I don't think you will find a fast way to generate FASTQ test data sets directly from SRA; you're fighting against three things: extracting the data from the archive, decompressing it (I believe this happens on the client end, not at NCBI), and competing for network bandwidth. We used this method ourselves for a large RNA-Seq data retrieval and basically left the process running for a week or two to grab it all. We did a few more and found we could speed things up a bit by using the Aspera interface and running SRA tools locally on the downloaded .sra file, but never looked into it further past that initial data set (i.e. 'it worked for us even though it sucked').

To give an idea on some prior art in the same area, there is this paper: http://www.g3journal.org/content/4/2/209.long

They specifically mention using GEO for this for about 900 ChIP-Seq data sets. Might be worth contacting them to find out whether they ended up developing a pipeline for it.

ADD COMMENT • link updated 4.3 years ago by Ram 43k • written 8.3 years ago by Chris Fields ★ 2.2k

1

8.3 years ago

matted 7.8k

Last I checked, the EBI ENA has the data in fastq format and should (in theory) be a mirror of the NCBI SRA.

If you look at a given study there (for example this one), there's a "TEXT" link that will give you a tab-separated file where one column is full FTP download URLs (of the fastqs).

I haven't investigated whether there's an API or a clean way to go through this, or whether you'd just need to do it yourself in a hacky way. But it might save you time if you get to avoid the slow fastq-dump step, particularly if you're already in Europe (though you lose Aspera, which may or may not tip the balance).
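For what it's worth, here is a minimal Python sketch of the "hacky" route: parsing such a tab-separated report to pull out the FASTQ URLs per run. The column names `run_accession` and `fastq_ftp`, the semicolon separator for paired-end files, and the step that prepends `ftp://` when no scheme is present are all assumptions about the report layout, so check them against a real file. The accessions and paths in the example are made up.

```python
import csv
import io


def fastq_urls_from_report(report_text):
    """Map each run accession to its list of FASTQ URLs.

    Assumes a tab-separated report with a header row containing
    'run_accession' and 'fastq_ftp' columns (assumed names), where
    paired-end files are separated by ';'.
    """
    reader = csv.DictReader(io.StringIO(report_text), delimiter="\t")
    urls = {}
    for row in reader:
        field = (row.get("fastq_ftp") or "").strip()
        if not field:
            continue  # run without FASTQ files listed
        urls[row["run_accession"]] = [
            u if "://" in u else "ftp://" + u  # add scheme if missing (assumption)
            for u in field.split(";")
            if u
        ]
    return urls


# Made-up report for illustration only (not real accessions or paths).
example_report = (
    "run_accession\tfastq_ftp\n"
    "RUN0001\tftp.example.org/fastq/RUN0001_1.fastq.gz;"
    "ftp.example.org/fastq/RUN0001_2.fastq.gz\n"
    "RUN0002\tftp.example.org/fastq/RUN0002.fastq.gz\n"
)
example_urls = fastq_urls_from_report(example_report)
```

The resulting dictionary could then feed a download step (wget, curl, or a Snakemake rule) with one entry per run.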

ADD COMMENT • link updated 4.3 years ago by Ram 43k • written 8.3 years ago by matted 7.8k

0

Thanks, this is very helpful! I don't know the ENA very well yet, I'll look into it.

ADD REPLY • link 8.3 years ago by rioualen ▴ 710

Ram · Accepted Answer · 2016-01-08

2

8.3 years ago

rioualen ▴ 710

I just found out here that I can download the sra files with SRA toolkit, using the prefetch command:

prefetch <SRR ID>

It's quite fast; the only issue is that there seems to be no output directory option. It's worth mentioning that the data is automatically downloaded to /home/<USER>/ncbi/public/sra/<SRR ID>.sra (not mentioned in the doc!)

I can then run fastq-dump to get fastq files:

fastq-dump --outdir <output directory> /home/<USER>/ncbi/public/sra/<SRR ID>.sra

Surprisingly, this looks a lot faster than doing both steps at once with fastq-dump...
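For use inside a pipeline, the two steps could be wrapped roughly like this in Python. This is only a sketch: the function names are hypothetical, and it assumes that `prefetch` and `fastq-dump` are on the PATH and that prefetch writes to the default location described above (~/ncbi/public/sra), which may differ depending on the toolkit version and configuration.

```python
import subprocess
from pathlib import Path


def sra_commands(srr_id, outdir, sra_root=None):
    """Build the prefetch and fastq-dump command lines for one run.

    sra_root defaults to ~/ncbi/public/sra, the download location
    observed above (an assumption; it depends on the toolkit setup).
    """
    if sra_root is None:
        sra_root = Path.home() / "ncbi" / "public" / "sra"
    sra_file = Path(sra_root) / f"{srr_id}.sra"
    return [
        ["prefetch", srr_id],
        ["fastq-dump", "--outdir", str(outdir), str(sra_file)],
    ]


def download_fastq(srr_id, outdir):
    """Run prefetch, then fastq-dump on the downloaded .sra file."""
    for cmd in sra_commands(srr_id, outdir):
        # check=True raises CalledProcessError if either tool fails
        subprocess.run(cmd, check=True)
```

Calling `download_fastq("<SRR ID>", "<output directory>")` would run both commands in order; keeping the command construction in its own function makes it easy to inspect or reuse in a Snakemake rule.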

ADD COMMENT • link updated 4.3 years ago by Ram 43k • written 8.3 years ago by rioualen ▴ 710

0

Never noticed that one! Nice to know; wonder if one could use that with Aspera...

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 8.3 years ago by Chris Fields ★ 2.2k