Question: Looking for ways to download ChIP-seq datasets programmatically in a pipeline
3.2 years ago by
rioualen330
France
rioualen330 wrote:

Hello,

I am developing customized pipelines for ChIP-seq analysis using Snakemake. I want to share them, so I created model workflows that people can execute immediately after downloading the code. They handle file conversion, mapping, peak-calling, and so on, and use public data from the GEO database. However, users currently have to download these data themselves. I would like to include an automatic download of the data (sra or fastq files), ideally using GSM/GSE or SRR identifiers.

So far I've found several ways:

* SRA toolkit's fastq-dump function.

fastq-dump --outdir <outdir> <srr_ids>

However this way is insanely slow (as stated here).

* SRAdb R package

getSRAfile( in_acc = "<srr_ids>", sra_con = sra_con, destDir = <dir>, fileType = 'sra' )

This requires using this command first:

geometadbfile <- getSRAdbFile(destdir = <dir>, destfile = "SRAmetadb.sqlite.gz")

which downloads a 16 GB SQLite file locally. That would be fine for my own use, but I don't want to force users of my pipeline to do the same...

* Biopython's Bio.Geo module

Not sure how this one works... http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc123

Entrez.esearch doesn't help me find the FTP URL of the data files.


Surely there is a simpler way to download the data?

Any idea will be greatly appreciated!

chip-seq next-gen geo • 1.4k views
ADD COMMENTlink modified 3.2 years ago by matted7.0k • written 3.2 years ago by rioualen330

This is not related to the main question, since my experience is limited and wouldn't be very useful. But I am interested in testing these pipelines of yours if they are publicly available!

ADD REPLYlink written 3.2 years ago by Sam70

Hi, thanks for your interest. My code is available on GitHub. Please note that it's under development, and there's still a lot to do! I'm also developing a virtual machine, in order to simplify the distribution.

ADD REPLYlink written 3.2 years ago by rioualen330

I sympathise with your frustration. Getting data and metadata from GEO programmatically doesn't seem to be straightforward.

ADD REPLYlink written 3.2 years ago by dariober9.9k

Just to be clear, GEO and SRA are two totally separate databases and GEO does not host sequencing data at all.

ADD REPLYlink written 3.2 years ago by Sean Davis25k

True; the main use of GEO would be finding common experimental data sets (you would then have to use elink to retrieve the linked SRA information).
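
A minimal sketch of the GEO-to-SRA lookup mentioned here, assuming the NCBI E-utilities endpoints `esearch.fcgi` and `elink.fcgi`, the GEO DataSets database name `gds`, and a link from `gds` to `sra`. The accession and UID are placeholders, and the helper names are hypothetical. The functions only build the request URLs; fetching them and parsing the XML responses is left out.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities (assumed to be the public endpoint).
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(gse_accession):
    """URL that searches GEO DataSets (db=gds) for a GSE accession."""
    params = {"db": "gds", "term": f"{gse_accession}[ACCN]"}
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"


def elink_url(gds_uid):
    """URL that asks for SRA records linked to one GEO DataSets UID."""
    params = {"dbfrom": "gds", "db": "sra", "id": gds_uid}
    return f"{EUTILS}/elink.fcgi?{urlencode(params)}"


# Placeholder identifiers, for illustration only.
print(esearch_url("GSE12345"))
print(elink_url("200012345"))
```

The UIDs returned by the first request would be passed to the second one; the linked SRA records then still have to be resolved to run (SRR) accessions.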

ADD REPLYlink written 3.2 years ago by Chris Fields2.1k
3.2 years ago by
rioualen330
France
rioualen330 wrote:

I just found out here that I can download the sra files with SRA toolkit, using the prefetch command:

prefetch <SRR ID>

It's quite fast; the only issue is that there seems to be no output directory option. Note that the data is automatically downloaded to /home/<USER>/ncbi/public/sra/<SRR ID>.sra (this is not mentioned in the documentation!)

I can then run fastq-dump to get fastq files:

fastq-dump --outdir <output directory> /home/<USER>/ncbi/public/sra/<SRR ID>.sra

Surprisingly, this looks a lot faster than doing both steps at once with fastq-dump...
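
For use inside a pipeline, the two steps above could be wrapped roughly as follows. This is only a sketch: it assumes `prefetch` and `fastq-dump` are on the PATH and that `prefetch` writes to `~/ncbi/public/sra/` as observed above (that location may differ with other toolkit versions or configurations). The function names and the `sra_cache` parameter are hypothetical.

```python
import subprocess
from pathlib import Path

# Download location observed above; an assumption, not a documented default.
DEFAULT_SRA_CACHE = Path.home() / "ncbi" / "public" / "sra"


def sra_commands(srr_id, outdir, sra_cache=DEFAULT_SRA_CACHE):
    """Build the prefetch and fastq-dump command lines for one run."""
    sra_file = Path(sra_cache) / f"{srr_id}.sra"
    prefetch = ["prefetch", srr_id]
    dump = ["fastq-dump", "--outdir", str(outdir), str(sra_file)]
    return prefetch, dump


def fetch_fastq(srr_id, outdir):
    """Run both steps in order, stopping if either command fails."""
    for cmd in sra_commands(srr_id, outdir):
        subprocess.run(cmd, check=True)
```

Calling `fetch_fastq("<SRR ID>", "<output directory>")` would then download the .sra file and convert it. Keeping the command construction separate from the execution makes it easy to check the commands without touching the network.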

ADD COMMENTlink written 3.2 years ago by rioualen330

Never noticed that one!  Nice to know; wonder if one could use that with Aspera...

ADD REPLYlink written 3.2 years ago by Chris Fields2.1k
3.2 years ago by
Chris Fields2.1k
University of Illinois Urbana-Champaign
Chris Fields2.1k wrote:

I don't think you will find a fast way to generate FASTQ test data sets directly from SRA; you're fighting against three things: extracting the data from the archive, decompressing it (I believe this happens on the client end, not at NCBI), and competing for network bandwidth.  We used this method ourselves for a large RNA-Seq data retrieval and basically left the process running for a week or two to grab it all.  We did a few more and found we could speed things up a bit by using the Aspera interface and running SRA tools locally on the downloaded .sra file, but we never looked into it further after that initial data set (i.e. 'it worked for us even though it sucked').

To give an idea on some prior art in the same area, there is this paper:

http://www.g3journal.org/content/4/2/209.long

They specifically mention using GEO for this for about 900 ChIP-Seq data sets. Might be worth contacting them to find out whether they ended up developing a pipeline for it.

ADD COMMENTlink written 3.2 years ago by Chris Fields2.1k
3.2 years ago by
matted7.0k
Boston, United States
matted7.0k wrote:

Last I checked, the EBI ENA has the data in fastq format and should (in theory) be a mirror of the NCBI SRA.

If you look at a given study there (for example this one), there's a "TEXT" link that will give you a tab-separated file where one column is full FTP download URLs (of the fastqs).

I haven't investigated whether there's an API or another clean way to retrieve this, or whether you'd just need to do it yourself in a hacky way.  But it might save you time if you can avoid the slow fastq-dump step, particularly if you're already in Europe (though you lose Aspera, which may or may not tip the balance).
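
The tab-separated report described above could be parsed along these lines. This is a sketch under assumptions: the URL column is taken to be named `fastq_ftp`, multiple files per run are taken to be separated by semicolons, and entries may lack the `ftp://` prefix. The sample rows are made up for illustration and do not point to real files; `fastq_urls` is a hypothetical helper.

```python
import csv
import io


def fastq_urls(report_text, column="fastq_ftp"):
    """Collect FASTQ download URLs from a tab-separated report."""
    reader = csv.DictReader(io.StringIO(report_text), delimiter="\t")
    urls = []
    for row in reader:
        # A run may list several files (e.g. paired-end reads).
        for entry in (row.get(column) or "").split(";"):
            entry = entry.strip()
            if not entry:
                continue
            # Add the scheme when the report omits it.
            if "://" not in entry:
                entry = "ftp://" + entry
            urls.append(entry)
    return urls


# Made-up report content, for illustration only.
sample = (
    "run_accession\tfastq_ftp\n"
    "SRR0000001\tftp.example.org/fastq/SRR0000001_1.fastq.gz;"
    "ftp.example.org/fastq/SRR0000001_2.fastq.gz\n"
    "SRR0000002\tftp://ftp.example.org/fastq/SRR0000002.fastq.gz\n"
)

for url in fastq_urls(sample):
    print(url)
```

The resulting list can be handed to any downloader, so the pipeline never needs fastq-dump for runs that are mirrored this way.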

ADD COMMENTlink written 3.2 years ago by matted7.0k

Thanks, this is very helpful! I don't know the ENA very well yet; I'll look into it.

ADD REPLYlink written 3.2 years ago by rioualen330