I am developing customized pipelines for ChIP-seq analysis using Snakemake. I want to share them, so I created model workflows that people can execute immediately after downloading the code. Each handles file conversion, mapping, peak-calling... and uses public data from the GEO database. However, it currently requires people to download these data themselves. I would like to include an automatic download of the data (SRA or FASTQ files), ideally from GSM/GSE or SRR identifiers.
So far I've found several options:
* SRA toolkit's fastq-dump function.
fastq-dump --outdir <outdir> <srr_ids>
However, this approach is insanely slow (as stated here).
* SRAdb R package
getSRAfile(in_acc = c("<srr_ids>"), sra_con = sra_con, destDir = "<dir>", fileType = "sra")
This requires using this command first:
geometadbfile <- getSRAdbFile(destdir = "<dir>", destfile = "SRAmetadb.sqlite.gz")
which downloads a ~16 GB SQLite file locally. That could be fine if I were only using it myself, but I don't want users of my pipeline to be forced to do so...
* Biopython's Bio.Geo module
I'm not sure how this one works... http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc123
The Entrez.esearch object doesn't help me find the FTP URL or anything similar.
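To show what I mean, here is a minimal sketch of the kind of lookup I would hope Biopython could do. The Entrez calls in the comments are untested guesses on my part (the accession, email, and URLs are placeholders); only the runinfo-CSV parsing actually runs:

```python
import csv
import io

def runs_from_runinfo(runinfo_csv):
    """Parse an SRA 'runinfo' CSV and return (run_accession, download_path) pairs."""
    reader = csv.DictReader(io.StringIO(runinfo_csv))
    return [(row["Run"], row["download_path"]) for row in reader if row.get("Run")]

# Untested sketch of fetching the runinfo itself via Bio.Entrez
# (accession and email below are placeholders, not real values):
#
#   from Bio import Entrez
#   Entrez.email = "you@example.org"
#   handle = Entrez.esearch(db="sra", term="GSE12345[Accession]")
#   ids = Entrez.read(handle)["IdList"]
#   handle = Entrez.efetch(db="sra", id=",".join(ids),
#                          rettype="runinfo", retmode="text")
#   runinfo = handle.read()

# Offline demonstration on a made-up two-column excerpt of a runinfo table:
sample = (
    "Run,download_path\n"
    "SRR000001,https://sra-download.example/SRR000001\n"
    "SRR000002,https://sra-download.example/SRR000002\n"
)
print(runs_from_runinfo(sample))
```

If something like this gave me run accessions plus download URLs, wiring it into a Snakemake rule would be straightforward.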
I think there should be a simpler way to download these data?
Any idea will be greatly appreciated!