GCP Snakemake
3 months ago
Fadwa ▴ 10

Hi,

I'm working on a project that aims to use SRR files directly from NCBI's GCP storage in a Snakemake pipeline. I'm wondering whether it is possible to work on these files without uploading them on local or on VM. I'm looking for a fast way to use these data, because they take a long time to download using parallel-fastq-dump.

Thanks in advance for your response.

Best regards

snakemake GCP • 610 views

uploading them on local or on VM

What do you mean by that? Would data transfer from GCP to your local workspace not be downloading? You can stream files as required but downloading would be the way to go if you need random access or need to use the same file multiple times. You can always add a step in your pipeline to remove files once you're done using them.
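
As a rough illustration of the "remove files once you're done" idea, here is a minimal Python sketch. The names `scratch_copy` and `fake_fetch` are hypothetical; `fake_fetch` only stands in for a real download step. In Snakemake itself, marking an output with `temp()` should give similar cleanup behaviour without custom code.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def scratch_copy(fetch, suffix=""):
    """Yield a temporary file path filled by fetch(path); delete it afterwards."""
    fd, path = tempfile.mkstemp(suffix=suffix)
    os.close(fd)  # only the path is needed; fetch() reopens the file itself
    try:
        fetch(path)
        yield path
    finally:
        # Clean up even if the processing step raised an exception.
        if os.path.exists(path):
            os.remove(path)


def fake_fetch(path):
    # Hypothetical stand-in for a real download step.
    with open(path, "w") as handle:
        handle.write("@read1\nACGT\n+\nIIII\n")


with scratch_copy(fake_fetch, suffix=".fastq") as fastq:
    with open(fastq) as handle:
        n_lines = sum(1 for _ in handle)

print(n_lines)
```

The file only exists inside the `with` block, so disk usage stays bounded by the files currently being processed.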

I mean that I want to run the analysis without having to download the files first, because they are large. I'm wondering if there is a repository where NCBI can give us access to work directly on these files without keeping them in our local or VM machine.

I run my Snakemake pipeline directly on a GCP VM.

SRA data is available in the cloud, but not in the way you are imagining. You can't compute on the data while it is still in NCBI's bucket; you will need to copy it into your own bucket or VM first. Within the Google infrastructure, though, moving the data should be reasonably fast.

Thank you for your answer. Apart from the SRA Toolkit, I don't know whether there is a GCP-specific way to download SRA data.

On SRA records you should be able to see gs:// links (if available), such as gs://sra-pub-zq-4/SRR3452345/SRR3452345... You can then use gsutil to copy the data over to your VM, assuming egress is free (for many data sets it should be).
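
A minimal sketch of wrapping that copy step in Python, for example to call from a Snakemake `run:` block. The helper names `build_gsutil_cp` and `fetch`, the URI `gs://example-bucket/SRR0000000/SRR0000000.1` and the directory `data/sra` are placeholders, not real values; take the actual gs:// link from the SRA record. It assumes `gsutil` is installed and authenticated on the VM.

```python
import os
import shlex
import subprocess


def build_gsutil_cp(gs_uri, dest_dir):
    """Return the argument list for copying one object with `gsutil cp`."""
    if not gs_uri.startswith("gs://"):
        raise ValueError(f"not a gs:// URI: {gs_uri!r}")
    return ["gsutil", "cp", gs_uri, dest_dir]


def fetch(gs_uri, dest_dir, dry_run=True):
    """Run the copy, or only return the command line when dry_run is set."""
    cmd = build_gsutil_cp(gs_uri, dest_dir)
    if not dry_run:
        os.makedirs(dest_dir, exist_ok=True)
        subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return shlex.join(cmd)


# Placeholder URI and directory, for illustration only.
command_line = fetch("gs://example-bucket/SRR0000000/SRR0000000.1", "data/sra")
print(command_line)
```

With `dry_run=True` nothing is transferred, which makes it easy to check the command before running it on large files.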

Thank you so much! I'll try it.

where NCBI can give us access to work directly on these files without keeping them in our local or VM machine

A computer cannot work on data that it does not have access to. You seem to be looking for an API or a platform that provides access to arbitrary files, and I don't think that exists. Given that immediate-access data storage is the most expensive part of high-throughput sequence processing, I don't think anyone would be willing to provide such a platform to the general public.

Thank you so much :) :)
