Question

Anyone know any clever snakemake/SLURM tricks to run a big analysis with limited storage?

2.3 years ago

steel1990 ▴ 20

I am using a SLURM HPC to run jobs and have run into issues with storage. I have 3TB of storage and want to run over 1000 publicly available RNAseq datasets through my pipeline, which includes aligning with STAR. Obviously I must download the data in sections and run the pipeline multiple times.

Does anybody know any clever tricks to streamline this process?

Is there any way to configure a snakemake/SLURM pipeline to, let's say, run for 30 files, with all files except the counts being temp files, then once completed, run again for the next 30 files in a download list, and so on?

Any advice or guidance would be greatly appreciated!

SLURM genomics RNAseq Snakemake • 2.0k views

ADD COMMENT • link updated 2.3 years ago by Eric Lim ★ 2.1k • written 2.3 years ago by steel1990 ▴ 20

Do you need the alignments or just counts?

ADD REPLY • link 2.3 years ago by ATpoint 81k

I only need the count files. I have maybe 6 rules, a few of which have STAR alignment steps.

ADD REPLY • link 2.3 years ago by steel1990 ▴ 20

Then look at recount3: http://rna.recount.bio/

recount3 is an online resource consisting of RNA-seq gene, exon, and exon-exon junction counts as well as coverage bigWig files for 8,679 and 10,088 different studies for human and mouse respectively. It is the third generation of the ReCount project and part of recount.bio.

Maybe you don't have to align anything yourself. And even if you do, use a fast lightweight aligner such as salmon, which takes a fraction of the time and memory compared to STAR. But really, use recount.

ADD REPLY • link 2.3 years ago by ATpoint 81k

Salmon isn't appropriate for the analysis I am trying to do, unfortunately. Neither are already processed counts. Very interesting resource though, thanks for the tip!

ADD REPLY • link 2.3 years ago by steel1990 ▴ 20

score 2 · Answer 1 · 2021-12-28

2.3 years ago

Eric Lim ★ 2.1k

Have you tried --batch?

https://snakemake.readthedocs.io/en/stable/executing/cli.html?highlight=batch#dealing-with-very-large-workflows

ADD COMMENT • link 2.3 years ago by Eric Lim ★ 2.1k

This looks to be exactly what I'm looking for. If this was a very obvious question I apologise, I am relatively new to bioinformatics. I am still unsure how I could use this to streamline downloading SRA data in batches, though. I would still have to manually change a download rule to get the next files down my list each time, wouldn't I? Thanks again for the help.

ADD REPLY • link 2.3 years ago by steel1990 ▴ 20

Consider this contrived example.

(base) [~/Downloads/scratch/biostar/use_batch_flag]$ cat Snakefile 

# arbitrarily define some sra numbers
sra = range(0, 10)

rule run_workflow:
    input: expand('{sra}/{sra}.bam', sra=sra)

rule download_src:
    output:
        fq = touch('{sra}/{sra}.fq.gz')
    run:
        # implement code to download fq
        pass

rule run_star:
    input:
        fq = '{sra}/{sra}.fq.gz'
    output:
        bam = touch('{sra}/{sra}.bam')
    run:
        # implement code to run star alignment
        pass

Without --batch, snakemake would attempt to run all the samples.

(base) [~/Downloads/scratch/biostar/use_batch_flag]$ snakemake run_workflow --summary
Building DAG of jobs...
output_file date    rule    version log-file(s) status  plan
0/0.bam -   -   -   -   missing update pending
0/0.fq.gz   -   -   -   -   missing update pending
1/1.bam -   -   -   -   missing update pending
1/1.fq.gz   -   -   -   -   missing update pending
2/2.bam -   -   -   -   missing update pending
2/2.fq.gz   -   -   -   -   missing update pending
3/3.bam -   -   -   -   missing update pending
3/3.fq.gz   -   -   -   -   missing update pending
4/4.bam -   -   -   -   missing update pending
4/4.fq.gz   -   -   -   -   missing update pending
5/5.bam -   -   -   -   missing update pending
5/5.fq.gz   -   -   -   -   missing update pending
6/6.bam -   -   -   -   missing update pending
6/6.fq.gz   -   -   -   -   missing update pending
7/7.bam -   -   -   -   missing update pending
7/7.fq.gz   -   -   -   -   missing update pending
8/8.bam -   -   -   -   missing update pending
8/8.fq.gz   -   -   -   -   missing update pending
9/9.bam -   -   -   -   missing update pending
9/9.fq.gz   -   -   -   -   missing update pending

With --batch run_workflow=1/10, snakemake will run what's needed to generate 1 bam.

(base) [~/Downloads/scratch/biostar/use_batch_flag]$ snakemake run_workflow --batch run_workflow=1/10 --summary
Building DAG of jobs...
Considering only batch 1/10 (rule run_workflow) for DAG computation.
All jobs beyond the batching rule are omitted until the final batch.
Don't forget to run the other batches too.
output_file date    rule    version log-file(s) status  plan
0/0.bam -   -   -   -   missing update pending
0/0.fq.gz   -   -   -   -   missing update pending

In addition to --batch, you can mark intermediate outputs with temp() in your Snakefile so that snakemake deletes them automatically once every job that needs them has finished. https://snakemake.readthedocs.io/en/stable/tutorial/short.html?highlight=temp#temporary-files
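To avoid typing each batch by hand, the batches can be run one after another from a small driver script. The following is only a sketch, not part of the original answer: the function names batch_commands and run_batches are made up for illustration, the rule name run_workflow and the 10 batches come from the contrived example above, and it assumes snakemake is on your PATH. Extra arguments such as --cores or --profile are passed through unchanged, and the values shown are placeholders.

```python
import subprocess
from typing import List, Sequence


def batch_commands(
    rule: str, n_batches: int, extra_args: Sequence[str] = ()
) -> List[List[str]]:
    """Build one snakemake command line per batch (batches are numbered from 1)."""
    if n_batches < 1:
        raise ValueError("n_batches must be at least 1")
    return [
        ["snakemake", rule, "--batch", f"{rule}={i}/{n_batches}", *extra_args]
        for i in range(1, n_batches + 1)
    ]


def run_batches(rule: str, n_batches: int, extra_args: Sequence[str] = ()) -> None:
    """Run the batches in order, stopping at the first one that fails."""
    for cmd in batch_commands(rule, n_batches, extra_args):
        print("Running:", " ".join(cmd))
        # check=True raises CalledProcessError if snakemake exits non-zero,
        # so later batches are not started after a failure.
        subprocess.run(cmd, check=True)


# Preview only: print the commands without executing anything.
# "--cores 4" is a placeholder; swap in your own SLURM profile or options.
for cmd in batch_commands("run_workflow", 10, ["--cores", "4"]):
    print(" ".join(cmd))
```

Calling run_batches("run_workflow", 10, ["--cores", "4"]) would then execute the batches in sequence. With the intermediate outputs marked as temp(), each batch should clean up its own fastq and bam files before the next one starts downloading, though that is worth confirming on a small test set first. For roughly 1000 samples at about 30 per batch, n_batches would be around 34.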

Hope this is helpful.

ADD REPLY • link 2.3 years ago by Eric Lim ★ 2.1k

This is very helpful! Thank you for taking the time to write this.

ADD REPLY • link 2.3 years ago by steel1990 ▴ 20

Glad to hear! You're doing me a favor. I'm off this week and I'm bored outta my mind.

ADD REPLY • link 2.3 years ago by Eric Lim ★ 2.1k