I have a list of SRA accession numbers that I download using sratoolkit's fasterq-dump. Since I have a number of samples, instead of downloading them serially, I take advantage of array jobs. The script I use is the following:
#!/bin/bash
#$ -cwd
#$ -V
#$ -t 1-44
# Read the accession column of linker.csv (skipping the header line) into a bash array.
read -a samples <<< $(cut -d , -f 1 linker.csv | tail -n +2)
# SGE task ids start at 1, bash arrays at 0, so shift the index down by one.
false_index=$SGE_TASK_ID
true_index=$((false_index-1))
sample=${samples[$true_index]}
fasterq-dump "$sample" -O raw_samples >> downloading.log
The short version: I read the files to download into an array. The -t flag specifies the task range, so every task gets an id (the $SGE_TASK_ID variable) from 1 to 44. Based on that id, the sample is retrieved from the array by indexing, and fasterq-dump downloads it. This script works wonders.
But I want to implement it in a Snakemake pipeline, and to make that pipeline work I need to use its --cluster flag. I've managed to write one Snakemake rule to download the files, but only serially; I have no idea how I'd implement parallelization (technically, an array job). The Snakemake script is below:
import csv

# Build a mapping from the first CSV column (the run accession) to the third.
def read_dictionary():
    with open("linker.csv") as csv_file:
        csv_reader = csv.reader(csv_file, delimiter=",")
        dic = {row[0]: row[2] for row in csv_reader}
    return dic

SRA_MAPPING = read_dictionary()
SRAFILES = list(SRA_MAPPING.keys())[1:]  # drop the header row
rule download_files:
    output:
        "raw_samples/download.log"
    run:
        # Serial: a single task loops over every accession in turn.
        for file in SRAFILES:
            #shell("touch raw_samples/{file} >> raw_samples/download.log")
            shell("fasterq-dump {file} -O raw_samples >> {output}")
In a nutshell, it reads the samples into the global variable SRAFILES, and the run block then loops over that variable and calls fasterq-dump on each file. How would I implement "parallelization" of one job/rule?
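For reference, a minimal sketch of one way to do this: give every sample its own output file, so that each download becomes an independent job that Snakemake can schedule in parallel. The per-sample .done marker files here are my own assumption (fasterq-dump does not create them); they simply give each accession a distinct target:

rule all:
    input:
        expand("raw_samples/{sample}.done", sample=SRAFILES)

rule download_files:
    output:
        touch("raw_samples/{sample}.done")
    shell:
        "fasterq-dump {wildcards.sample} -O raw_samples"

Written this way, snakemake --jobs N --cluster "qsub ..." submits up to N downloads at once, so Snakemake's own scheduler takes over the role of the qsub -t 1-44 array range.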
I will try that, just to see whether it runs in parallel. Before I try it out: say my list has 10 elements. Could I pass -j 10 to specify that I want 10 tasks done in parallel? And for future-proofing, if I have a number of rules but only some require parallel processing, could I specify which rule should use how many threads? I guess what I'm asking is whether there is a way to translate "qsub -t 1-10" to the rules that require parallel processing. EDIT: That worked, thank you. For posterity, my full command line was:

snakemake --jobs 1 --cluster "qsub -V -cwd -now y"

Ivan, look into the resources directive to manage the number of jobs, memory usage, etc. I haven't used it much, but you should have quite a bit of flexibility there. If you get stuck, post here...
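For illustration, a sketch of what those directives can look like on the download rule above (the numbers are assumptions, and -e is fasterq-dump's own thread-count flag):

rule download_files:
    output:
        touch("raw_samples/{sample}.done")
    threads: 4                  # cores requested per job, for this rule only
    resources:
        mem_mb=2000             # illustrative memory request
    shell:
        "fasterq-dump {wildcards.sample} -e {threads} -O raw_samples"

Since threads is declared per rule, only the rules that need parallel resources have to specify it.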
Yes, that will have Snakemake run at most 10 parallel jobs.
--jobs 1 effectively disables parallelization.
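So to actually get the ten parallel downloads discussed above, the call would need something like the following (same qsub options as in the EDIT, only the job count changed):

snakemake --jobs 10 --cluster "qsub -V -cwd -now y"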