I have a list of SRA accession numbers that I download using sratoolkit's fasterq-dump. Since I have a number of samples, instead of downloading them serially, I take advantage of array jobs. The script I use is the following:
#!/bin/bash
#$ -cwd
#$ -V
#$ -t 1-44
# Read the accession column of linker.csv (skipping the header line) into a bash array.
read -a samples <<< $(cut -d , -f 1 linker.csv | tail -n +2)
# SGE task ids start at 1, bash arrays at 0, so shift the index down by one.
false_index=$SGE_TASK_ID
true_index=$((false_index-1))
sample=${samples[$true_index]}
fasterq-dump "$sample" -O raw_samples >> downloading.log
The short version: I read the files to download into an array. The -t flag specifies the task range, so every task gets an id (the $SGE_TASK_ID variable) from 1 to 44. Based on that id, the sample is retrieved from the array by indexing, and fasterq-dump downloads it. This script works wonders.
But I want to implement it in a Snakemake pipeline, and to make that pipeline work I need to use its --cluster flag. I've managed to write one Snakemake rule to download the files, but only serially; I have no idea how I'd implement parallelization (technically, an array job). The Snakemake script is below:
import csv

# Build a mapping from the first CSV column (the run accession) to the third.
def read_dictionary():
    with open("linker.csv") as csv_file:
        csv_reader = csv.reader(csv_file, delimiter=",")
        dic = {row[0]: row[2] for row in csv_reader}
    return dic

SRA_MAPPING = read_dictionary()
SRAFILES = list(SRA_MAPPING.keys())[1:]  # drop the header row
rule download_files:
    output:
        "raw_samples/download.log"
    run:
        # Serial: a single task loops over every accession in turn.
        for file in SRAFILES:
            #shell("touch raw_samples/{file} >> raw_samples/download.log")
            shell("fasterq-dump {file} -O raw_samples >> {output}")
In a nutshell, it reads the samples into the global variable SRAFILES, and the run block then loops over that variable and calls fasterq-dump on each file. How would I implement "parallelization" of one job/rule?
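For reference, a minimal sketch of one way to do this: give every sample its own output file, so that each download becomes an independent job that Snakemake can schedule in parallel. The per-sample .done marker files here are my own assumption (fasterq-dump does not create them); they simply give each accession a distinct target:

rule all:
    input:
        expand("raw_samples/{sample}.done", sample=SRAFILES)

rule download_files:
    output:
        touch("raw_samples/{sample}.done")
    shell:
        "fasterq-dump {wildcards.sample} -O raw_samples"

Written this way, snakemake --jobs N --cluster "qsub ..." submits up to N downloads at once, so Snakemake's own scheduler takes over the role of the qsub -t 1-44 array range.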
I will try that, just to see whether it runs in parallel. Before I try it out: say my list has 10 elements. Could I pass -j 10 to specify that I want 10 tasks done in parallel? And for future-proofing, if I have a number of rules but only some require parallel processing, could I specify which rule should use how many threads? I guess what I'm asking is whether there is a way to translate "qsub -t 1-10" to the rules that require parallel processing. EDIT: That worked, thank you. For posterity, my full command line was:

snakemake --jobs 1 --cluster "qsub -V -cwd -now y"

Ivan, look into the resources directive to manage the number of jobs, memory usage, etc. I haven't used it much, but you should have quite a bit of flexibility there. If you get stuck, post here...
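For illustration, a sketch of what those directives can look like on the download rule above (the numbers are assumptions, and -e is fasterq-dump's own thread-count flag):

rule download_files:
    output:
        touch("raw_samples/{sample}.done")
    threads: 4                  # cores requested per job, for this rule only
    resources:
        mem_mb=2000             # illustrative memory request
    shell:
        "fasterq-dump {wildcards.sample} -e {threads} -O raw_samples"

Since threads is declared per rule, only the rules that need parallel resources have to specify it.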
Yes, that will have Snakemake run at most 10 parallel jobs.
--jobs 1 effectively disables parallelization.
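So to actually get the ten parallel downloads discussed above, the call would need something like the following (same qsub options as in the EDIT, only the job count changed):

snakemake --jobs 10 --cluster "qsub -V -cwd -now y"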