Question

How to ensure sequential execution of rules in Snakefile one after another?

3

Entering edit mode

5.0 years ago

yifangt86 ▴ 60

Hello, I was struggling for a while about the execution of Snakefile rules. I have a scenario that some rules must be completed before the next one starts because of the dependency, especially when parallel processes are used with the -j option. I met this situation when the sample sizes are very different causing the completion of each step varied too much, some may be 1 hour difference. And this caused the execution of the rules snaked badly and caused error because the execution time cost are so different!

My question is: How to ensure Snakefile rules execution to complete one after another?

Following is a fragment of my Snakefile, and some details may be missing when I copy-n-paste.

SAMPLES, = glob_wildcards("raw/{sample}_R1.fq.gz")
READS=["R1", "R2"]

rule all:
  input: expand("qc_raw/{sample}_{read}_fastqc.html", sample=SAMPLES, read=READS),
         expand("seqs_trimmed/{sample}_trimmed_PE_{read}.fq.gz", sample=SAMPLES, read=READS),
         expand("qc_trimmed/{sample}_trimmed_PE_{read}_fastqc.html", sample=SAMPLES, read=READS),
         expand("alignment/{sample}.bam", sample=SAMPLES),
         expand("snps_out/SNPs_candidates.g.vcf"),
         expand("snps_out/combined_gvcf")

rule fastqc_raw:
    input:  expand("seqs_raw/{sample}_{read}.fq.gz", sample=SAMPLES, read=READS)
    output: expand("qc_raw/{sample}_{read}_fastqc.html", sample=SAMPLES, read=READS)
    log:    expand("logs/fastqc/{sample}_{read}_raw.fastqc.err", sample=SAMPLES, read=READS)
    shell: """ 
      mkdir -p logs/fastqc 
      mkdir -p qc_raw
      fastqc --outdir qc_raw  --thread 8 --nogroup {input} \
             >> data3/fastqc/raw/fastqc.log 2>>{log}
        """

rule trimming:
    input:
        fwd = "seqs_raw/{sample}_R1.fq.gz",
        rev = "seqs_raw/{sample}_R2.fq.gz"
    output:
        fwd_paired = "seqs_trimmed/{sample}_trimmed_PE_R1.fq.gz",
        fwd_unpaired = "seqs_trimmed/{sample}_trimmed_SE_R1.fq.gz",
        rev_paired = "seqs_trimmed/{sample}_trimmed_PE_R2.fq.gz",
        rev_unpaired = "seqs_trimmed/{sample}_trimmed_SE_R2.fq.gz"
    params:
       TRIMMOMATIC_JAR = "/path/to/download-software/trimmomatic-0.38-1/trimmomatic.jar",
       PAIR = "PE SE".split(),
       TRIM_OPTS = "-phred33 ILLUMINACLIP:" + ADAPTORS + ":2:30:10 LEADING:20 TRAILING:20 HEADCROP:18 SLIDINGWINDOW:4:20 MINLEN:36 "
      log:  "logs/trimmomatic/{sample}.PE.trimomatic.log"
      shell: """
        mkdir -p logs/trimmomatic
        java -jar {params.TRIMMOMATIC_JAR} \
           {params.PAIR[0]} {input.fwd} {input.rev} \
           {output.fwd_paired} {output.fwd_unpaired} \
           {output.rev_paired} {output.rev_unpaired} \
           {params.TRIM_OPTS}  2>{log}
        """

rule fastqc_trimmed:
    input:  expand("seqs_trimmed/{sample}_trimmed_PE_{read}.fq.gz", sample=SAMPLES, read=READS)
    output: expand("qc_trimmed/{sample}_trimmed_PE_{read}_fastqc.html", sample=SAMPLES, read=READS)
    log:    expand("logs/fastqc/{sample}_{read}_trimmed.fastqc.err", sample=SAMPLES, read=READS)
    shell: """ 
      mkdir -p qc_trimmed
      fastqc --outdir qc_trimmed --thread 8 --nogroup {input} \
             >> qc_trimmed/fastqc.log 2>>{log} 
          """

rule alignment:
        input: fwd="seqs_trimmed/{sample}_trimmed_PE_R1.fq.gz",
               rev="seqs_trimmed/{sample}_trimmed_PE_R2.fq.gz"
        output: "alignment/{sample}.bam"
        params: threads=8,
           indexbase=FASTA
        log:
           "logs/bwa/{sample}.log"
        shell: """
           mkdir -p alignment
           mkdir -p logs/bwa
           scripts/map_bwamem_v2.sh {input.fwd} {input.rev} {params.threads} {params.indexbase} | \
           samtools view -b -hF 256 - > {output} 2>{log}
        """

rule SNP_indiv:
        input: "snps_out/{sample}..bam"
        output: "snps_out/{sample}.raw.snps.indels.g.vcf"
        params: ref_fasta=FASTA,
                gatk4_jar="/storage/ppl/yifang/download-software/anaconda3/envs/exome/share/gatk4-4.1.0.0-0/gatk-package-4.1.0.0-local.jar"
        log: "logs/SNPcalling/{sample}.snp_indiv.log"
        shell: """
             mkdir -p logs/SNPcalling
             java -DGATK_STACKTRACE_ON_USER_EXCEPTION=true \
              -jar {params.gatk4_jar} HaplotypeCaller \
              --reference {params.ref_fasta} --input {input} \
              --emit-ref-confidence GVCF --output {output} 2>{log}
            """

rule combine_candidates:
        input:  "snps_out/GB4450.raw.snps.indels.g.vcf"
        output: "snps_out/SNPs_candidates.g.vcf"
        shell: """
              grep "^#" snps_out/GB4450.raw.snps.indels.g.vcf > snps_out/tmp.g.vcf
              grep -v "^#" <(cat snps_out/*.raw.snps.indels.g.vcf) | sort -V >> snps_out/tmp.g.vcf
              awk '!/Scaffold/ && !A[$1$2$3]++' snps_out/tmp.g.vcf > {output}
            """

rule combine_gvcf:
        input: "snps_out/{sample}.raw.snps.indels.g.vcf"
        output: "snps_out/combined_gvcf"
        params: gvcf_file_list="gvcf_files.list",
                gatk4_jar="/path/to/download-software/gatk4-4.1.0.0-0/gatk-package-4.1.0.0-local.jar"
        shell:"""
            java -DGATK_STACKTRACE_ON_USER_EXCEPTION=true \
                -jar {params.gatk4_jar} GenomicsDBImport -V {params.gvcf_file_list} \
                --genomicsdb-workspace-path {output}
           """

How to ensure the execution of each rule is completed before the one downstream starts? For example, trimming must be finished before fastqc_trimmed starts; and rule alignment must be finished before SNP_indiv starts, so on. Especially, SNP_indiv must be finished for ALL the samples before combine_candidates because combine_candidates is the union of the results all samples from SNP_indiv.

I am aware Snakemake will handle the execution of each rules and individual samples, but when dependency is related to ALL the samples, completion of the upstream rules must be guaranteed.

In bash script, the command wait can be used, what is the trick for this scenario in Snakemake?

Thanks a lot!

Snakefile execution completion • 5.9k views

ADD COMMENT • link 5.0 years ago by yifangt86 ▴ 60

score 4 · Answer 1 · 2019-04-24

If fastqc_trimmed can only start after trimming, you must give the output of trimming as an input for fastqc_trimmed

So for example:

rule trimming:
    input:
        fwd = "seqs_raw/{sample}_R1.fq.gz",
        rev = "seqs_raw/{sample}_R2.fq.gz"
    output:
        fwd_paired = "seqs_trimmed/{sample}_trimmed_PE_R1.fq.gz",
        fwd_unpaired = "seqs_trimmed/{sample}_trimmed_SE_R1.fq.gz",
        rev_paired = "seqs_trimmed/{sample}_trimmed_PE_R2.fq.gz",
        rev_unpaired = "seqs_trimmed/{sample}_trimmed_SE_R2.fq.gz"
    params:
       TRIMMOMATIC_JAR = "/path/to/download-software/trimmomatic-0.38-1/trimmomatic.jar",
       PAIR = "PE SE".split(),
       TRIM_OPTS = "-phred33 ILLUMINACLIP:" + ADAPTORS + ":2:30:10 LEADING:20 TRAILING:20 HEADCROP:18 SLIDINGWINDOW:4:20 MINLEN:36 "
      log:  "logs/trimmomatic/{sample}.PE.trimomatic.log"
      shell: """
        mkdir -p logs/trimmomatic
        java -jar {params.TRIMMOMATIC_JAR} \
           {params.PAIR[0]} {input.fwd} {input.rev} \
           {output.fwd_paired} {output.fwd_unpaired} \
           {output.rev_paired} {output.rev_unpaired} \
           {params.TRIM_OPTS}  2>{log}
        """

rule fastqc_trimmed:
    input:  
        inputFiles = expand("seqs_trimmed/{sample}_trimmed_PE_{read}.fq.gz", sample=SAMPLES, read=READS),
        trimmedFiles = rules.trimming.output.fwd_paired 
    output: expand("qc_trimmed/{sample}_trimmed_PE_{read}_fastqc.html", sample=SAMPLES, read=READS)
    log:    expand("logs/fastqc/{sample}_{read}_trimmed.fastqc.err", sample=SAMPLES, read=READS)
    shell: """ 
      mkdir -p qc_trimmed
      fastqc --outdir qc_trimmed --thread 8 --nogroup {input} \
             >> qc_trimmed/fastqc.log 2>>{log}

score 2 · Answer 2 · 2019-04-24

Just found the source to handle sequential execution as Rule dependencies. Leave it here for reference, if it is not annoying.

From verion 2.4.8 on, rules can also refer to the output of other rules in the Snakefile, e.g.:
rule a:
    input:  "path/to/input"
    output: "path/to/output"
    shell:  ...

rule b:
    input:  rules.a.output
    output: "path/to/output/of/b"
    shell:  ...
  
Importantly, be aware that referring to rule a here requires that rule a was defined above rule b in the file, since the object has to be known already. This feature also allows to resolve dependencies that are ambiguous when using filenames.

Note that when the rule you refer to defines multiple output files but you want to require only a subset of those as input for another rule, you should name the output files and refer to them specifically:
rule a:
    input:  "path/to/input"
    output: a = "path/to/output", b = "path/to/output2"
    shell:  ...

rule b:
    input:  rules.a.output.a
    output: "path/to/output/of/b"
    shell:  ...