How to ensure sequential execution of rules in Snakefile one after another?
2
3
Entering edit mode
5.0 years ago
yifangt86 ▴ 60

Hello, I was struggling for a while about the execution of Snakefile rules. I have a scenario that some rules must be completed before the next one starts because of the dependency, especially when parallel processes are used with the -j option. I met this situation when the sample sizes are very different causing the completion of each step varied too much, some may be 1 hour difference. And this caused the execution of the rules snaked badly and caused error because the execution time cost are so different!

My question is: How to ensure Snakefile rules execution to complete one after another?

Following is a fragment of my Snakefile, and some details may be missing when I copy-n-paste.

SAMPLES, = glob_wildcards("raw/{sample}_R1.fq.gz")
READS=["R1", "R2"]

rule all:
  input: expand("qc_raw/{sample}_{read}_fastqc.html", sample=SAMPLES, read=READS),
         expand("seqs_trimmed/{sample}_trimmed_PE_{read}.fq.gz", sample=SAMPLES, read=READS),
         expand("qc_trimmed/{sample}_trimmed_PE_{read}_fastqc.html", sample=SAMPLES, read=READS),
         expand("alignment/{sample}.bam", sample=SAMPLES),
         expand("snps_out/SNPs_candidates.g.vcf"),
         expand("snps_out/combined_gvcf")

rule fastqc_raw:
    input:  expand("seqs_raw/{sample}_{read}.fq.gz", sample=SAMPLES, read=READS)
    output: expand("qc_raw/{sample}_{read}_fastqc.html", sample=SAMPLES, read=READS)
    log:    expand("logs/fastqc/{sample}_{read}_raw.fastqc.err", sample=SAMPLES, read=READS)
    shell: """ 
      mkdir -p logs/fastqc 
      mkdir -p qc_raw
      fastqc --outdir qc_raw  --thread 8 --nogroup {input} \
             >> data3/fastqc/raw/fastqc.log 2>>{log}
        """

rule trimming:
    input:
        fwd = "seqs_raw/{sample}_R1.fq.gz",
        rev = "seqs_raw/{sample}_R2.fq.gz"
    output:
        fwd_paired = "seqs_trimmed/{sample}_trimmed_PE_R1.fq.gz",
        fwd_unpaired = "seqs_trimmed/{sample}_trimmed_SE_R1.fq.gz",
        rev_paired = "seqs_trimmed/{sample}_trimmed_PE_R2.fq.gz",
        rev_unpaired = "seqs_trimmed/{sample}_trimmed_SE_R2.fq.gz"
    params:
       TRIMMOMATIC_JAR = "/path/to/download-software/trimmomatic-0.38-1/trimmomatic.jar",
       PAIR = "PE SE".split(),
       TRIM_OPTS = "-phred33 ILLUMINACLIP:" + ADAPTORS + ":2:30:10 LEADING:20 TRAILING:20 HEADCROP:18 SLIDINGWINDOW:4:20 MINLEN:36 "
      log:  "logs/trimmomatic/{sample}.PE.trimomatic.log"
      shell: """
        mkdir -p logs/trimmomatic
        java -jar {params.TRIMMOMATIC_JAR} \
           {params.PAIR[0]} {input.fwd} {input.rev} \
           {output.fwd_paired} {output.fwd_unpaired} \
           {output.rev_paired} {output.rev_unpaired} \
           {params.TRIM_OPTS}  2>{log}
        """

rule fastqc_trimmed:
    input:  expand("seqs_trimmed/{sample}_trimmed_PE_{read}.fq.gz", sample=SAMPLES, read=READS)
    output: expand("qc_trimmed/{sample}_trimmed_PE_{read}_fastqc.html", sample=SAMPLES, read=READS)
    log:    expand("logs/fastqc/{sample}_{read}_trimmed.fastqc.err", sample=SAMPLES, read=READS)
    shell: """ 
      mkdir -p qc_trimmed
      fastqc --outdir qc_trimmed --thread 8 --nogroup {input} \
             >> qc_trimmed/fastqc.log 2>>{log} 
          """

rule alignment:
        input: fwd="seqs_trimmed/{sample}_trimmed_PE_R1.fq.gz",
               rev="seqs_trimmed/{sample}_trimmed_PE_R2.fq.gz"
        output: "alignment/{sample}.bam"
        params: threads=8,
           indexbase=FASTA
        log:
           "logs/bwa/{sample}.log"
        shell: """
           mkdir -p alignment
           mkdir -p logs/bwa
           scripts/map_bwamem_v2.sh {input.fwd} {input.rev} {params.threads} {params.indexbase} | \
           samtools view -b -hF 256 - > {output} 2>{log}
        """

rule SNP_indiv:
        input: "snps_out/{sample}..bam"
        output: "snps_out/{sample}.raw.snps.indels.g.vcf"
        params: ref_fasta=FASTA,
                gatk4_jar="/storage/ppl/yifang/download-software/anaconda3/envs/exome/share/gatk4-4.1.0.0-0/gatk-package-4.1.0.0-local.jar"
        log: "logs/SNPcalling/{sample}.snp_indiv.log"
        shell: """
             mkdir -p logs/SNPcalling
             java -DGATK_STACKTRACE_ON_USER_EXCEPTION=true \
              -jar {params.gatk4_jar} HaplotypeCaller \
              --reference {params.ref_fasta} --input {input} \
              --emit-ref-confidence GVCF --output {output} 2>{log}
            """

rule combine_candidates:
        input:  "snps_out/GB4450.raw.snps.indels.g.vcf"
        output: "snps_out/SNPs_candidates.g.vcf"
        shell: """
              grep "^#" snps_out/GB4450.raw.snps.indels.g.vcf > snps_out/tmp.g.vcf
              grep -v "^#" <(cat snps_out/*.raw.snps.indels.g.vcf) | sort -V >> snps_out/tmp.g.vcf
              awk '!/Scaffold/ && !A[$1$2$3]++' snps_out/tmp.g.vcf > {output}
            """

rule combine_gvcf:
        input: "snps_out/{sample}.raw.snps.indels.g.vcf"
        output: "snps_out/combined_gvcf"
        params: gvcf_file_list="gvcf_files.list",
                gatk4_jar="/path/to/download-software/gatk4-4.1.0.0-0/gatk-package-4.1.0.0-local.jar"
        shell:"""
            java -DGATK_STACKTRACE_ON_USER_EXCEPTION=true \
                -jar {params.gatk4_jar} GenomicsDBImport -V {params.gvcf_file_list} \
                --genomicsdb-workspace-path {output}
           """

How to ensure the execution of each rule is completed before the one downstream starts? For example, trimming must be finished before fastqc_trimmed starts; and rule alignment must be finished before SNP_indiv starts, so on. Especially, SNP_indiv must be finished for ALL the samples before combine_candidates because combine_candidates is the union of the results all samples from SNP_indiv.

I am aware Snakemake will handle the execution of each rules and individual samples, but when dependency is related to ALL the samples, completion of the upstream rules must be guaranteed.

In bash script, the command wait can be used, what is the trick for this scenario in Snakemake?

Thanks a lot!

Snakefile execution completion • 5.9k views
ADD COMMENT
4
Entering edit mode
5.0 years ago
gb ★ 2.2k

If fastqc_trimmed can only start after trimming, you must give the output of trimming as an input for fastqc_trimmed

So for example:

rule trimming:
    input:
        fwd = "seqs_raw/{sample}_R1.fq.gz",
        rev = "seqs_raw/{sample}_R2.fq.gz"
    output:
        fwd_paired = "seqs_trimmed/{sample}_trimmed_PE_R1.fq.gz",
        fwd_unpaired = "seqs_trimmed/{sample}_trimmed_SE_R1.fq.gz",
        rev_paired = "seqs_trimmed/{sample}_trimmed_PE_R2.fq.gz",
        rev_unpaired = "seqs_trimmed/{sample}_trimmed_SE_R2.fq.gz"
    params:
       TRIMMOMATIC_JAR = "/path/to/download-software/trimmomatic-0.38-1/trimmomatic.jar",
       PAIR = "PE SE".split(),
       TRIM_OPTS = "-phred33 ILLUMINACLIP:" + ADAPTORS + ":2:30:10 LEADING:20 TRAILING:20 HEADCROP:18 SLIDINGWINDOW:4:20 MINLEN:36 "
      log:  "logs/trimmomatic/{sample}.PE.trimomatic.log"
      shell: """
        mkdir -p logs/trimmomatic
        java -jar {params.TRIMMOMATIC_JAR} \
           {params.PAIR[0]} {input.fwd} {input.rev} \
           {output.fwd_paired} {output.fwd_unpaired} \
           {output.rev_paired} {output.rev_unpaired} \
           {params.TRIM_OPTS}  2>{log}
        """

rule fastqc_trimmed:
    input:  
        inputFiles = expand("seqs_trimmed/{sample}_trimmed_PE_{read}.fq.gz", sample=SAMPLES, read=READS),
        trimmedFiles = rules.trimming.output.fwd_paired 
    output: expand("qc_trimmed/{sample}_trimmed_PE_{read}_fastqc.html", sample=SAMPLES, read=READS)
    log:    expand("logs/fastqc/{sample}_{read}_trimmed.fastqc.err", sample=SAMPLES, read=READS)
    shell: """ 
      mkdir -p qc_trimmed
      fastqc --outdir qc_trimmed --thread 8 --nogroup {input} \
             >> qc_trimmed/fastqc.log 2>>{log}
ADD COMMENT
0
Entering edit mode

Thanks! Could you please elaborate more of this line to ensure the completion of trimming before fastqc_trimmed starts?

rule fastqc_trimmed:
        input:  
            inputFiles = expand(...),
            trimmedFiles = rules.trimming.output.fwd_paired 
        output: ...
        log: ...
        shell: """
            fastqc --outdir qc_trimmed --thread 8 --nogroup {input} \
         >> qc_trimmed/fastqc.log 2>{log}"""

Also, is the shell command line should be like this?

     fastqc --outdir qc_trimmed --thread 8 --nogroup {input.inputFiles} \     #cf.  {input}
         >> qc_trimmed/fastqc.log 2>{log}

I could not find more information on this from the snakemake site. I seem to catch your idea, but not understand it in full. Thanks a lot again.

ADD REPLY
2
Entering edit mode
5.0 years ago
yifangt86 ▴ 60

Just found the source to handle sequential execution as Rule dependencies. Leave it here for reference, if it is not annoying.

From verion 2.4.8 on, rules can also refer to the output of other rules in the Snakefile, e.g.:

rule a:
    input:  "path/to/input"
    output: "path/to/output"
    shell:  ...

rule b:
    input:  rules.a.output
    output: "path/to/output/of/b"
    shell:  ...
  

Importantly, be aware that referring to rule a here requires that rule a was defined above rule b in the file, since the object has to be known already. This feature also allows to resolve dependencies that are ambiguous when using filenames.

Note that when the rule you refer to defines multiple output files but you want to require only a subset of those as input for another rule, you should name the output files and refer to them specifically:

rule a:
    input:  "path/to/input"
    output: a = "path/to/output", b = "path/to/output2"
    shell:  ...

rule b:
    input:  rules.a.output.a
    output: "path/to/output/of/b"
    shell:  ...
  
ADD COMMENT

Login before adding your answer.

Traffic: 2932 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6