Question

Manage expected successful job exit but no output file in Snakemake

0

Entering edit mode

3.7 years ago

russianconcussion ▴ 30

Hi,

I'm trying to subsample a large collection of processed fastq files in a Snakemake workflow. I want to sequentially subsample these files by one third, a total of six times. All my fastq files are different sizes, so I expect that some of the subsamples will be empty before the sixth iteration is complete. This will cause the program ( reformat.sh from bbtools suite ) to exit successfully but not output a file. As a result I get a MissingOutputException from Snakemake and my workflow for those particular samples fails. I tried using checkpoints to manage the dynamic number of output files, but what I would like is a way to handle the MissingOutputException that I get from Snakemake so that the checkpoint just registers that there are fewer files than expected. Does anyone know how to do that or can you suggest a better strategy for serially subsampling within Snakemake? Maybe I'm using checkpoints incorrectly? This is tangentially related to this stack overflow post, which gave a more clever solution for reusing the same snakemake rule (though I couldn't get it to work because of a CyclicGraphException, see Snakemake docs)

Snakemake error message:

MissingOutputException in line 187 of /home/nmb/skin_metag/process_reads/Snakefile:
Job completed successfully, but some output files are missing. Missing files after 1 seconds:
SRP188475/SRR8731712/SRR8731712.subset5.lca_gather.csv
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
  File "/home/nmb/.local/lib/python3.8/site-packages/snakemake/executors/__init__.py", line 544, in handle_job_success
  File "/home/nmb/.local/lib/python3.8/site-packages/snakemake/executors/__init__.py", line 225, in handle_job_success

Relevant code snippet is here:

#rules

rule all:
    input:
        expand("{run}.subset{subset}.lca_gather.csv", run = names_list, subset = ['1','2','3','4','5','6'])

def getFastq(run):
    return(names_dict[run])

def getOut(run):
    parts = run.split('/')
    out_prefix = '/'.join([parts[0],parts[1]])
    return(out_prefix)

rule fastqc:
    input:
        r1 = lambda wildcards: getFastq(wildcards.run)[0],
        r2 = lambda wildcards: getFastq(wildcards.run)[1]
    params:
        out_prefix =  lambda wildcards: getOut(wildcards.run)
    output:
        "{run}_1_fastqc.html",
        "{run}_1_fastqc.zip",
        "{run}_2_fastqc.html",
        "{run}_2_fastqc.zip"
    threads: max_threads
    message: "STEP 1 - Running FastQC to check raw read quality."
    shell: "fastqc -o {params.out_prefix}/ -f fastq {input.r1} {input.r2}"

rule bbduk_trim:
    input:
        r1 = lambda wildcards: getFastq(wildcards.run)[0],
        r2 = lambda wildcards: getFastq(wildcards.run)[1]
    params:
        out_prefix =  lambda wildcards: getOut(wildcards.run)
    output:
        "{run}.trimmed.fastq.gz",
        "{run}.bbduk.stats"
    threads: max_threads/10
    message: "STEP 2 - Removing sequencing adapters, control DNA sequences, etc., and quality trimming reads."
    shell: "bbduk.sh -Xmx10g in={input.r1} in2={input.r2} out=stdout.fq ref=contamination.fa k=27 ktrim=r hdist=2 mink=16 threads={threads} | bbduk.sh -Xmx10g threads={threads} int=t in=stdin.fq out={output[0]} ref=contamination.fa k=23 ktrim=r hdist=0 ow mink=10 stats={output[1]}"

rule bbmap_clean:
    input:
        "{run}.trimmed.fastq.gz"
    output:
        "{run}.nonhuman.fastq.gz",
        "{run}.human.fastq.gz",
        "{run}.bbmap.stats"
    threads: max_threads/10
    message: "STEP 3: Removing reads that map to the human genome and saving them aside."
    shell: "bbmap.sh minid=0.95 maxindel=3 bwr=0.16 bw=12 quickmatch fast minhits=2 path=./ qtrim=rl trimq=10 untrim -Xmx24g in={input} outu={output[0]} outm={output[1]} threads={threads} statsfile={output[2]}"

rule bbnorm_lowpass:
    input:
        "{run}.nonhuman.fastq.gz"
    output:
        "{run}.highpass.fastq.gz",
        "{run}.lowpass.fastq.gz",
        "{run}.bbnorm.stats"
    threads: max_threads/10
    message: "STEP 4 - Removing reads with k-mers that occur less than 3x."
    shell: "bbnorm.sh -Xmx20g in={input} out={output[0]} outt={output[1]} passes=1 target=999999999 min=3 threads={threads} histout={output[2]}"

rule sourmash_compute:
    input:
        "{run}.highpass.fastq.gz"
    output:
        "{run}.sig,^(?!subset).*"
    threads: 1
    message: "STEP 5 - Computing the minhash signature with sourmash."
    shell: "sourmash compute --track-abundance --scaled 1000 -k 21,31,51 {input} --output {output}"

rule sourmash_compare:
    input:
        signatures_list
    output:
        "compare_k31",
        "compare_k31.csv",
        "compare_k51",
        "compare_k51.csv"
    threads: max_threads
    message: "STEP 5 - Comparing all 31-mer (species level) and all all 51-mer (species level) minhash signatures against each other, accounting for kmer abundance (abundance projection)."
    shell: "sourmash compare --processes {threads} -k 31 {input} --output {output[0]} --csv {output[1]} & sourmash compare --processes {threads} -k 51 {input} --output {output[2]} --csv {output[3]}"

rule sourmash_gather:
    input:
        "{run}.sig",
        "genbank.lca"
    output:
        "{run}.lca_gather.csv,^(?!subset).*"
    threads: 1
    message: "STEP 6 - Gathering genomes contained in the metagenome with sourmash."
    shell: "sourmash gather -k 31 --output {output} {input[0]} {input[1]}"

checkpoint reformat_subset1:
    input:
        "{run}.highpass.fastq.gz"
    output:
        "{run}.subset1.fastq.gz"
    threads: max_threads/10
    message: "STEP 7.1 - Subsetting processed reads by 0.3x."
    shell: "reformat.sh -Xmx20g in={input} out={output} samplerate=0.3 int=t"

checkpoint reformat_subset2:
    input:
        "{run}.subset1.fastq.gz"
    output:
        "{run}.subset2.fastq.gz"
    threads: max_threads/10
    message: "STEP 7.2 - Subsetting subset #1 reads by 0.3x."
    shell: "reformat.sh -Xmx20g in={input} out={output} samplerate=0.3 int=t"

checkpoint reformat_subset3:
    input:
        "{run}.subset2.fastq.gz"
    output:
        "{run}.subset3.fastq.gz"
    threads: max_threads/10
    message: "STEP 7.3 - Subsetting subset #2 reads by 0.3x."
    shell: "reformat.sh -Xmx20g in={input} out={output} samplerate=0.3 int=t"

checkpoint reformat_subset4:
    input:
        "{run}.subset3.fastq.gz"
    output:
        "{run}.subset4.fastq.gz"
    threads: max_threads/10
    message: "STEP 7.4 - Subsetting subset #3 reads by 0.3x."
    shell: "reformat.sh -Xmx20g in={input} out={output} samplerate=0.3 int=t"

checkpoint reformat_subset5:
    input:
        "{run}.subset4.fastq.gz"
    output:
        "{run}.subset5.fastq.gz"
    threads: max_threads/10
    message: "STEP 7.5 - Subsetting subset #4 reads by 0.3x."
    shell: "reformat.sh -Xmx20g in={input} out={output} samplerate=0.3 int=t"

checkpoint reformat_subset6:
    input:
        "{run}.subset5.fastq.gz"
    output:
        "{run}.subset6.fastq.gz"
    threads: max_threads/10
    message: "STEP 7.6 - Subsetting subset #5 reads by 0.3x."
    shell: "reformat.sh -Xmx20g in={input} out={output} samplerate=0.3 int=t"

checkpoint sourmash_subset_compute:
    input:
        "{run}.subset{subset}.fastq.gz"
    output:
        "{run}.subset{subset}.sig"
    threads: 1
    message: "STEP 8 - Computing the minhash signature with sourmash."
    shell: "sourmash compute --track-abundance --scaled 1000 -k 21,31,51 {input} --output {output}"

checkpoint sourmash_subset_gather:
    input:
        "{run}.subset{subset}.sig",
        "genbank.lca"
    output:
        "{run}.subset{subset}.lca_gather.csv"
    threads: 1
    message: "STEP 9 - Gathering genomes contained in the subset metagenome with sourmash."
    shell: "sourmash gather -k 31 --output {output} {input[0]} {input[1]}"

snakemake python shell • 2.3k views

ADD COMMENT • link updated 3.5 years ago by Biostar 20 • written 3.7 years ago by russianconcussion ▴ 30

1

Entering edit mode

How do you have the output files defined in rule all?

ADD REPLY • link 3.7 years ago by newbio17 ▴ 360

0

Entering edit mode

I'll just edit the post to show all of the snakemake rules above.

ADD REPLY • link 3.7 years ago by russianconcussion ▴ 30

0

Entering edit mode

I think the issue here is that you're telling snakemake to expect subset iteration 1 through 6 for all of your .fastq files. Do you need the intermediate subset .fastq files for what you're trying to achieve?

ADD REPLY • link 3.7 years ago by newbio17 ▴ 360

0

Entering edit mode

Assuming you do need all of your intermediate files, I found this answer on stackoverflow very helpful when I was doing something similar.

ADD REPLY • link 3.7 years ago by newbio17 ▴ 360