Manage expected successful job exit but no output file in Snakemake
0
0
Entering edit mode
7 months ago

Hi,

I'm trying to subsample a large collection of processed fastq files in a Snakemake workflow. I want to sequentially subsample these files by one third, a total of six times. All my fastq files are different sizes, so I expect that some of the subsamples will be empty before the sixth iteration is complete. This will cause the program ( reformat.sh from bbtools suite ) to exit successfully but not output a file. As a result I get a MissingOutputException from Snakemake and my workflow for those particular samples fails. I tried using checkpoints to manage the dynamic number of output files, but what I would like is a way to handle the MissingOutputException that I get from Snakemake so that the checkpoint just registers that there are fewer files than expected. Does anyone know how to do that or can you suggest a better strategy for serially subsampling within Snakemake? Maybe I'm using checkpoints incorrectly? This is tangentially related to this stack overflow post, which gave a more clever solution for reusing the same snakemake rule (though I couldn't get it to work because of a CyclicGraphException, see Snakemake docs)

Snakemake error message:

MissingOutputException in line 187 of /home/nmb/skin_metag/process_reads/Snakefile:
Job completed successfully, but some output files are missing. Missing files after 1 seconds:
SRP188475/SRR8731712/SRR8731712.subset5.lca_gather.csv
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
  File "/home/nmb/.local/lib/python3.8/site-packages/snakemake/executors/__init__.py", line 544, in handle_job_success
  File "/home/nmb/.local/lib/python3.8/site-packages/snakemake/executors/__init__.py", line 225, in handle_job_success

Relevant code snippet is here:

#rules

rule all:
    input:
        expand("{run}.subset{subset}.lca_gather.csv", run = names_list, subset = ['1','2','3','4','5','6'])

def getFastq(run):
    return(names_dict[run])

def getOut(run):
    parts = run.split('/')
    out_prefix = '/'.join([parts[0],parts[1]])
    return(out_prefix)

rule fastqc:
    input:
        r1 = lambda wildcards: getFastq(wildcards.run)[0],
        r2 = lambda wildcards: getFastq(wildcards.run)[1]
    params:
        out_prefix =  lambda wildcards: getOut(wildcards.run)
    output:
        "{run}_1_fastqc.html",
        "{run}_1_fastqc.zip",
        "{run}_2_fastqc.html",
        "{run}_2_fastqc.zip"
    threads: max_threads
    message: "STEP 1 - Running FastQC to check raw read quality."
    shell: "fastqc -o {params.out_prefix}/ -f fastq {input.r1} {input.r2}"

rule bbduk_trim:
    input:
        r1 = lambda wildcards: getFastq(wildcards.run)[0],
        r2 = lambda wildcards: getFastq(wildcards.run)[1]
    params:
        out_prefix =  lambda wildcards: getOut(wildcards.run)
    output:
        "{run}.trimmed.fastq.gz",
        "{run}.bbduk.stats"
    threads: max_threads/10
    message: "STEP 2 - Removing sequencing adapters, control DNA sequences, etc., and quality trimming reads."
    shell: "bbduk.sh -Xmx10g in={input.r1} in2={input.r2} out=stdout.fq ref=contamination.fa k=27 ktrim=r hdist=2 mink=16 threads={threads} | bbduk.sh -Xmx10g threads={threads} int=t in=stdin.fq out={output[0]} ref=contamination.fa k=23 ktrim=r hdist=0 ow mink=10 stats={output[1]}"

rule bbmap_clean:
    input:
        "{run}.trimmed.fastq.gz"
    output:
        "{run}.nonhuman.fastq.gz",
        "{run}.human.fastq.gz",
        "{run}.bbmap.stats"
    threads: max_threads/10
    message: "STEP 3: Removing reads that map to the human genome and saving them aside."
    shell: "bbmap.sh minid=0.95 maxindel=3 bwr=0.16 bw=12 quickmatch fast minhits=2 path=./ qtrim=rl trimq=10 untrim -Xmx24g in={input} outu={output[0]} outm={output[1]} threads={threads} statsfile={output[2]}"

rule bbnorm_lowpass:
    input:
        "{run}.nonhuman.fastq.gz"
    output:
        "{run}.highpass.fastq.gz",
        "{run}.lowpass.fastq.gz",
        "{run}.bbnorm.stats"
    threads: max_threads/10
    message: "STEP 4 - Removing reads with k-mers that occur less than 3x."
    shell: "bbnorm.sh -Xmx20g in={input} out={output[0]} outt={output[1]} passes=1 target=999999999 min=3 threads={threads} histout={output[2]}"

rule sourmash_compute:
    input:
        "{run}.highpass.fastq.gz"
    output:
        "{run}.sig,^(?!subset).*"
    threads: 1
    message: "STEP 5 - Computing the minhash signature with sourmash."
    shell: "sourmash compute --track-abundance --scaled 1000 -k 21,31,51 {input} --output {output}"

rule sourmash_compare:
    input:
        signatures_list
    output:
        "compare_k31",
        "compare_k31.csv",
        "compare_k51",
        "compare_k51.csv"
    threads: max_threads
    message: "STEP 5 - Comparing all 31-mer (species level) and all all 51-mer (species level) minhash signatures against each other, accounting for kmer abundance (abundance projection)."
    shell: "sourmash compare --processes {threads} -k 31 {input} --output {output[0]} --csv {output[1]} & sourmash compare --processes {threads} -k 51 {input} --output {output[2]} --csv {output[3]}"

rule sourmash_gather:
    input:
        "{run}.sig",
        "genbank.lca"
    output:
        "{run}.lca_gather.csv,^(?!subset).*"
    threads: 1
    message: "STEP 6 - Gathering genomes contained in the metagenome with sourmash."
    shell: "sourmash gather -k 31 --output {output} {input[0]} {input[1]}"

checkpoint reformat_subset1:
    input:
        "{run}.highpass.fastq.gz"
    output:
        "{run}.subset1.fastq.gz"
    threads: max_threads/10
    message: "STEP 7.1 - Subsetting processed reads by 0.3x."
    shell: "reformat.sh -Xmx20g in={input} out={output} samplerate=0.3 int=t"

checkpoint reformat_subset2:
    input:
        "{run}.subset1.fastq.gz"
    output:
        "{run}.subset2.fastq.gz"
    threads: max_threads/10
    message: "STEP 7.2 - Subsetting subset #1 reads by 0.3x."
    shell: "reformat.sh -Xmx20g in={input} out={output} samplerate=0.3 int=t"

checkpoint reformat_subset3:
    input:
        "{run}.subset2.fastq.gz"
    output:
        "{run}.subset3.fastq.gz"
    threads: max_threads/10
    message: "STEP 7.3 - Subsetting subset #2 reads by 0.3x."
    shell: "reformat.sh -Xmx20g in={input} out={output} samplerate=0.3 int=t"

checkpoint reformat_subset4:
    input:
        "{run}.subset3.fastq.gz"
    output:
        "{run}.subset4.fastq.gz"
    threads: max_threads/10
    message: "STEP 7.4 - Subsetting subset #3 reads by 0.3x."
    shell: "reformat.sh -Xmx20g in={input} out={output} samplerate=0.3 int=t"

checkpoint reformat_subset5:
    input:
        "{run}.subset4.fastq.gz"
    output:
        "{run}.subset5.fastq.gz"
    threads: max_threads/10
    message: "STEP 7.5 - Subsetting subset #4 reads by 0.3x."
    shell: "reformat.sh -Xmx20g in={input} out={output} samplerate=0.3 int=t"

checkpoint reformat_subset6:
    input:
        "{run}.subset5.fastq.gz"
    output:
        "{run}.subset6.fastq.gz"
    threads: max_threads/10
    message: "STEP 7.6 - Subsetting subset #5 reads by 0.3x."
    shell: "reformat.sh -Xmx20g in={input} out={output} samplerate=0.3 int=t"

checkpoint sourmash_subset_compute:
    input:
        "{run}.subset{subset}.fastq.gz"
    output:
        "{run}.subset{subset}.sig"
    threads: 1
    message: "STEP 8 - Computing the minhash signature with sourmash."
    shell: "sourmash compute --track-abundance --scaled 1000 -k 21,31,51 {input} --output {output}"

checkpoint sourmash_subset_gather:
    input:
        "{run}.subset{subset}.sig",
        "genbank.lca"
    output:
        "{run}.subset{subset}.lca_gather.csv"
    threads: 1
    message: "STEP 9 - Gathering genomes contained in the subset metagenome with sourmash."
    shell: "sourmash gather -k 31 --output {output} {input[0]} {input[1]}"
snakemake python shell • 481 views
ADD COMMENT
1
Entering edit mode

How do you have the output files defined in rule all?

ADD REPLY
0
Entering edit mode

I'll just edit the post to show all of the snakemake rules above.

ADD REPLY
0
Entering edit mode

I think the issue here is that you're telling snakemake to expect subset iteration 1 through 6 for all of your .fastq files. Do you need the intermediate subset .fastq files for what you're trying to achieve?

ADD REPLY
0
Entering edit mode

Assuming you do need all of your intermediate files, I found this answer on stackoverflow very helpful when I was doing something similar.

ADD REPLY

Login before adding your answer.

Traffic: 1654 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6