Snakemake doesn't recognize output files even though they are created
11 months ago
DdogBoss • 0

Part of the pipeline I am running creates output files, but Snakemake does not recognize them.

Code:

import os
import json
from datetime import datetime
from glob import iglob, glob
from snakemake.io import glob_wildcards

# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Define Constants ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ #

# discover input files using path from run config
SAMPLES = list(set(glob_wildcards(f"{config['fastq_dir']}/{{sample}}_R1_001.fastq.gz").sample))

# read output dir path from run config
OUTPUT_DIR = f"{config['output_dir']}"

# Project name and date for bam header
SEQID='yevo_pipeline_align'

#config['anc_r'] = list(set(glob_wildcards(f"{config['anc_dir']}/{{anc_r}}_R1_001.fastq.gz").anc_r))
#Error may have been in index ancestor_bam rule

anc_r = "/net/dunham/vol1/home/dennig2/yevo_pipeline/pipelineanc"
anc_t = "anc"
ref_dir = "/net/dunham/vol1/home/dennig2/yevo_pipeline/data/genome/sacCer3.fasta"

#TO DO change anc variables to anc_t
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Begin Pipeline ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ #

# https://snakemake.readthedocs.io/en/v7.14.0/tutorial/basics.html#step-7-adding-a-target-rule 
rule all:
    input:
        f'{OUTPUT_DIR}/DONE.txt'


# ~~~~~~~~~~~~~~~~~~~~~~~~~~ Set Up Reference Files ~~~~~~~~~~~~~~~~~~~~~~~~~ #

#
# export the current run configuration in JSON format
#
rule export_run_config:
    output:
        path=f"{OUTPUT_DIR}/00_logs/00_run_config.json"
    run:
        with open(output.path, 'w') as outfile:
            json.dump(dict(config), outfile, indent=4)


#
# make a list of discovered samples
#
rule list_samples:
    output:
        f"{OUTPUT_DIR}/00_logs/00_sample_list.txt"
    shell:
        # join with a literal '\n' escape so `echo -e` expands it; note that
        # with no discovered samples this writes an empty file
        "echo -e '{}' > {{output}}".format('\\n'.join(SAMPLES))


#
# copy the supplied reference genome fasta to the pipeline output directory for reference
#
rule copy_fasta:
    input:
        ref=f"{ref_dir}"
    output:
        f"{OUTPUT_DIR}/01_ref_files/reference"
    shell:
        "cp {input.ref} {output}"


rule index_fasta:
    input:
        rules.copy_fasta.output
    output:
        f"{rules.copy_fasta.output}.fai"
    conda:
        'envs/main.yml'
    shell:
        "samtools faidx {input}"


rule create_ref_dict:
    input:
        rules.copy_fasta.output
    output:
        # note: str.rstrip strips a set of characters, not a suffix, so the
        # original rstrip('fasta') expression builds the wrong path
        f"{rules.copy_fasta.output}.dict"
    conda:
        'envs/main.yml'
    shell:
        "picard CreateSequenceDictionary -R {input} -O {output}"

#
# create a BWA index from the copied fasta reference genome
#
rule create_bwa_index:
    input:
        rules.copy_fasta.output
    output:
        f"{rules.copy_fasta.output}.amb",
        f"{rules.copy_fasta.output}.ann",
        f"{rules.copy_fasta.output}.bwt",
        f"{rules.copy_fasta.output}.pac",
        f"{rules.copy_fasta.output}.sa",
    conda:
        'envs/main.yml'
    shell:
        "bwa index {input}"        

#
# Get the ancestor BAM
#



rule align_reads_anc:
    input:
        rules.create_bwa_index.output,
        ref=rules.copy_fasta.output,
        R1=f"{anc_r}/YMD4612_pink_S1_R1_001.fastq.gz",
        R2=f"{anc_r}/YMD4612_pink_S1_R2_001.fastq.gz",
    output:
        bam=f"{OUTPUT_DIR}/03_init_alignment/{anc_t}/{anc_t}_R1R2_sort.bam",
    params:
        # build the read-group string here: module-level variables and
        # rules.* are not substituted inside the shell format string
        rg=rf"@RG\tID:{SEQID}\tSM:{anc_t}\tLB:1",
    conda:
        'envs/main.yml'
    shell:
        "bwa mem -R '{params.rg}' {input.ref} {input.R1} {input.R2}"
        " | samtools sort -o {output.bam} -"
        " && samtools index {output.bam}"
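Before chasing the "missing output files" messages, it is worth confirming that the sample discovery at the top of the Snakefile actually matched something, since an empty `SAMPLES` would explain an empty sample list later on. The following is a minimal, self-contained sketch of what `glob_wildcards` does for this pattern (the helper name and the throwaway file names are made up for illustration, not part of the pipeline):

```python
# Standalone approximation of glob_wildcards for the pattern
# "{fastq_dir}/{sample}_R1_001.fastq.gz", useful for checking that
# `fastq_dir` in the run config actually matches files.
import os
import re
import tempfile
from glob import glob

def discover_samples(fastq_dir):
    """Return the {sample} values matched by <fastq_dir>/{sample}_R1_001.fastq.gz."""
    pattern = re.compile(r"(?P<sample>.+)_R1_001\.fastq\.gz$")
    samples = []
    for path in glob(os.path.join(fastq_dir, "*_R1_001.fastq.gz")):
        m = pattern.search(os.path.basename(path))
        if m:
            samples.append(m.group("sample"))
    return sorted(set(samples))

# quick self-check with throwaway files
with tempfile.TemporaryDirectory() as d:
    for name in ("YMD4612_pink_S1_R1_001.fastq.gz", "YMD4613_blue_S2_R1_001.fastq.gz"):
        open(os.path.join(d, name), "w").close()
    print(discover_samples(d))  # an empty list here means SAMPLES will be empty too
```

Running the same check against the real `config['fastq_dir']` would show immediately whether the glob pattern and the on-disk file names agree.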

Error:

rule list_samples:
    output: /net/dunham/vol1/home/dennig2/yevo_pipeline/pipelineoutput/00_logs/00_sample_list.txt
    jobid: 3
    reason: Missing output files: /net/dunham/vol1/home/dennig2/yevo_pipeline/pipelineoutput/00_logs/00_sample_list.txt
    resources: tmpdir=/tmp/293419245.1.sage-long.q


[Tue Mar 28 21:47:11 2023]
rule copy_fasta:
    input: /net/dunham/vol1/home/dennig2/yevo_pipeline/data/genome/sacCer3.fasta
    output: /net/dunham/vol1/home/dennig2/yevo_pipeline/pipelineoutput/01_ref_files/reference
    jobid: 7
    reason: Missing output files: /net/dunham/vol1/home/dennig2/yevo_pipeline/pipelineoutput/01_ref_files/reference
    resources: tmpdir=/tmp/293419245.1.sage-long.q


[Tue Mar 28 21:47:11 2023]
rule gatk_register:
    input: workflow/envs/src/GenomeAnalysisTK-3.7-0-gcfedb67.tar.bz2
    output: /net/dunham/vol1/home/dennig2/yevo_pipeline/pipelineoutput/05_gatk/gatk_3.7_registered.txt
    jobid: 24
    reason: Missing output files: /net/dunham/vol1/home/dennig2/yevo_pipeline/pipelineoutput/05_gatk/gatk_3.7_registered.txt
    resources: tmpdir=/tmp/293419245.1.sage-long.q

Activating conda environment: ../.snakemake/conda/82f61b1b43a83fa5851b5687321ca2bf_

[Tue Mar 28 21:47:11 2023]
rule export_run_config:
    output: /net/dunham/vol1/home/dennig2/yevo_pipeline/pipelineoutput/00_logs/00_run_config.json
    jobid: 2
    reason: Missing output files: /net/dunham/vol1/home/dennig2/yevo_pipeline/pipelineoutput/00_logs/00_run_config.json
    resources: tmpdir=/tmp/293419245.1.sage-long.q

However, the output directory does contain output from the export_run_config, list_samples, copy_fasta, and create_bwa_index rules. Why does Snakemake not recognize the output files here?

That "Missing output files" line is not an error; it is the reason why the rule is executed.

Ok, the rule is executed, but then it says the output files are missing.

I simply don't understand why there is a disconnect there, even after changing file permissions. I can go back and check if the file paths are correct again, but is there anything else that could be done?

Sorry but I'm not able to properly understand your scenario. Did your pipeline complete the dry run?

Yes, it completed the dry run. What other information do you need?

11 months ago
DdogBoss • 0

Update: the dry run did not complete, and all rules report missing output files.

I have changed the script so that it reads the reference, output directory, and ancestor file paths from a config.yml. The problem seems to start with the very first inputs.
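Since the paths now come from a config.yml, a quick sanity check on the config before any rules run can catch a wrong path early. This is only a sketch under the assumption that the config uses the `fastq_dir` and `output_dir` keys shown earlier; the helper name is made up:

```python
# Hedged sketch: fail fast if the paths named in the run config do not
# exist, before Snakemake builds the DAG. Adapt the key names to the
# actual config.yml.
import os

def check_config(config):
    missing = [key for key in ("fastq_dir", "output_dir")
               if key not in config or not config[key]]
    if missing:
        raise ValueError(f"config is missing keys: {missing}")
    if not os.path.isdir(config["fastq_dir"]):
        raise FileNotFoundError(f"fastq_dir does not exist: {config['fastq_dir']}")
    return True
```

Calling something like this at the top of the Snakefile turns a silent empty glob into a loud, immediate failure.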


Could you show the error?

Building DAG of jobs...
Job stats:
job                  count    min threads    max threads
-----------------  -------  -------------  -------------
all                      1              1              1
export_run_config        1              1              1
finish                   1              1              1
list_samples             1              1              1
total                    4              1              1


[Mon Apr  3 22:49:46 2023]
rule export_run_config:
    output: /net/dunham/vol1/home/dennig2/yevo_pipeline/data/pipelineoutput/00_logs/00_run_config.json
    jobid: 2
    reason: Missing output files: /net/dunham/vol1/home/dennig2/yevo_pipeline/data/pipelineoutput/00_logs/00_run_config.json
    resources: tmpdir=/tmp/293542382.1.sage-long.q

[Mon Apr  3 22:49:46 2023]
rule list_samples:
    output: /net/dunham/vol1/home/dennig2/yevo_pipeline/data/pipelineoutput/00_logs/00_sample_list.txt
    jobid: 3
    reason: Missing output files: /net/dunham/vol1/home/dennig2/yevo_pipeline/data/pipelineoutput/00_logs/00_sample_list.txt
    resources: tmpdir=/tmp/293542382.1.sage-long.q


[Mon Apr  3 22:49:46 2023]
rule finish:
    input: /net/dunham/vol1/home/dennig2/yevo_pipeline/data/pipelineoutput/00_logs/00_run_config.json, /net/dunham/vol1/home/dennig2/yevo_pipeline/data/pipelineoutput/00_logs/00_sample_list.txt
    output: /net/dunham/vol1/home/dennig2/yevo_pipeline/data/pipelineoutput/DONE.txt
    jobid: 1
    reason: Missing output files: /net/dunham/vol1/home/dennig2/yevo_pipeline/data/pipelineoutput/DONE.txt; Input files updated by another job: /net/dunham/vol1/home/dennig2/yevo_pipeline/data/pipelineoutput/00_logs/00_sample_list.txt, /net/dunham/vol1/home/dennig2/yevo_pipeline/data/pipelineoutput/00_logs/00_run_config.json
    resources: tmpdir=/tmp/293542382.1.sage-long.q


[Mon Apr  3 22:49:46 2023]
localrule all:
    input: /net/dunham/vol1/home/dennig2/yevo_pipeline/data/pipelineoutput/DONE.txt
    jobid: 0
    reason: Input files updated by another job: /net/dunham/vol1/home/dennig2/yevo_pipeline/data/pipelineoutput/DONE.txt
    resources: tmpdir=/tmp/293542382.1.sage-long.q

Job stats:
job                  count    min threads    max threads
-----------------  -------  -------------  -------------
all                      1              1              1
export_run_config        1              1              1
finish                   1              1              1
list_samples             1              1              1
total                    4              1              1

Reasons:
    (check individual jobs above for details)
    input files updated by another job:
        all, finish
    missing output files:
        export_run_config, finish, list_samples

This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.

DONE!

This was run with "--printshellcmds" in the shell script:

Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 4
Rules claiming more threads will be scaled down.
Job stats:
job                  count    min threads    max threads
-----------------  -------  -------------  -------------
all                      1              1              1
export_run_config        1              1              1
finish                   1              1              1
list_samples             1              1              1
total                    4              1              1

Select jobs to execute...

[Sun Apr  9 22:51:39 2023]
rule export_run_config:
    output: /net/dunham/vol1/home/dennig2/yevo_pipeline/data/pipelineoutput/00_logs/00_run_config.json
    jobid: 2
    reason: Missing output files: /net/dunham/vol1/home/dennig2/yevo_pipeline/data/pipelineoutput/00_logs/00_run_config.json
    resources: tmpdir=/tmp/293703941.1.sage-long.q

[Sun Apr  9 22:51:39 2023]
rule list_samples:
    output: /net/dunham/vol1/home/dennig2/yevo_pipeline/data/pipelineoutput/00_logs/00_sample_list.txt
    jobid: 3
    reason: Missing output files: /net/dunham/vol1/home/dennig2/yevo_pipeline/data/pipelineoutput/00_logs/00_sample_list.txt
    resources: tmpdir=/tmp/293703941.1.sage-long.q

echo -e '' > /net/dunham/vol1/home/dennig2/yevo_pipeline/data/pipelineoutput/00_logs/00_sample_list.txt
[Sun Apr  9 22:51:43 2023]
Finished job 3.
1 of 4 steps (25%) done
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 4
Rules claiming more threads will be scaled down.
Select jobs to execute...
[Sun Apr  9 22:51:45 2023]
Finished job 2.
2 of 4 steps (50%) done
Select jobs to execute...

[Sun Apr  9 22:51:45 2023]
rule finish:
    input: /net/dunham/vol1/home/dennig2/yevo_pipeline/data/pipelineoutput/00_logs/00_run_config.json, /net/dunham/vol1/home/dennig2/yevo_pipeline/data/pipelineoutput/00_logs/00_sample_list.txt
    output: /net/dunham/vol1/home/dennig2/yevo_pipeline/data/pipelineoutput/DONE.txt
    jobid: 1
    reason: Missing output files: /net/dunham/vol1/home/dennig2/yevo_pipeline/data/pipelineoutput/DONE.txt; Input files updated by another job: /net/dunham/vol1/home/dennig2/yevo_pipeline/data/pipelineoutput/00_logs/00_sample_list.txt, /net/dunham/vol1/home/dennig2/yevo_pipeline/data/pipelineoutput/00_logs/00_run_config.json
    resources: tmpdir=/tmp/293703941.1.sage-long.q

touch /net/dunham/vol1/home/dennig2/yevo_pipeline/data/pipelineoutput/DONE.txt
[Sun Apr  9 22:51:45 2023]
[Sun Apr  9 22:51:45 2023]
Finished job 1.
3 of 4 steps (75%) done
Select jobs to execute...

[Sun Apr  9 22:51:45 2023]
localrule all:
    input: /net/dunham/vol1/home/dennig2/yevo_pipeline/data/pipelineoutput/DONE.txt
    jobid: 0
    reason: Input files updated by another job: /net/dunham/vol1/home/dennig2/yevo_pipeline/data/pipelineoutput/DONE.txt
    resources: tmpdir=/tmp/293703941.1.sage-long.q

[Sun Apr  9 22:51:45 2023]
Finished job 0.
4 of 4 steps (100%) done
Complete log: .snakemake/log/2023-04-09T225125.847513.snakemake.log

Made some progress by changing the memory allocated per job:

Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 4
Rules claiming more threads will be scaled down.
Job stats:
job                  count    min threads    max threads
-----------------  -------  -------------  -------------
all                      1              1              1
export_run_config        1              1              1
finish                   1              1              1
list_samples             1              1              1
total                    4              1              1

Select jobs to execute...

[Mon Apr 10 00:58:57 2023]
rule export_run_config:
    output: /net/dunham/vol1/home/dennig2/yevo_pipeline/data/pipelineoutput/00_logs/00_run_config.json
    jobid: 2
    reason: Missing output files: /net/dunham/vol1/home/dennig2/yevo_pipeline/data/pipelineoutput/00_logs/00_run_config.json
    resources: tmpdir=/tmp/293705133.1.sage-long.q, mem_mb=2000, mem_mib=1908

[Mon Apr 10 00:58:57 2023]
rule list_samples:
    output: /net/dunham/vol1/home/dennig2/yevo_pipeline/data/pipelineoutput/00_logs/00_sample_list.txt
    jobid: 3
    reason: Missing output files: /net/dunham/vol1/home/dennig2/yevo_pipeline/data/pipelineoutput/00_logs/00_sample_list.txt
    resources: tmpdir=/tmp/293705133.1.sage-long.q, mem_mb=2000, mem_mib=1908

[Mon Apr 10 00:59:01 2023]
Finished job 3.
1 of 4 steps (25%) done
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 4
Rules claiming more threads will be scaled down.
Provided resources: mem_mb=2000, mem_mib=1908
Select jobs to execute...
[Mon Apr 10 00:59:06 2023]
Finished job 2.
2 of 4 steps (50%) done
Select jobs to execute...

[Mon Apr 10 00:59:06 2023]
rule finish:
    input: /net/dunham/vol1/home/dennig2/yevo_pipeline/data/pipelineoutput/00_logs/00_run_config.json, /net/dunham/vol1/home/dennig2/yevo_pipeline/data/pipelineoutput/00_logs/00_sample_list.txt
    output: /net/dunham/vol1/home/dennig2/yevo_pipeline/data/pipelineoutput/DONE.txt
    jobid: 1
    reason: Missing output files: /net/dunham/vol1/home/dennig2/yevo_pipeline/data/pipelineoutput/DONE.txt; Input files updated by another job: /net/dunham/vol1/home/dennig2/yevo_pipeline/data/pipelineoutput/00_logs/00_run_config.json, /net/dunham/vol1/home/dennig2/yevo_pipeline/data/pipelineoutput/00_logs/00_sample_list.txt
    resources: tmpdir=/tmp/293705133.1.sage-long.q, mem_mb=2000, mem_mib=1908

[Mon Apr 10 00:59:06 2023]
Finished job 1.
3 of 4 steps (75%) done
Select jobs to execute...

[Mon Apr 10 00:59:06 2023]
localrule all:
    input: /net/dunham/vol1/home/dennig2/yevo_pipeline/data/pipelineoutput/DONE.txt
    jobid: 0
    reason: Input files updated by another job: /net/dunham/vol1/home/dennig2/yevo_pipeline/data/pipelineoutput/DONE.txt
    resources: tmpdir=/tmp/293705133.1.sage-long.q, mem_mb=2000, mem_mib=1908

[Mon Apr 10 00:59:06 2023]
Finished job 0.
4 of 4 steps (100%) done
Complete log: .snakemake/log/2023-04-10T005851.653898.snakemake.log

Just to be clear, there is no error in anything you have shown. Your dry runs finish correctly. The "Missing output files" statement is just the reason Snakemake has to run that rule: if the files were already present and up to date, there would be no point in creating them.
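The decision described above roughly reduces to a timestamp comparison. This is a simplified illustration of the idea, not Snakemake's actual implementation (which also tracks code, parameter, and input changes):

```python
# Simplified version of the scheduling decision: a job runs if any output
# is missing, or if any input is newer than any output. The return value
# mimics the "reason:" lines in the logs above.
import os

def needs_run(inputs, outputs):
    if any(not os.path.exists(out) for out in outputs):
        return "missing output files"
    newest_input = max((os.path.getmtime(p) for p in inputs), default=0.0)
    oldest_output = min(os.path.getmtime(p) for p in outputs)
    if newest_input > oldest_output:
        return "input files updated"
    return None  # up to date: nothing to be done
```

So a "Missing output files" reason before a job runs is expected; it would only indicate a problem if the same reason reappeared on the next run, after the job reported success.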


Gotcha, thanks.
