Snakemake rule that runs an assessment once after all previous rules complete
7 weeks ago
Fadlilah ▴ 10

Hello everyone,

I want to run the rule assessment once, after the classify_sample rule is done. The program in the assessment rule only takes the folder of classify_sample results. However, when I use the output of classify_sample as input for the assessment rule, it naturally runs many times, once per input (the code lines that I commented out). How else should I ensure that the assessment rule runs after classify_sample?

Based on the documentation, I tried using the touch() function to cue the assessment rule to run after classify_sample, but it gave me the following error:

SyntaxError: Not all output, log and benchmark files of rule classify_sample contain the same wildcards. This is crucial though, in order to avoid that two or more jobs write to the same file

It seems that Snakemake treats the classification.done file as an output to be produced every time the rule runs, and therefore asks for consistent wildcard usage. But I need this file to be produced only once, after the classify_sample rule finishes, and then the assessment rule should run with this file as its input.

I haven't figured out the correct way to ensure that the assessment rule runs after classify_sample. I hope someone can give me some advice.

import glob
import os

FASTA_EXT=('*.fasta.gz', '*.fna.gz', '*.fa.gz', '*.fasta', '*.fna', '*.fa')
REFERENCES_FILES=[]
for ext in FASTA_EXT:
    REFERENCES_FILES.extend(glob.glob(config["path_to_references"]+ext))

REFERENCES = [os.path.basename(ref).split('.')[0] + '.' + os.path.basename(ref).split('.')[1] for ref in REFERENCES_FILES]

FASTQ_EXT=('*.fq.gz', '*.fastq.gz', '*.fq', '*.fastq')
SAMPLE_PATH = config["path_to_samples"]
SAMPLE_FILES = []
for ext in FASTQ_EXT:
    SAMPLE_FILES.extend(glob.glob(config["path_to_samples"]+ext))

SAMPLES = [os.path.basename(file).split('.')[0] for file in SAMPLE_FILES]

rule all:
    input:
        # expand(config["path_to_assessment_result"]+'result_all.csv'),
        # expand(config["path_to_assessment_result"] + '{reference}/{sample}-k25.PNG', sample=SAMPLES, reference=REFERENCES),
        expand(config["path_to_classification_result"] + "{reference}/{sample}-k25.log", sample=SAMPLES, reference=REFERENCES)

rule assessment:
    output:
        csv=config["path_to_assessment_result"]+'result_all.csv',
        # visual=config["path_to_assessment_result"] + '{reference}/{sample}-k{k}.PNG',
    input:
        # log=glob.glob(config["path_to_classification_result"] + "**/{sample}-k{k}.log"),
        # config["path_to_classification_result"] + "{reference}/{sample}-k{k}.log"
        "classification.done"
    params:
        path=config["path_to_assessment_result"],
        path_logs=config["path_to_classification_result"],
        threshold=config["threshold"],
    shell:
        'python3 assessment.py -logdir {params.path_logs} -threshold {params.threshold} -out {params.path}'



input_sample = []
for ext in FASTQ_EXT:
    input_sample.extend(glob.glob(SAMPLE_PATH + "*" + ext))

rule classify_sample:
    output:
        host=config["path_to_classification_result"] + "{reference}/{sample}-k{k}-host.fq.gz",
        # other output files: -graft, -both, -neither, -ambiguous
        log=config["path_to_classification_result"] + "{reference}/{sample}-k{k}.log",
        cue_file=touch("classification.done")
    input:
        index_info=config["path_to_index_result"] + "xengsort-{reference}-k{k}.info",
        index_hash=config["path_to_index_result"] + "xengsort-{reference}-k{k}.hash",
        in1=input_sample
    log:
        config["path_to_classification_result"] + "{reference}/{sample}-k{k}.log",
    benchmark:
        config["path_to_classification_result"] + "{reference}/{sample}-k{k}.benchmark"
    params:
        index=lambda wc: f"{DIRECTORY_OUTPUT}index_result/xengsort-{wc.reference}-k{wc.k}",
        outprefix=lambda wc: f"{DIRECTORY_OUTPUT}classification_result/{wc.reference}/{wc.sample}-k{wc.k}",
        chunksize=16.0,  # MB per chunk per thread
        #numactl="numactl -N 1 -m 1 ",
        prefetch=1,
        debug="-DD",  # debug level 2
    threads: 40
    shell:
        " xengsort {params.debug} classify --out {params.outprefix} --index {params.index} "
        " --threads {threads} --fastq {input.in1} "
        " --chunksize {params.chunksize} --prefetchlevel {params.prefetch} &> {log}"

Thank you.

Best,
Fadlilah


Your classify_sample rule is just an implicit wildcard rule. How is Snakemake supposed to know which files it should produce if you don't specify an explicit output like mypath/hg38/sample1-k2-host.fq.gz?


It's actually a chunk of the whole code; I have the wildcards specified in rule all:

rule all:
    input:
        expand(config["path_to_classification_result"] + "{reference}/{sample}-k25-host.fq.gz", sample=SAMPLES, reference=REFERENCES)

I got the SAMPLES and REFERENCES lists from glob.glob().


I see. So classification.done will be touched on every iteration of that rule, which isn't what you want. Put your touch command in rule all.
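Another option that avoids the sentinel file entirely (a sketch, assuming the same config keys, SAMPLES/REFERENCES lists, and k=25 as in the question) is to give rule assessment the full expanded list of classify_sample logs as its input, so the dependency itself forces it to run exactly once, after every classification job:

```python
# Hypothetical sketch: aggregate all classification logs as direct inputs,
# so 'assessment' has no wildcards and runs a single time at the end.
rule assessment:
    input:
        logs=expand(
            config["path_to_classification_result"] + "{reference}/{sample}-k25.log",
            sample=SAMPLES, reference=REFERENCES,
        )
    output:
        csv=config["path_to_assessment_result"] + "result_all.csv"
    params:
        path=config["path_to_assessment_result"],
        path_logs=config["path_to_classification_result"],
        threshold=config["threshold"],
    shell:
        "python3 assessment.py -logdir {params.path_logs} "
        "-threshold {params.threshold} -out {params.path}"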


Thank you for the advice. But I don't understand yet how that is supposed to ensure the assessment rule runs after classify_sample.

rule all_samples:
    input:
        expand(config["path_to_classification_result"] + "{reference}/{sample}-k25.log", sample=SAMPLES, reference=REFERENCES)
    output:
        touch("classification.done")

Your sentinel file should work; rule assessment is really your final target.
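For completeness, a sketch of how the pieces can fit together (assuming the same config keys and k=25 as above; the touch() would be removed from classify_sample's output so that rule keeps consistent wildcards):

```python
# Final target: request the assessment result.
rule all:
    input:
        config["path_to_assessment_result"] + "result_all.csv"

# Aggregator: depends on every classify_sample log, then touches the sentinel once.
rule all_samples:
    input:
        expand(config["path_to_classification_result"] + "{reference}/{sample}-k25.log",
               sample=SAMPLES, reference=REFERENCES)
    output:
        touch("classification.done")

# Runs a single time, after the sentinel exists.
rule assessment:
    input:
        "classification.done"
    output:
        csv=config["path_to_assessment_result"] + "result_all.csv"
    params:
        path=config["path_to_assessment_result"],
        path_logs=config["path_to_classification_result"],
        threshold=config["threshold"],
    shell:
        "python3 assessment.py -logdir {params.path_logs} "
        "-threshold {params.threshold} -out {params.path}"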


It worked as I intended. Thank you very much!
