Question

snakemake how to get or create wildcards from newly generated files

1

Entering edit mode

2.6 years ago

gb ★ 2.2k

Hello all, I am very lost with snakemake and maybe someone here could help. Below a working example where I am at at this point but it need to be in a different way. In checkpoint split_to_fasta a script will create an unknown number of fasta files that need to be used as seperate blast databases. The fasta files could be pairs but this is not always the case. After blasting I want to give the pairs as input to a new rule (step 4). I want to process all pairs in parallel and thus "step 4" can only start when all blasting is finished. In rule all I hardcoded the "pair" id's for now but this need to be created based on the outputfiles of the previous rules. I may(?) need a second checkpoint but dont know how to implement that.

I could make a second pipeline that "starts" with step 4 and execute them after each other but that sounds like bad practice.

from glob import glob

rule all:
    input:
        #I added this line the make a working example
        expand("output/step4/{number}_step4.txt", number=["1","2","3","4"])

checkpoint split_to_fasta:
    input:
        seqfile = "files/seqs.csv"#just an empty file in this example
    output:
        fastas=directory("output/fasta")
    shell:
        #A python script will create below files and I dont know them beforehand.
        "mkdir -p {output.fastas} ; "
        "echo test > {output.fastas}/1_LEFT_sample_name_foo.fa ;"
        "echo test >  {output.fastas}/1_RIGHT_sample_name_foo.fa ;"
        "echo test >  {output.fastas}/2_LEFT_sample_name_spam.fa ;"
        "echo test >  {output.fastas}/2_RIGHT_sample_name_bla.fa ;"
        "echo test >  {output.fastas}/3_LEFT_sample_name_egg.fa ;"
        "echo test >  {output.fastas}/4_RIGHT_sample_name_ham.fa ;"

rule makeblastdb:
    input:
        fasta = "output/fasta/{fasta}.fa"
    output:
        blastdb = "output/makeblastdb/{fasta}_blastdb.txt",
    shell:
        "cp {input.fasta} {output.blastdb}"

rule blast:
    input:
        file = rules.makeblastdb.output.blastdb
    output:
        blast = "output/blast/{fasta}_blast.txt",
    shell:
        "cp {input.file} {output.blast}"

def aggregate_input(wildcards):
    checkpoint_output = checkpoints.split_to_fasta.get(**wildcards).output[0]
    ###dont know where to use this line correctly
    ###files = [Path(x).stem.split("_")[0] for x in glob("output/step3/"+ f"*_step3.txt") ]
    return expand("output/blast/{fasta}_blast.txt", fasta=glob_wildcards(os.path.join(checkpoint_output, "{fasta}.fa")).fasta)

def get_id_files(wildcards):
    blast = glob("output/blast/"+ f"{wildcards.number}*_blast.txt")
    return sorted(blast)

rule step4:
    input:
        step3files = aggregate_input,
        idfiles = get_id_files
    output:
        step4 = "output/step4/{number}_step4.txt",
    run:
        shell("cat {input.idfiles} > {output.step4}")
        #input would be something like below (both sets need to run in parallel):
        #[2_LEFT_sample_name_spam.fa, 2_RIGHT_sample_name_bla.fa]
        #[3_LEFT_sample_name_egg.fa]
        #With some python code I can determine the left and right file, if for example RIGHT is missing I just give an empty string to the script.
        #shell(python script.py -left leftfile -right rightfile -o {output.step4})

For whoever runs into this same isseu I got a solution here: https://stackoverflow.com/questions/70543163/snakemake-checkpoint-and-create-a-expand-list-wildcards-based-on-created-output

snakemake • 2.4k views

ADD COMMENT • link 2.6 years ago by gb ★ 2.2k

1

Entering edit mode

Can you use a function to autodetect the resulting files for your expansion statement? Since snakeamake is python, you can define functions in your Snakefile, and your comment: "#With some python code I can determine the left and right file..." indicates there's some logic or priors you're aware of.

ADD REPLY • link 2.6 years ago by seidel 11k

0

Entering edit mode

Thanks for the reply, I do understand your point and you are right. I partly do it this way to learn snakemake better but realise now it is confusing for others. I got a solution on stackoverflow.

ADD REPLY • link 2.6 years ago by gb ★ 2.2k

1

Entering edit mode

I have a hard time understanding your DAG logic. Assuming contents in seqs.csv are not known until run time, you can try the following and go from there.

import os, csv

def aggregate_blast_outputs(wildcards):
    with open(checkpoints.determine_blast_outputs.get(**wildcards).output.blast_outputs) as fout:
        return [line.rstrip() for line in fout]

rule all:
    input:
        # this will trigger `makeblastdb` and `blast`
        step3files = aggregate_blast_outputs

rule create_arbitrarily_seqs:
    # i assume this is what `seqs.csv` is
    output:
        seqfile = touch('files/seqs.csv')
    run:
        with open(output.seqfile, 'w') as fout:
            writer = csv.writer(fout)
            writer.writerow(['1', 'LEFT', 'sample_name', 'foo'])
            writer.writerow(['1', 'RIGHT', 'sample_name', 'foo'])
            writer.writerow(['2', 'LEFT', 'sample_name', 'spam'])

checkpoint determine_blast_outputs:
    # i think this is what you want
    # use checkpoint to read `seqs.csv` and turn it into a bunch of blast outputs
    # checkpoint is needed here only if the workflow doesn't know the contents of `seq.csv`
    # until run time. if the contents are available at code time, using checkpoint isn't necessary.
    # it only makes the dag logic significantly harder to read
    input:
        seqfile = rules.create_arbitrarily_seqs.output.seqfile
    output:
        blast_outputs = 'files/blast_outputs.txt'
    run:
        with open(input.seqfile) as fin, open(output.blast_outputs, 'w') as fout:
            reader = csv.reader(fin)
            writer = csv.writer(fout)
            for data in reader:
                sample = '_'.join(data)
                writer.writerow([f'output/blast/{sample}_blast.txt'])

rule makeblastdb:
    input:
        fasta = 'output/fasta/{fasta}.fa'
    output:
        blastdb = touch('output/makeblastdb/{fasta}_blastdb.txt')

rule blast:
    input:
        blastdb = rules.makeblastdb.output.blastdb
    output:
        blast = touch('output/blast/{fasta}_blast.txt')

rule step4:
    # i don't know what you're trying to accomplish with `{number}`

rule _:
    # touch input files to make this example runnable
    output:
        touch('output/fasta/{fasta}.fa')

ADD REPLY • link 2.6 years ago by Eric Lim ★ 2.1k

1

Entering edit mode

Thank you for the effort, I will look at the code. It is a really different approach and helps me learning the possibilities of snakemake. I dont think it really solves my isseu but I also do realise my question is not so clear. My question is solved here: https://stackoverflow.com/questions/70543163/snakemake-checkpoint-and-create-a-expand-list-wildcards-based-on-created-output the second is example in the initial question is the working solution.

ADD REPLY • link 2.6 years ago by gb ★ 2.2k