Question

Snakemake wildcards overwriting one another

0

Entering edit mode

3.3 years ago

Ivan ▴ 60

I'm including a snipped of my code.

The following block generates two groups, I'm setting the the number of samples to 1 for easier debugging. The samples in groupOne would be [FOO_A] and samples in groupTwo would be [BAR_A]

num_of_samples = 1
groupOne = ["FOO_" + chr(alpha+97).capitalize() for alpha in range(0,num_of_samples)]
groupTwo = ["BAR_" + chr(alpha+97).capitalize() for alpha in range(0,num_of_samples)]

Below is my first rule (rule all) and two rules generating my reads, and two rules generating bowtie2 alignment (the index is prebuilt and not included here)

rule all:
    input:
        expand("aligned/ref_{chrom}.{gTwoSam}.bam", chrom=chromosomeNames, gTwoSam = groupTwo),
        expand("aligned/ref_{chrom}.{gOneSam}.bam", chrom=chromosomeNames, gOneSam = groupOne),

rule artillumina_simulate_ref_group1:
    input:
        "reference/{chrom}.fasta"
    output:
        "simulated/{chrom}.{gOneSam}.1.fq",
        "simulated/{chrom}.{gOneSam}.2.fq"
    params:
        name = "simulated/{chrom}.{gOneSam}."
    shell:
        """
        art_illumina -ss HS20 -f 9 -l 100 -p -m 360 -s 80 -na -i {input} -o {params.name}
        """

rule artillumina_simulate_ref_group2:
    input:
        "reference/{chrom}.fasta"
    output:
        "simulated/ref_{chrom}.{gTwoSam}.1.fq",
        "simulated/ref_{chrom}.{gTwoSam}.2.fq"
    params:
        name = "simulated/ref_{chrom}.{gTwoSam}."
    shell:
        """
        art_illumina -ss HS20 -f 9 -l 100 -p -m 360 -s 80 -na -i {input} -o {params.name}
        """

rule botwie2_align_ref_group1:
    input:
        "simulated/ref_{chrom}.{gOneSam}.1.fq",
        "simulated/ref_{chrom}.{gOneSam}.2.fq",
        "reference/{chrom}.rev.1.bt2", "reference/{chrom}.rev.2.bt2",
        "reference/{chrom}.1.bt2", "reference/{chrom}.2.bt2",
        "reference/{chrom}.3.bt2", "reference/{chrom}.4.bt2"
    params:
        btindex = "reference/{chrom}"
    output:
        "aligned/ref_{chrom}.{gOneSam}.bam"
    shell:
        """
        bowtie2 -x {params.btindex} -1 {input[0]} -2 {input[1]} | samtools view -bS > {output}
        """

rule botwie2_align_ref_group2:
    input:
        "simulated/ref_{chrom}.{gTwoSam}.1.fq",
        "simulated/ref_{chrom}.{gTwoSam}.2.fq",
        "reference/{chrom}.rev.1.bt2", "reference/{chrom}.rev.2.bt2",
        "reference/{chrom}.1.bt2", "reference/{chrom}.2.bt2",
        "reference/{chrom}.3.bt2", "reference/{chrom}.4.bt2"
    params:
        btindex = "reference/{chrom}"
    output:
        "aligned/ref_{chrom}.{gTwoSam}.bam"
    shell:
        """
        bowtie2 -x {params.btindex} -1 {input[0]} -2 {input[1]} | samtools view -bS > {output}
        """

Running the snakemake gives me the following error:

Rules botwie2_align_ref_group1 and botwie2_align_ref_group2 are ambiguous for the file aligned/ref_CM008309.1.BAR_A.bam.
Consider starting rule output with a unique prefix, constrain your wildcards, or use the ruleorder directive.
Wildcards:
        botwie2_align_ref_group1: chrom=CM008309.1,gOneSam=BAR_A
        botwie2_align_ref_group2: chrom=CM008309.1,gTwoSam=BAR_A

So it would seem that my wildcard gTwoSam overwrites the gOneSam one.

However, modifying the rule all to collect output of artillumina_simulate_ref_group like this:

rule all:
input:
    expand("simulated/ref_{chrom}.{gOneSam}.1.fq",chrom=chromosomeNames,gOneSam = groupOne),
    expand("simulated/ref_{chrom}.{gOneSam}.2.fq",chrom=chromosomeNames,gOneSam = groupOne),
    expand("simulated/ref_{chrom}.{gTwoSam}.1.fq",chrom=chromosomeNames,gTwoSam = groupTwo),
    expand("simulated/ref_{chrom}.{gTwoSam}.2.fq",chrom=chromosomeNames,gTwoSam = groupTwo),

And running it dry mode everything works.

So why are my wildcards overwriting one another.

snakemake • 1.3k views

ADD COMMENT • link updated 3.3 years ago by raphael.B ▴ 540 • written 3.3 years ago by Ivan ▴ 60

0

Entering edit mode

All I can say is, you're trying to confuse snakemake as well as humans. Your rules are almost identical; the only difference is, FASTQ-output rules don't have input files mapped to ambiguous output wildcards while the BAM-output rules do. This is why snakemake is able to resolve the former and not the latter.

Do yourself a favor and remove the duplicates. Whatever you're trying to do with this approach can be done using a much better much less ambiguous approach.

ADD REPLY • link 3.3 years ago by Ram 45k

0

Entering edit mode

Sure enough, when I remove rule bowtie2_align_ref_group2 everything works well enough - rule also generates samples specified in groupTwo. I've extended the concept to the artillumina_simulate__ref_group. But my question is why does this happen. I was under the impression that specifying wildcards upstream in the rule all constrains samples - e.g. gOneSam can only have FOO values, and gTwoSam can only have BAR values.

ADD REPLY • link 3.3 years ago by Ivan ▴ 60

1

Entering edit mode

The outputs of your simulation rules are much the same regarding snakemake. It can deduce from your rule all that first wildcards is indeed a chromosome but the second one can either be {gOneSam} or {gTwoSam}. Indeed, a wildcard is not a global value, it is purely relative to your rule. Therefore it is not because 2 wildcards have the same name in two rules that they will share the same value. See them as unknown values that will be deduced from the outputs of a downstream rule (here your rule all).

If you really do want to run it this way, I would say replacing these two wildcards by BAR_{N} and FOO_{N} with N=[chr(alpha+97).capitalize() for alpha in range(0,num_of_samples)] should constrain the value of your wildcards.

Just as a general remark, since your rules simulate and align are doing the exact same thing for your 2 groups, it would be easier to have only one rule by step, with a wildcard {group} with group=groupOne+groupTwo.

ADD REPLY • link 3.3 years ago by raphael.B ▴ 540