Question

Merge several input files to single output file snakemake

0

Entering edit mode

14 months ago

samuel ▴ 240

Hi, I am a complete newbie at snakemake. I had a similar post here how to get around snakemake with different wildcards input to output rule but I thought I would ask a simplified question.

I have previous rule ( not shown but works well) that generates x4 (bam) files. The next rule takes these four files as inputs and needs to produce one file as output.

I have tried the following:

configfile: "config.yaml"

fc = config["flowcell"]
index = config["index"]
samplename = config["sn"]



rule all:
    input:
        expand(["result/gatk4/{fc}_L01_{index}_piped.bam",
        "result/gatk4/{fc}_L02_{index}_piped.bam",
        "result/gatk4/{fc}_L03_{index}_piped.bam",
        "result/gatk4/{fc}_L04_{index}_piped.bam", "result/gatk4/samplename_index{index}_markedduplicates.bam"], fc = fc,  index = index)
rule MarkDuplicates:
    input:
        L01="result/gatk4/{fc}_L01_{index}_piped.bam",
        L02="result/gatk4/{fc}_L02_{index}_piped.bam",
        L03="result/gatk4/{fc}_L03_{index}_piped.bam",
        L04="result/gatk4/{fc}_L04_{index}_piped.bam"
    output:
        bam="result/gatk4/samplename_index{index}_markedduplicates.bam",
        txt="result/gatk4/samplename_index{index}_markedduplicates_metrics.txt"
    shell:
        """
        gatk --java-options '-Xmx30G' MarkDuplicates
        I={input.L01} I={input.L02} I={input.L03} I= {input.L04}
        O={output.bam}
        M={output.txt}
        TMP_DIR=`pwd`/tmp
        2>{log}
        """

But I get the error:

Wildcards in input files cannot be determined from output files: 'fc'

I have tried looking at similar questions like here enter link description here But as a newbie I can't quite follow the code.

I was wondering if I could make a list and use this as the input to the rule like:

test = ["result/gatk4/V350134767_L01_3_piped.bam",  "result/gatk4/V350134767_L02_3_piped.bam", "result/gatk4/V350134767_L03_3_piped.bam", "result/gatk4/V350134767_L04_3_piped.bam"]

rule MarkDuplicates:
    input:
        test
    output:
        bam="result/gatk4/samplename_index{index}_markedduplicates.bam",
        txt="result/gatk4/samplename_index{index}_markedduplicates_metrics.txt"
    shell:
        """
        gatk --java-options '-Xmx30G' MarkDuplicates
        I={test[0]} I={test[1]} I={test[2]} I={test[3]}
        O={output.bam}
        M={output.txt}
        TMP_DIR=`pwd`/tmp
        2>{log}
        """

Then I could work out how to generate this list without having to hard code the sample names but I don't know if it's possible (and don't know the correct syntax)

Please help.

snakemake • 1.5k views

ADD COMMENT • link updated 14 months ago by raphael.B ▴ 520 • written 14 months ago by samuel ▴ 240

0

Entering edit mode

14 months ago

sschmeier ▴ 120

Does your outputs not just simply need to include the {fc} variable in the name?

ADD COMMENT • link 14 months ago by sschmeier ▴ 120

0

Entering edit mode

I didn't want it in the output name.....is there a way to succeed without this?

ADD REPLY • link 14 months ago by samuel ▴ 240

0

Entering edit mode

14 months ago

sschmeier ▴ 120

Input and output must have the same wildcards generally.

In your case you can circumvent using an input function that maps the {index} to the correct input filenames:

def get_inputs(wildcards):
  ind = wildcards.index
  # now you would need to get the correct fc based on ind 
  # e.g. if you have it stored in a dict you could do 
  # could be set after reading the sampletable etc...
  fc = index2fc[ind]
  results= [f"result/gatk4/{fc}_L01_{ind}_piped.bam",
            f"result/gatk4/{fc}_L02_{ind}_piped.bam",
            f"result/gatk4/{fc}_L03_{ind}_piped.bam",
            f"result/gatk4/{fc}_L04_{ind}_piped.bam"]
  return results

Then you could use it as input to MarkDuplicates.

rule MarkDuplicates:
     input:
         get_inputs
    ...

ADD COMMENT • link 14 months ago by sschmeier ▴ 120

score 1 · Accepted Answer · 2023-02-28

Something like this should work. In general it is quite weird to have a wildcard present in the input but not in the output. This occurs only if you are merging on the different values taken by this wildcard through an expand. I think your problem comes from there

rule all:
    input:
        expand( "result/gatk4/samplename_{fc}_index{index}_markedduplicates.bam", fc = fc,  index = index)


rule MarkDuplicates:
    input:
        L01="result/gatk4/{fc}_L01_{index}_piped.bam",
        L02="result/gatk4/{fc}_L02_{index}_piped.bam",
        L03="result/gatk4/{fc}_L03_{index}_piped.bam",
        L04="result/gatk4/{fc}_L04_{index}_piped.bam"
    output:
        bam="result/gatk4/samplename_{fc}_index{index}_markedduplicates.bam",
        txt="result/gatk4/samplename_{fc}_index{index}_markedduplicates_metrics.txt"
    shell:
        """
        gatk --java-options '-Xmx30G' MarkDuplicates
        I={input.L01} I={input.L02} I={input.L03} I= {input.L04}
        O={output.bam}
        M={output.txt}
        TMP_DIR=`pwd`/tmp
        2>{log}
        """

To answer your second question (if I understood it well), a nicer way to do this would be something like this (not tested)

rule MarkDuplicates:
    input:
        expand("result/gatk4/{{fc}}_L{id}_{{index}}_piped.bam", id=['01','02',...])
    output:
        bam="result/gatk4/samplename_{fc}_index{index}_markedduplicates.bam",
        txt="result/gatk4/samplename_{fc}_index{index}_markedduplicates_metrics.txt"
    params:
        fmt=lambda wildcards,input: ' -I='.join(input)
    shell:
        """
        gatk --java-options '-Xmx30G' MarkDuplicates
        I={params.fmt}
        O={output.bam}
        M={output.txt}
        TMP_DIR=`pwd`/tmp
        2>{log}
        """