Question

Generalizing snakemake MACS2 code - WildcardError Wildcards in input files cannot be determined from output files: 'sample'

0

Entering edit mode

4.0 years ago

msimmer92 ▴ 300

I am practicing Snakemake. Wanted to make a MACS2 callpeak code, and ran into problems when I wanted to add a level of complexity (from making a simple code for the basic, narrow peak case, to trying to make one that is flexible to both broad and narrow scenarios). First, I did the following (basic narrow peak case, which works fine). (see first code below)

The problem is: when I want to add an if statement that changes the "narrowPeak" file for a "broadPeak" one if a configfile parameter says so ("peakcaller" is set to either narrow or broad), I run into the error "WildcardError in line (the one of the rule all). Wildcards in input files cannot be determined from output files: 'sample' ". Here I leave the second code: (see second code below).

I defined pkFile and pkFileAll because, as in the first code, the syntax is a bit different (in rule all you have to expand and put sample=samples, whereas in the rest you can just write {sample} without expanding). I thought that with defining such variables I would make it, but clearly I'm messing up somewhere. Any thoughts? I cannot seem to spot it. Thanks!

First code:

import pandas as pd
import numpy as np

# Read in config file parameters
configfile: 'config.yaml'
sampleFile = config['samples'] # 1st column: ChIP-treatment file basename; 2nd: input files with path

data = pd.read_csv(sampleFile, header=0, names=['sample', 'inputs']).set_index('sample', drop=False)
samples = data['sample'].unique().tolist() # sample names
inplist= data['inputs'].tolist() # the ChIP-input files

# Define output folders
testDir = "/data/Test" #dir with test data
logDir = "/myexperiment/Logs"
pcDir = "/myexperiment/Peaks"

# Target rule
rule all:
    input:
        expand(f'{pcDir}' + '/{sample}_peaks.xls', sample=samples),
        expand(f'{pcDir}' + '/{sample}_peaks.narrowPeak',sample=samples) 

# Peak calling with MACS2
rule macs2:
    input: trt = f'{testDir}'+'/{sample}.bam', inp = expand('{inpfile}',inpfile=inplist)
    output: xls = f'{pcDir}' + '/{sample}_peaks.xls' ,pk = f'{pcDir}' + '/{sample}_peaks.narrowPeak'
    params: nm = '{sample}' ,
        pcout = pcDir , 
        gs = "hs" ,
        th = 0.05
    threads: 40
    conda: "macs2.yaml"
    log: f'{logDir}' + '/macs2_{sample}.log'
    shell: "macs2 callpeak -t {input.trt} -c {input.inp} -f BAM -g {params.gs} -q {params.th} --outdir {params.pcout} -n {params.nm} 2> {log}"

Second code:

import pandas as pd
import numpy as np

# Read in config file parameters
configfile: 'config.yaml'
sampleFile = config['samples'] # 1st column: ChIP-treatment file basename; 2nd: input files with path
peakCaller = config['peakcaller'] 

data = pd.read_csv(sampleFile, header=0, names=['sample', 'inputs']).set_index('sample', drop=False)
samples = data['sample'].unique().tolist() # sample names   
inplist= data['inputs'].tolist() # the ChIP-input files

# Define output folders
testDir = "/data/Test" #dir with test data
logDir =  "/myexperiment/Logs"
pcDir =  "/myexperiment/Peaks"

# Define peak calling files
if( peakCaller == "narrow" ):
    pkFile = "'" + pcDir + "/{sample}_peaks.narrowPeak'"
    pkFileAll = "expand('" + pcDir + "/{sample}_peaks.narrowPeak',sample=samples)"
if( peakCaller == "broad" ):
    pkFile = "'" + pcDir + "/{sample}_peaks.broadPeak'"
    pkFileAll = "expand('" + pcDir + "/{sample}_peaks.broadPeak',sample=samples)"

# Target rule
rule all:
    input:
        expand(f'{pcDir}' + '/{sample}_peaks.xls', sample=samples),
        f'{pkFileAll}'

# Peak calling with MACS2
rule macs2:
    input: trt = f'{testDir}'+'/{sample}.bam', inp = expand('{inpfile}',inpfile=inplist)
    output: xls = f'{pcDir}' + '/{sample}_peaks.xls' ,pk = f'{pkFile}'
    params: nm = '{sample}' ,
        pcout = pcDir , 
        gs = "hs" ,
        th = 0.05
    threads: 40
    conda: "macs2.yaml"
    log: f'{logDir}' + '/macs2_{sample}.log'
    shell: "macs2 callpeak -t {input.trt} -c {input.inp} -f BAM -g {params.gs} -q {params.th} --outdir {params.pcout} -n {params.nm} 2> {log}"

snakemake • 1.6k views

ADD COMMENT • link updated 3.9 years ago by Jeremy Leipzig 22k • written 4.0 years ago by msimmer92 ▴ 300

score 0 · Answer 1 · 2020-05-20

First, the "{sample}" wildcards don't exist outside of rules. You can pass wildcards to functions (def myfunction(wildcardsGoHereReadTheDocs)) but that is for functions where the input depends on the wildcards.

Second, you're making this more complicated than it has to be. peakcaller does exist within rules so just use it there - pk = pcDir + "/{sample}_peaks."+peakCaller+"Peak'"