4 days ago
Peter Chung
I am new to Snakemake. I have around 1000 samples, and I tried to use a config file and a Snakemake script to run VerifyBamID on them in parallel, but I get an error that I don't understand.
config.yaml
samples:
120001:
vcf: "/sample/120001.vcf.gz"
bam: "/sample/120001/120001.recal.bam"
bai: "/sample/120001/120001.recal.bam.bai"
120002:
vcf: "/sample/120002.vcf.gz"
bam: "/sample/120002/120002recal.bam"
bai: "/sample/120002/120002.recal.bam.bai"
120004:
vcf: "/sample/120004.vcf.gz"
bam: "/sample/120004/120004.recal.bam"
bai: "/sample/120004/120004.recal.bam.bai"
snakemake script:
import os
import yaml

# Load the configuration file
config = yaml.safe_load(open("config.yaml"))
OUTPUT_DIR = "/output/"

# Rule to specify the final output needed for the workflow completion
rule all:
    input:
        expand(os.path.join(OUTPUT_DIR, "{sample}"), sample=config["samples"].keys())

# Rule to run VerifyBamID
rule verifyBAMID:
    input:
        bam=lambda wildcards: config["samples"][wildcards.sample]['bam'],
        bai=lambda wildcards: config["samples"][wildcards.sample]['bai'],
        vcf=lambda wildcards: config["samples"][wildcards.sample]['vcf'],
        id=lambda wildcards: config["samples"][wildcards.sample]
    output:
        directory(os.path.join(OUTPUT_DIR, "{sample}"))
    shell:
        """
        mkdir -p {output} && \ # Create the output directory if it doesn't exist
        VerifyBamID \
        --bam {input.bam} \
        --vcf {input.vcf} \
        --smID {input.id} \
        --out {output}/{wildcards.sample} \
        --best 2>/dev/null
        """
When I do a dry run, I get:
Building DAG of jobs...
InputFunctionException in rule verifyBAMID in file /bin/ConfigVerifyBamID.smk, line 15:
Error:
KeyError: 'sample/120004.vcf.gz'
Wildcards:
sample=sample/120004.vcf.gz
Traceback:
File "/bin/ConfigVerifyBamID.smk", line 17, in <lambda>
Can anyone advise? Thanks.
It looks like the input file paths are somehow being used in place of the sample names in your all rule, though I can't quite see how, given what you have. (Is that definitely the correct whitespace in your config.yaml file? I can't even get it to parse with the decreasing indentation for each sample.) But even if that's fixed, you'll run into another problem: config["samples"][wildcards.sample] looks up the key as a string, but the YAML parser reads those sample names as integers, so you'll get errors like KeyError: '120001' instead.

(Personally, I'd simplify the whole thing and infer the input paths directly in your verifyBAMID rule, so you wouldn't need input functions or references into the config structure. If you want a simple working example to build on, you could keep the sample names in a list, pass that to the expand call, and worry about adding a full configuration file later, if you even need one.)
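To make the key-type problem concrete, here is a rough plain-Python sketch (not a Snakefile). It hard-codes a dict that mimics what a YAML loader returns for unquoted keys like 120001, shows the string lookup failing, and then converts the keys to strings once. The infer_paths helper is a hypothetical name of mine, and it assumes every sample follows the directory layout shown in the question, which may not hold for all of your files.

```python
# Mimics the parsed config from the question: unquoted YAML keys such as
# 120001 come back as integers, not strings. (Hard-coded here so the
# sketch runs without PyYAML.)
config = {
    "samples": {
        120001: {"bam": "/sample/120001/120001.recal.bam"},
        120004: {"bam": "/sample/120004/120004.recal.bam"},
    }
}

# Snakemake wildcards are always strings, so this is the kind of lookup
# an input function ends up doing.
wildcard_sample = "120001"
try:
    config["samples"][wildcard_sample]
    lookup_failed = False
except KeyError:
    lookup_failed = True  # the string "120001" is not the integer 120001

# One possible fix: convert every sample key to a string once, up front.
samples = {str(name): paths for name, paths in config["samples"].items()}
bam_path = samples[wildcard_sample]["bam"]


def infer_paths(sample):
    """Hypothetical helper: build the paths from the sample name alone.

    Assumes every sample follows the layout seen in the question.
    """
    bam = f"/sample/{sample}/{sample}.recal.bam"
    return {"vcf": f"/sample/{sample}.vcf.gz", "bam": bam, "bai": bam + ".bai"}


print(lookup_failed)
print(bam_path)
print(infer_paths("120004"))
```

Quoting the keys in config.yaml (for example "120001":) should also make the loader return strings. If the paths really are this regular, the same patterns could be written directly as input strings with a {sample} wildcard in the rule, with the sample names kept in a plain list for expand.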