loading yaml file into snakemake
4.5 years ago
skbrimer ▴ 740

Hello hive mind,

I'm having issues moving a Snakemake workflow from one sample to multiple samples. I have gotten the workflow below to work for processing amplicon targets, and it worked great.

rule all:
    input:
        "data/samples/LS19-3433-21_s1_de_novo"

rule seqtk_quality_filter:
    input:
        "data/samples/LS19-3433-21_s1.IonXpress_035.2019-09-04T20_14_16Z.fastq"
    output:
        temp("data/samples/LS19-3433-21_s1.qtrim.fastq")
    shell:
        "seqtk trimfq -q 0.01 {input} > {output}"

rule seqtk_clip:
    input:
        "data/samples/LS19-3433-21_s1.qtrim.fastq"
    output:
        "data/samples/LS19-3433-21_s1.clean.fastq"
    shell:
        "seqtk trimfq -b 20 -L 350 {input} > {output}"

rule bbnorm:
    input:
        "data/samples/LS19-3433-21_s1.clean.fastq"
    output:
        "data/samples/LS19-3433-21_s1.norm.fastq"
    shell:
        "bbnorm.sh -Xmx10g in={input} out={output} target=50"

rule spades:
    input:
        "data/samples/LS19-3433-21_s1.norm.fastq"
    output:
        "data/samples/LS19-3433-21_s1_de_novo"
    shell:
        "spades.py --iontorrent --only-assembler -s {input} -k 21,33,55,77,99,127 -o {output}"

So now I'm trying to generalize it, and the documentation says I can use a config.yaml to list my samples, which I have done:

samples:
    LS19-3512-1: data/LS19-3512-1.IonXpress_060.2019-11-01T13_49_02Z.fastq
    LS19-3512-2: data/LS19-3512-2.IonXpress_061.2019-11-01T13_49_02Z.fastq
    LS19-3512-3: data/LS19-3512-3.IonXpress_062.2019-11-01T13_49_02Z.fastq
    LS19-3512-5: data/LS19-3512-5.IonXpress_063.2019-11-01T13_49_02Z.fastq
    LS19-3512-6: data/LS19-3512-6.IonXpress_064.2019-11-01T13_49_02Z.fastq
    LS19-3512-8: data/LS19-3512-8.IonXpress_085.2019-11-01T13_49_02Z.fastq
    LS19-3512-9: data/LS19-3512-9.IonXpress_086.2019-11-01T13_49_02Z.fastq
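When Snakemake reads this file via the configfile: directive, the YAML becomes a plain Python dictionary. A minimal sketch (first two samples copied from the YAML above) of what you end up with, and the key point that iterating the dict yields the sample names, not the file paths:

```python
# What `configfile: "config.yaml"` produces: a plain Python dict,
# with config["samples"] as a nested dict of name -> fastq path.
config = {
    "samples": {
        "LS19-3512-1": "data/LS19-3512-1.IonXpress_060.2019-11-01T13_49_02Z.fastq",
        "LS19-3512-2": "data/LS19-3512-2.IonXpress_061.2019-11-01T13_49_02Z.fastq",
    }
}

# Iterating a dict yields its KEYS, i.e. the sample names:
print(list(config["samples"]))  # ['LS19-3512-1', 'LS19-3512-2']

# The fastq path for a given sample is a value lookup:
print(config["samples"]["LS19-3512-2"])  # data/LS19-3512-2.IonXpress_061.2019-11-01T13_49_02Z.fastq
```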

Now I have tried to load the samples from the YAML file as the documentation shows, like so:

configfile: "config.yaml"

print("Starting amplicon analysis workflow")

rule seqtk_quality_filter:
    input:
        expand("{sample}", sample=config["samples"])
    output:
        temp("data/{sample}.qtrim.fq")
    shell:
        "seqtk trimfq -b 0.01 {input} > {output}"

and get the following error:

sean@LEN943:~/Desktop/cobb_reo_3512$ snakemake -s amplicon_snakefile -np
Starting amplicon analysis workflow
Building DAG of jobs...
WorkflowError:
Target rules may not contain wildcards. Please specify concrete files or a rule without wildcards.

I have also tried lambda wildcards: config["samples"][wildcards.sample] as the input and get the same error.

I'm sure this is an easy fix, but I'm just not understanding how to do it, since using the YAML/JSON config files will always require a wildcard/variable.

Thank you in advance for pointing me in the right direction.

Sean


For me, this question is fine here, but the developer of Snakemake prefers Stack Overflow.


Thank you for the feedback, Wouter! I will give this a try when I escape the lab today.

4.5 years ago
skbrimer ▴ 740

Here is my solution; I hope it helps other people who are just starting with Snakemake. Thank you again, Wouter, for the input and examples!

configfile: "config.yaml"

print("Starting amplicon analysis workflow")

rule all:
    input:
        expand("data/{sample}.clean.fastq", sample = config["samples"])

rule seqtk_qtrim:
    input:
        lambda wildcards: config["samples"][wildcards.sample]
    output:
        temp("data/{sample}.qtrim.fastq")
    shell:
        "seqtk trimfq -q 0.01 {input} > {output}"

rule seqtk_clip:
    input:
        "data/{sample}.qtrim.fastq"
    output:
        temp("data/{sample}.clip.fastq")
    shell:
        "seqtk trimfq -b 20 -L 350 {input} > {output}"

rule seqtk_lengthfilter:
    input:
        "data/{sample}.clip.fastq"
    output:
        "data/{sample}.clean.fastq"
    shell:
        "seqtk seq -L 50 {input} > {output}"
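The expand() call in rule all is what turns the sample names from the config into a concrete list of target files, so no wildcards are left unresolved at the top level. A minimal pure-Python sketch of what it produces here (not Snakemake's actual implementation; sample names copied from the config above):

```python
# Sketch of expand("data/{sample}.clean.fastq", sample=config["samples"]):
# iterating the samples dict yields its keys, and each key is substituted
# into the pattern, giving one concrete target file per sample.
config = {"samples": {
    "LS19-3512-1": "data/LS19-3512-1.IonXpress_060.2019-11-01T13_49_02Z.fastq",
    "LS19-3512-2": "data/LS19-3512-2.IonXpress_061.2019-11-01T13_49_02Z.fastq",
}}

targets = ["data/{sample}.clean.fastq".format(sample=s) for s in config["samples"]]
print(targets)  # ['data/LS19-3512-1.clean.fastq', 'data/LS19-3512-2.clean.fastq']
```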
4.5 years ago

I'm not sure if it's the best way, but the following works for me. I use a function to get the samples:

def get_samples(wildcards):
    return config["samples"][wildcards.sample]

My first rule takes that function as its input: just the function name, not a function call. To get the genome, it seems to work fine when I just use the config dictionary directly.

rule minimap2_align:
    input:
        fq = get_samples,
        genome = config["genome"]
    output:
        "minimap2/alignment/{sample}.bam"
    params:
        sample = "{sample}"
    threads:
        8
    log:
        "logs/minimap2/{sample}.log"
    shell:
        """
        minimap2 --MD -ax map-ont -t {threads} \
         -R "@RG\\tID:{params.sample}\\tSM:{params.sample}" \
         {input.genome} {input.fq}/*.fastq.gz | \
         samtools sort -@ {threads} -o {output} - 2> {log}
        """

That's a workflow I have hacked together, so I don't know how far it is from best practices.


So it has taken a day, but I did get it to work, and I think I understand why it works now as well. The config.yaml file gets imported as a Python dictionary, but the format of the sample entries makes it a nested dictionary. So you have to use the expand("{sample}", sample = config["samples"]) notation as the input for your target files in the rule all section, which then makes it possible to use lambda wildcards: config["samples"][wildcards.sample] in the first pre-processing step to access the nested dictionary values.
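The lambda can be sketched outside Snakemake with a stand-in for the wildcards object (SimpleNamespace here is just a hypothetical substitute for the object Snakemake actually passes to input functions):

```python
from types import SimpleNamespace

# Nested dict, as produced by `configfile: "config.yaml"` (one sample
# copied from the config above).
config = {"samples": {
    "LS19-3512-1": "data/LS19-3512-1.IonXpress_060.2019-11-01T13_49_02Z.fastq",
}}

# The input function: look up the fastq path for the current wildcard value.
input_fn = lambda wildcards: config["samples"][wildcards.sample]

# Stand-in for the wildcards object Snakemake would pass in.
wc = SimpleNamespace(sample="LS19-3512-1")
print(input_fn(wc))  # data/LS19-3512-1.IonXpress_060.2019-11-01T13_49_02Z.fastq
```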


That seems like a good solution! Thanks for sharing.
