loading yaml file into snakemake
4.5 years ago
skbrimer ▴ 740

Hello hive mind,

I'm having issues moving a Snakemake workflow from one sample to multiple samples. I have gotten the workflow below to work for processing amplicon targets, and it worked great.

rule all:
    input:
        "data/samples/LS19-3433-21_s1_de_novo"

rule seqtk_quality_filter:
    input:
        "data/samples/LS19-3433-21_s1.IonXpress_035.2019-09-04T20_14_16Z.fastq"
    output:
        temp("data/samples/LS19-3433-21_s1.qtrim.fastq")
    shell:
        "seqtk trimfq -q 0.01 {input} > {output}"

rule seqtk_clip:
    input:
        "data/samples/LS19-3433-21_s1.qtrim.fastq"
    output:
        "data/samples/LS19-3433-21_s1.clean.fastq"
    shell:
        "seqtk trimfq -b 20 -L 350 {input} > {output}"

rule bbnorm:
    input:
        "data/samples/LS19-3433-21_s1.clean.fastq"
    output:
        "data/samples/LS19-3433-21_s1.norm.fastq"
    shell:
        "bbnorm.sh -Xmx10g in={input} out={output} target=50"

rule spades:
    input:
        "data/samples/LS19-3433-21_s1.norm.fastq"
    output:
        "data/samples/LS19-3433-21_s1_de_novo"
    shell:
        "spades.py --iontorrent --only-assembler -s {input} -k 21,33,55,77,99,127 -o {output}"

So now I'm trying to generalize it, and the documentation says I can use a config.yaml to list my samples, which I have done:

samples:
    LS19-3512-1: data/LS19-3512-1.IonXpress_060.2019-11-01T13_49_02Z.fastq
    LS19-3512-2: data/LS19-3512-2.IonXpress_061.2019-11-01T13_49_02Z.fastq
    LS19-3512-3: data/LS19-3512-3.IonXpress_062.2019-11-01T13_49_02Z.fastq
    LS19-3512-5: data/LS19-3512-5.IonXpress_063.2019-11-01T13_49_02Z.fastq
    LS19-3512-6: data/LS19-3512-6.IonXpress_064.2019-11-01T13_49_02Z.fastq
    LS19-3512-8: data/LS19-3512-8.IonXpress_085.2019-11-01T13_49_02Z.fastq
    LS19-3512-9: data/LS19-3512-9.IonXpress_086.2019-11-01T13_49_02Z.fastq
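When Snakemake reads this file via the configfile: directive, the YAML becomes a plain Python dictionary. A minimal sketch (first two samples copied from the YAML above) of what you end up with, and the key point that iterating the dict yields the sample names, not the file paths:

```python
# What `configfile: "config.yaml"` produces: a plain Python dict,
# with config["samples"] as a nested dict of name -> fastq path.
config = {
    "samples": {
        "LS19-3512-1": "data/LS19-3512-1.IonXpress_060.2019-11-01T13_49_02Z.fastq",
        "LS19-3512-2": "data/LS19-3512-2.IonXpress_061.2019-11-01T13_49_02Z.fastq",
    }
}

# Iterating a dict yields its KEYS, i.e. the sample names:
print(list(config["samples"]))  # ['LS19-3512-1', 'LS19-3512-2']

# The fastq path for a given sample is a value lookup:
print(config["samples"]["LS19-3512-2"])  # data/LS19-3512-2.IonXpress_061.2019-11-01T13_49_02Z.fastq
```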

Now I have tried to load the samples from the YAML file as the documentation shows, like so:

configfile: "config.yaml"

print("Starting amplicon analysis workflow")

rule seqtk_quality_filter:
    input:
        expand("{sample}", sample=config["samples"])
    output:
        temp("data/{sample}.qtrim.fq")
    shell:
        "seqtk trimfq -b 0.01 {input} > {output}"

and get the following error:

sean@LEN943:~/Desktop/cobb_reo_3512$ snakemake -s amplicon_snakefile -np
Starting amplicon analysis workflow
Building DAG of jobs...
WorkflowError:
Target rules may not contain wildcards. Please specify concrete files or a rule without wildcards.

I have also tried lambda wildcards: config["samples"][wildcards.sample] as the input and get the same error.

I'm sure this is an easy fix, but I'm just not understanding how to do it, since using the YAML/JSON config files will always require a wildcard/variable.

Thank you in advance for pointing me in the right direction.

Sean


For me, this question is fine here, but the developer of Snakemake prefers Stack Overflow.


Thank you for the feedback, Wouter! I will give this a try when I escape the lab today.

4.5 years ago
skbrimer ▴ 740

Here is my solution; I hope it helps other people who are just starting with Snakemake. Thank you again, Wouter, for the input and examples!

configfile: "config.yaml"

print("Starting amplicon analysis workflow")

rule all:
    input:
        expand("data/{sample}.clean.fastq", sample = config["samples"])

rule seqtk_qtrim:
    input:
        lambda wildcards: config["samples"][wildcards.sample]
    output:
        temp("data/{sample}.qtrim.fastq")
    shell:
        "seqtk trimfq -q 0.01 {input} > {output}"

rule seqtk_clip:
    input:
        "data/{sample}.qtrim.fastq"
    output:
        temp("data/{sample}.clip.fastq")
    shell:
        "seqtk trimfq -b 20 -L 350 {input} > {output}"

rule seqtk_lengthfilter:
    input:
        "data/{sample}.clip.fastq"
    output:
        "data/{sample}.clean.fastq"
    shell:
        "seqtk seq -L 50 {input} > {output}"
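The expand() call in rule all is what turns the sample names from the config into a concrete list of target files, so no wildcards are left unresolved at the top level. A minimal pure-Python sketch of what it produces here (not Snakemake's actual implementation; sample names copied from the config above):

```python
# Sketch of expand("data/{sample}.clean.fastq", sample=config["samples"]):
# iterating the samples dict yields its keys, and each key is substituted
# into the pattern, giving one concrete target file per sample.
config = {"samples": {
    "LS19-3512-1": "data/LS19-3512-1.IonXpress_060.2019-11-01T13_49_02Z.fastq",
    "LS19-3512-2": "data/LS19-3512-2.IonXpress_061.2019-11-01T13_49_02Z.fastq",
}}

targets = ["data/{sample}.clean.fastq".format(sample=s) for s in config["samples"]]
print(targets)  # ['data/LS19-3512-1.clean.fastq', 'data/LS19-3512-2.clean.fastq']
```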
4.5 years ago

I'm not sure if it's the best way, but the following works for me. I use a function to get the samples:

def get_samples(wildcards):
    return config["samples"][wildcards.sample]

My first rule takes that function as its input: just the function name, not a function call. To get the genome, it seems to work fine when I just use the config dictionary directly.

rule minimap2_align:
    input:
        fq = get_samples,
        genome = config["genome"]
    output:
        "minimap2/alignment/{sample}.bam"
    params:
        sample = "{sample}"
    threads:
        8
    log:
        "logs/minimap2/{sample}.log"
    shell:
        """
        minimap2 --MD -ax map-ont -t {threads} \
         -R "@RG\\tID:{params.sample}\\tSM:{params.sample}" \
         {input.genome} {input.fq}/*.fastq.gz | \
         samtools sort -@ {threads} -o {output} - 2> {log}
        """

That's a workflow I have hacked together, so I don't know how far it is from best practices.


So it has taken a day, but I did get it to work, and I think I understand why it works now as well. The config.yaml file gets imported as a Python dictionary, but the format of the sample entries makes it a nested dictionary. So you have to use the expand("{sample}", sample = config["samples"]) notation as the input for your target files in the rule all section, which then makes it possible to use lambda wildcards: config["samples"][wildcards.sample] in the first pre-processing step to access the nested dictionary values.
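The lambda can be sketched outside Snakemake with a stand-in for the wildcards object (SimpleNamespace here is just a hypothetical substitute for the object Snakemake actually passes to input functions):

```python
from types import SimpleNamespace

# Nested dict, as produced by `configfile: "config.yaml"` (one sample
# copied from the config above).
config = {"samples": {
    "LS19-3512-1": "data/LS19-3512-1.IonXpress_060.2019-11-01T13_49_02Z.fastq",
}}

# The input function: look up the fastq path for the current wildcard value.
input_fn = lambda wildcards: config["samples"][wildcards.sample]

# Stand-in for the wildcards object Snakemake would pass in.
wc = SimpleNamespace(sample="LS19-3512-1")
print(input_fn(wc))  # data/LS19-3512-1.IonXpress_060.2019-11-01T13_49_02Z.fastq
```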


That seems like a good solution! Thanks for sharing.
