Question: Using wildcards to accept paired-end reads from Snakemake
2
gravatar for moldach686
5 months ago by
moldach68630
moldach68630 wrote:

I've ran into a problem trying to update my snakemake workflow to accept input from multiple directories.

Originally I had used glob_wildcards() _within_ my Snakefile like this:

# Full path to raw files 
SAMPLES, = glob_wildcards("{sample}_R1.fastq.gz")

rule all:
    '''
        The target rule for the workflow.
    '''
    input:
        expand('fastQC/before_trim/{sample}_R1_fastqc.html', sample=SAMPLES),
        expand('fastQC/before_trim/{sample}_R1_fastqc.zip', sample=SAMPLES),

This is not ideal for the obvious reasons that it is 1) Contained within the Snakefile and 2) will only work on files from one directory (in this case the $PWD)

Rather than a user messing around with an optimized Snakefile I would like them to use a samples.txt file that contains information for the file name and file location.

For example a samples.txt file that looks like this:

sample  location
BC1217  /foo/bar/pow/fastq
470     /wam/bam/crash/data
MTG109  /penguin/fire/station/

In addition, to the samples.txt there is also a config.yaml which contains information on the reference files/locations and will _eventually_ contain choices for different rules (e.g. variant calling with Freebayes or DeepVariant) controlled with IF/ELSE statements - but that's a whole'nother post. It looks like this:

# Files
REF_GENOME: "WS265_wormbase/c_elegans.PRJNA13758.WS265.genomic.fa.gz"
GENOME_ANNOTATION: "WS265_wormbase/c_elegans.PRJNA13758.WS265.annotations.gff3"

# Tools
QC_TOOL: "fastQC"
TRIM_TOOL: "trimmomatic"
ALIGN_TOOL: "bwa"
MARKDUP_TOOL: "picard"
CALLING_TOOL: "varscan"
ANNOT_TOOL: "coovar"

I actually based this workflow on an example I found here but it doesn't work for me.

I'll post the most pertinent parts of the code here but the full Snakefile can be found at the gist

# Load Samples ----------------------------------------------------------------

import pandas as pd
# getting the samples information (names, path to r1 & r2) from samples.txt
samples_information = pd.read_csv("samples.txt", sep='\t', index_col=False)
# get a list of the sample names
sample_names = list(samples_information['sample'])
sample_locations = list(samples_information['location'])

rule qc_before_trim:
    input:
        r1 = expand('{FASTQ_DIR}/{sample}_R1.fastq.gz', FASTQ_DIR=sample_locations, sample=sample_names),
        r2 = expand('{FASTQ_DIR}/{sample}_R2.fastq.gz', FASTQ_DIR=sample_locations, sample=sample_names)
    output:
        expand('{QC_DIR}/{QC_TOOL}/before_trim/{sample}_R1_fastqc.html', QC_DIR=dirs_dict["QC_DIR"], QC_TOOL=config["QC_TOOL"], sample=sample_names),
    expand('{QC_DIR}/{QC_TOOL}/before_trim/{sample}_R1_fastqc.zip', QC_DIR=dirs_dict["QC_DIR"], QC_TOOL=config["QC_TOOL"], sample=sample_names),
    expand('{QC_DIR}/{QC_TOOL}/before_trim/{sample}_R2_fastqc.html', QC_DIR=dirs_dict["QC_DIR"], QC_TOOL=config["QC_TOOL"], sample=sample_names),
        expand('{QC_DIR}/{QC_TOOL}/before_trim/{sample}_R2_fastqc.zip', QC_DIR=dirs_dict["QC_DIR"], QC_TOOL=config["QC_TOOL"], sample=sample_names)
    log: expand('{LOG_DIR}/{sample}-{QC_TOOL}-qc-before-trim.log', QC_TOOL=config["QC_TOOL"], LOG_DIR=dirs_dict["LOG_DIR"], sample=sample_names)
    resources:
        mem = 1000,
        time = 30
    threads: 1
    params:
        dir = expand('{QC_DIR}/{QC_TOOL}/before_trim/', QC_DIR=dirs_dict["QC_DIR"], QC_TOOL=config["QC_TOOL"])
    message: """--- Quality check of raw data with FastQC before trimming."""
    shell:
        'module load fastqc/0.11.5;'
    'fastqc -o {params.dir} -f fastq {input.r1} &'
    'fastqc -o {params.dir} -f fastq {input.r2}'

I guess the problem is that expand() generates all of the combinations versus what I want is iteration over two lists. Jeremy Leipzig was kind enough to suggest the use of wildcards on Twitter.

Prior to trying this I had tried something I found from another source but it didn't work. It was a method that defined wildcards as input for each rule. Unfortunately, I cannot find the source at the moment so I'll try to take a better look in my bookmarks in a few minutes. However, I did find some code-snips to show that it looked something like this:

def get_raw_fastq(wildcards):
    ''' return a dict with the path to the raw fastq files'''
    r1 = samples_information[samples_information["sample"] == wildcards.sample]['r1']
    r2 = samples_information[samples_information["sample"] == wildcards.sample]['r2']
    return {'r1': r1, 'r2': r2}

def get_fastq(wildcards):
    ''' return the pair of compressed fastq files for a sample.

        This helper function returns the appropriate paths according to the
        config file.
    '''
    d = {}
        d['r1'] = list(samples_information[samples_information["sample"] == wildcards.sample]['r1'])
        d['r2'] = list(samples_information[samples_information["sample"] == wildcards.sample]['r2'])
    return d

I had a bit of trouble trying to grok how to change this code to make it work for my situation probably from my lack of familiarity with Python. I would appreciate if someone could help me solve this, thank you.

snakemake wgs • 862 views
ADD COMMENTlink modified 5 months ago • written 5 months ago by moldach68630
  1. The first thing I tried (before zip) was just changing the input of the rule to your suggestion.

Switch:

rule qc_before_trim:
    input:
        r1 = expand('{FASTQ_DIR}/{sample}_R1.fastq.gz', FASTQ_DIR=sample_locations, sample=sample_names),
        r2 = expand('{FASTQ_DIR}/{sample}_R2.fastq.gz', FASTQ_DIR=sample_locations, sample=sample_names)

into

rule qc_before_trim:
  input:
    r1 = "{sample}_R1.fastq.gz"
    r2 = "{sample}_R2.fastq.gz"

However, it's looking for the input in QC directory (which is the output of this rule and not the input)

snakemake -n -r
Building DAG of jobs...
MissingInputException in line 53 of /SNAKEMAKE/DEMO/Snakefile:
Missing input files for rule qc_before_trim:
QC/fastQC/before_trim/BC1217_R2.fastq.gz
QC/fastQC/before_trim/BC1217_R1.fastq.gz
  1. If I don't need to use expand() in target rules I won't - I think your suggestion improves code readability - however, it doesn't seem to be working...

  2. When you mention intermediate targets I assume you're referring to the coovar rule?

expand('{ANNOT_DIR}/{ANNOT_TOOL}/{sample}/intermediate-files/{sample}_sorted_dedupped_snp_varscan.tsv.del', ANNOT_DIR=dirs_dict["ANNOT_DIR"], ANNOT_TOOL=config["ANNOT_TOOL"], sample=sample_names),

These are a temporary files created by coovar - I do not necessarily require them downstream.

Given what you've said in #1 & #3 (if this does work somehow), and what's in Step 6. of the Snakemake Advanced manual could I do the following to have them deleted?

temp("{sample}/intermediate-files/{sample}_sorted_dedupped_snp_varscan.tsv.del")
  1. Here is my attempt at using zip:
import pandas as pd
# getting the samples information (names, path to r1 & r2) from samples.txt
samples_information = pd.read_csv("samples.txt", sep='\t', index_col=False)
# get a list of the sample names
sample_names = list(samples_information['sample'])
sample_locations = list(samples_information['location'])
samples_set = list(map("".join, zip(map(str,sample_locations), sample_names)))
  

I can figure out how to include them as input using expand() then but I'm not sure how to do otherwise:

rule qc_before_trim:
    input:
        r1 = expand('{sample}_R1.fastq.gz', sample=samples_set),
        r2 = expand('{sample}_R2.fastq.gz', sample=samples_set)

So to summarize, your suggestions for eliminating redundancy in the expand() statements and using zip() are wonderful and seem to do the trick - thank you!

Making the code more readable by entirely doing away with expand() would be excellent but I seem to be having some issues with this still.

ADD REPLYlink modified 5 months ago • written 5 months ago by moldach68630

ok yes you can prepend your input with directories i.e r1 = os.path.join(FASTQ_DIR,"{sample}_R1.fastq.gz"). that's ok i would just avoid expand in wildcards rules.

now for the zip thing you are close. you can just use samples_set to create a lookup function def getHome(sample): return(os.path.join(dict(samples_set)[sample],sample))) and then set your directory in the input to mydir = lambda wildcards: getHome(wildcards.sample).

ADD REPLYlink modified 5 months ago • written 5 months ago by Jeremy Leipzig19k

Sorry but I'm still confused as to what I need to do.

Your suggestion is to use os.path.join() instead of expand() in wildcards rules, correct?

FASTQ_DIR hasn't be defined anywhere so I cannot just use that - I assumed you meant you need to set mydir instead?

Your def getHome(sample) function may have contained an extra ) at the end because I got an error. I've tried the following but get an error:

import pandas as pd
# getting the samples information (names, path to r1 & r2) from samples.txt
samples_information = pd.read_csv("samples.txt", sep='\t', index_col=False)
# get a list of the sample names
sample_names = list(samples_information['sample'])
sample_locations = list(samples_information['location'])
samples_set = list(map("".join, zip(map(str,sample_locations), sample_names)))

def getHome(sample):
 return(os.path.join(dict(samples_set)[sample],sample))

rule qc_before_trim:
    input:
        mydir = lambda wildcards: getHome(wildcards.sample),
        r1 = os.path.join(mydir, "{sample}_R1.fastq.gz"),
        r2 = os.path.join(mydir, "{sample}_R2.fastq.gz")

(snakemake) [moldach@cedar5 DEMO]$ snakemake -n -r NameError in line 96 of /scratch/moldach/MADDOG/SNAKEMAKE/DEMO/Snakefile: name 'mydir' is not defined File "/scratch/moldach/MADDOG/SNAKEMAKE/DEMO/Snakefile", line 96, in <module>

ADD REPLYlink written 5 months ago by moldach68630

ok there's many ways to skin a cat but this is one strategy: def getHome(sample): return(os.path.join(dict(samples_set)[sample],"{0}_{1}.fastq.gz".format(sample,pair)) for pair in ['R1','R2'])

r1 = lambda wildcards: getHome(wildcards.sample)[0],

r2 = lambda wildcards: getHome(wildcards.sample)[1]

ADD REPLYlink modified 5 months ago • written 5 months ago by Jeremy Leipzig19k

I'm getting the following error:

(snakemake) [moldach@cedar1 DEMO]$ snakemake -n -r Building DAG of jobs... InputFunctionException in line 56 of /scratch/moldach/MADDOG/SNAKEMAKE/DEMO/Snakef AttributeError: 'Wildcards' object has no attribute 'sample' Wildcards:

# Directories------------------------------------------------------------------

configfile: "config.yaml"

# Setting the names of all directories
dir_list = ["LOG_DIR", "BENCHMARK_DIR", "QC_DIR", "TRIM_DIR", "ALIGN_DIR", "MARKDUP_DIR", "CALLING_DIR", "ANNOT_DIR"]
dir_names = ["LOGS", "BENCHMARKS", "QC", "TRIMMING", "ALIGNMENT", "MARK_DUPLICATES", "VARIANT_CALLING", "ANNOTATION"]
dirs_dict = dict(zip(dir_list, dir_names))

# Load Samples ----------------------------------------------------------------

import pandas as pd
# getting the samples information (names, path to r1 & r2) from samples.txt
samples_information = pd.read_csv("samples.txt", sep='\t', index_col=False)
# get a list of the sample names
sample_names = list(samples_information['sample'])
sample_locations = list(samples_information['location'])
samples_set = list(map("".join, zip(map(str,sample_locations), sample_names)))

# Rules -----------------------------------------------------------------------

rule all:
    '''
        The target rule for the workflow.
        This lists all of the final products of the workflow

    '''
    input:
        expand('{QC_DIR}/{QC_TOOL}/before_trim/{sample}_{pair}_fastqc.{ext}', QC_DIR=dirs_dict["QC_DIR"], QC_TOOL=config["QC_TOOL"], sample=sample_names, pair=['R1', 'R2'], ext=['html', 'zip']),

def getHome(sample): 
  return(os.path.join(dict(samples_set)[sample],"{0}_{1}.fastq.gz".format(sample,pair)) for pair in ['R1','R2'])


rule qc_before_trim:
    input:
        r1 = lambda wildcards: getHome(wildcards.sample)[0],
        r2 = lambda wildcards: getHome(wildcards.sample)[1]
    output:
        expand('{QC_DIR}/{QC_TOOL}/before_trim/{sample}_{pair}_fastqc.{ext}', QC_DIR=dirs_dict["QC_DIR"], QC_TOOL=config["QC_TOOL"], sample=sample_names, pair=['R1', 'R2'], ext=['html', 'zip'])
    log: expand('{LOG_DIR}/{sample}-{QC_TOOL}-qc-before-trim.log', QC_TOOL=config["QC_TOOL"], LOG_DIR=dirs_dict["LOG_DIR"], sample=sample_names)
    resources:
        mem = 1000,
        time = 30
    threads: 1
    params:
        dir = expand('{QC_DIR}/{QC_TOOL}/before_trim/', QC_DIR=dirs_dict["QC_DIR"], QC_TOOL=config["QC_TOOL"])
    message: """--- Quality check of raw data with FastQC before trimming."""
    shell: """
        module load fastqc/0.11.5;
    fastqc -o {params.dir} -f fastq {input.r1} &
    fastqc -o {params.dir} -f fastq {input.r2}
        """
ADD REPLYlink written 5 months ago by moldach68630

don't use expand in wildcard rules

ADD REPLYlink written 5 months ago by Jeremy Leipzig19k

Do you mean in the output of the rule?

output:
        expand('{QC_DIR}/{QC_TOOL}/before_trim/{sample}_{pair}_fastqc.{ext}', QC_DIR=dirs_dict["QC_DIR"], QC_TOOL=config["QC_TOOL"], sample=sample_names, pair=['R1', 'R2'], ext=['html', 'zip']

If so does this apply to log: and params: as well?

I have no idea how not use expand() for this statement as QC_TOOL is defined in the config file QC_TOOL=config["QC_TOOL"]

I've even tried to simplify the rule all and the output to not use expand and as you can see I'm struggling:

rule all:
    input:
        expand('{sample}_{pair}_fastqc.{ext}', sample=sample_names, pair=['R1', 'R2'], ext=['html', 'zip']),

def getHome(sample):
  return(os.path.join(dict(samples_set)[sample],"{0}_{1}.fastq.gz".format(sample,pair)) for pair in ['R1','R2'])

rule qc_before_trim:
    input:
        r1 = lambda wildcards: getHome(wildcards.sample)[0],
        r2 = lambda wildcards: getHome(wildcards.sample)[1]
    output:
        ["{sample}_R1_fastqc.html".format(sample=sample) for sample in samples_set],
        ["{sample}_R1_fastqc.zip".format(sample=sample) for sample in samples_set],
        ["{sample}_R2_fastqc.html".format(sample=sample) for sample in samples_set],
        ["{sample}_R2_fastqc.zip".format(sample=sample) for sample in samples_set]
    resources:
        mem = 1000,
        time = 30
    threads: 1
    message: """--- Quality check of raw data with FastQC before trimming."""
    shell: """
        module load fastqc/0.11.5;
        fastqc -o ./ -f fastq {input.r1} &
        fastqc -o ./ -f fastq {input.r2}
        """


snakemake -n -r
Building DAG of jobs...
MissingInputException in line 23 of /scratch/moldach/MADDOG/SNAKEMAKE/DEMO/Snakefile:
Missing input files for rule all:
MTG109_R2_fastqc.html
BC1217_R2_fastqc.zip
BC1217_R2_fastqc.html
MTG109_R1_fastqc.html
470_R2_fastqc.zip
470_R2_fastqc.html
BC1217_R1_fastqc.html
470_R1_fastqc.zip
MTG109_R1_fastqc.zip
470_R1_fastqc.html
BC1217_R1_fastqc.zip
MTG109_R2_fastqc.zip
ADD REPLYlink modified 5 months ago • written 5 months ago by moldach68630

Okay so switching to this now produces this error:

rule all:
    input:
        expand('{sample}_{pair}_fastqc.{ext}', sample=sample_names, pair=['R1', 'R2'], ext=['html', 'zip']),

def getHome(sample):
  return(os.path.join(dict(samples_set)[sample],"{0}_{1}.fastq.gz".format(sample,pair)) for pair in ['R1','R2'])

rule qc_before_trim:
    input:
        r1 = lambda wildcards: getHome(wildcards.sample)[0],
        r2 = lambda wildcards: getHome(wildcards.sample)[1]
    output:
        "{sample}_R1_fastqc.html",
        "{sample}_R1_fastqc.zip",
        "{sample}_R2_fastqc.html",
        "{sample}_R2_fastqc.zip"
    resources:
        mem = 1000,
        time = 30
    threads: 1
    message: """--- Quality check of raw data with FastQC before trimming."""
    shell: """
        module load fastqc/0.11.5;
        fastqc -o ./ -f fastq {input.r1} &
        fastqc -o ./ -f fastq {input.r2}
        """
snakemake -n -r
Building DAG of jobs...
InputFunctionException in line 36 of /SNAKEMAKE/DEMO/Snakefile:
TypeError: 'generator' object is not subscriptable
Wildcards:
sample=BC1217
ADD REPLYlink modified 5 months ago • written 5 months ago by moldach68630

uhh test that getHome lookup function in isolation. dict(sample_set) should be a dictionary with the samples as keys and paths as values. the goal is to have a function that takes a sample and returns a list of 2 paths, one for each pair

ADD REPLYlink written 5 months ago by Jeremy Leipzig19k

ummm okay, so I'm not really sure how to troubleshoot/test this?

Python 3.8.0 (default, Nov 12 2019, 19:43:25)
[GCC 5.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> import pandas as pd
>>> # getting the samples information (names, path to r1 & r2) from samples.txt
... samples_information = pd.read_csv("samples.txt", sep='\t', index_col=False)
>>> # get a list of the sample names
... sample_names = list(samples_information['sample'])
>>> sample_locations = list(samples_information['location'])
>>> samples_set = list(map("".join, zip(map(str,sample_locations), sample_names)))
>>> samples_set
['/foo/bar/pow/BC1217', '/wam/bam/crash/470', '/penguin/fire/station/MTG109']
>>> def getHome(sample):
...   return(os.path.join(dict(samples_set)[sample],"{0}_{1}.fastq.gz".format(sample,pair)) for pa)
...
>>> getHome(sample)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'sample' is not defined
>>> getHome()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: getHome() missing 1 required positional argument: 'sample'
>>> getHome(MTG109)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'MTG109' is not defined
>>> getHome(MTG109, R1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'MTG109' is not defined
>>> return(os.path.join(dict(samples_set)[sample],"{0}_{1}.fastq.gz".format(sample,pair)) for pair
...
...
...
...
KeyboardInterrupt
>>>
>>> return(os.path.join(dict(samples_set)[sample],"{0}_{1}.fastq.gz".format(sample,pair)) for pair
...
...
KeyboardInterrupt
>>> dict(samples_set)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: dictionary update sequence element #0 has length 77; 2 is required
>>> dict(samples_set)[sample]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: dictionary update sequence element #0 has length 77; 2 is required
>>>

Not able to run the function getHome() as-is so I tried breaking it down into sections. Calling dict(samples_set) gives this error so perhaps samples_set is wrong?

I read somewhere that the issue could be because of feeding in a list instead of a tuple to dict():

>>> dict(['A', 'b'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: dictionary update sequence element #0 has length 1; 2 is required
>>> dict([('A', 'b')])
{'A': 'b'}

Hmmm....okay....

>>> test = ['A', 'B']
>>> test
['A', 'B']
>>> test2 = tuple(test)
>>> test2
('A', 'B')
>>> dict([test2])
{'A': 'B'}

okay now that I verified this works let's try on samples_set:

>>> samples_set
['/foo/bar/pow/BC1217', '/wam/bam/crash/470', '/penguin/fire/station/MTG109']
>>> samples_set2 = tuple(samples_set)
>>> dict([samples_set2])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: dictionary update sequence element #0 has length 3; 2 is required

Omg I have no idea what's going here....

Appreciate your help figuring this out

ADD REPLYlink modified 5 months ago • written 5 months ago by moldach68630

I believe one of the errors is that samples_set was not defined properly. Maybe it should look like this instead?

import os
import pandas as pd
# getting the samples information (names, path to r1 & r2) from samples.txt
samples_information = pd.read_csv("samples.txt", sep='\t', index_col=False)
# get a list of the sample names
sample_names = list(samples_information['sample'])
sample_locations = list(samples_information['location'])
samples_set = zip(sample_names, sample_locations)

Now I can try:

>>> dict(samples_set)
{'BC1217': '/foo/bar/pow/', '470': '/wam/bam/crash/', 'MTG109': '/penguin/fire/station/'}

So that seems to work (although I'm not sure if the order of sample/location is correct?

Not sure how to test the getHome() function further...

>>> dict(samples_set)
{'BC1217': '/foo/bar/pow/', '470': '/wam/bam/crash/', 'MTG109': '/penguin/fire/station/'}
>>> def getHome(sample):
...   return(os.path.join(dict(samples_set)[sample],"{0}_{1}.fastq.gz".format(sample,pair)) for pair in ['R1','R2'])
...
>>> getHome(0)
<generator object getHome.<locals>.<genexpr> at 0x7fde13695190>
>>> getHome(1)
<generator object getHome.<locals>.<genexpr> at 0x7fde136950b0>
>>> getHome(BC1217)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'BC1217' is not defined
>>> getHome()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: getHome() missing 1 required positional argument: 'sample'
>>> getHome(sample)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'sample' is not defined
>>> samples_set = zip(sample_locations, sample_names)
>>> getHome(sample)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'sample' is not defined
>>> getHome(BC1217)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'BC1217' is not defined
>>> getHome(0)
<generator object getHome.<locals>.<genexpr> at 0x7fde13695120>
>>> os.path.join(dict(samples_set)[sample]
...
... )
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'os' is not defined
>>> import os
>>> os.path.join(dict(samples_set)[sample])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'sample' is not defined
>>> os.path.join(dict(samples_set)[BC1217])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'BC1217' is not defined
ADD REPLYlink modified 5 months ago • written 5 months ago by moldach68630

ok i guess you have to explicitly make sure python runs the generator (notice the list cast)

def getHome(sample):
return(list(os.path.join(sample_dict[sample],"{0}_{1}.fastq.gz".format(sample,pair)) for pair in ['R1','R2']))

ADD REPLYlink modified 5 months ago • written 5 months ago by Jeremy Leipzig19k

Okay so you used sample_dict here which I suppose is the same as dict(samples_set). I noticed I had the order for the key wrong earlier in samples_set so I've fixed that now.

The following change to the snakemake file:

import os
import pandas as pd
# getting the samples information (names, path to r1 & r2) from samples.txt
samples_information = pd.read_csv("samples.txt", sep='\t', index_col=False)
# get a list of the sample names
sample_names = list(samples_information['sample'])
sample_locations = list(samples_information['location'])
samples_set = zip(sample_names, sample_locations)

# Rules -----------------------------------------------------------------------

rule all:
    input:
        expand('{sample}_{pair}_fastqc.{ext}', sample=sample_names, pair=['R1', 'R2'], ext=['html', 'zip']),

def getHome(sample):
  return(list(os.path.join(dict(samples_set)[sample],"{0}_{1}.fastq.gz".format(sample,pair)) for pair in ['R1','R2']))


rule qc_before_trim:
    input:
        r1 = lambda wildcards: getHome(wildcards.sample)[0],
        r2 = lambda wildcards: getHome(wildcards.sample)[1]
    output:
        "{sample}_R1_fastqc.html",
        "{sample}_R1_fastqc.zip",
        "{sample}_R2_fastqc.html",
        "{sample}_R2_fastqc.zip"
    resources:
        mem = 1000,
        time = 30
    threads: 1
    message: """--- Quality check of raw data with FastQC before trimming."""
    shell: """
        module load fastqc/0.11.5;
        fastqc -o ./ -f fastq {input.r1} &
        fastqc -o ./ -f fastq {input.r2}
        """

But I'm getting the error:

snakemake -n -r
Building DAG of jobs...
InputFunctionException in line 37 of /SNAKEMAKE/DEMO/Snakefile:
KeyError: 'BC1217'
Wildcards:
sample=BC1217
ADD REPLYlink modified 5 months ago • written 5 months ago by moldach68630

ok what does dict(samples_set) look like? is BC1217 a key?

ADD REPLYlink written 5 months ago by Jeremy Leipzig19k
>>> dict(samples_set)
{'BC1217': '/foo/bar/pow/', '470': '/wam/bam/crash/', 'MTG109': '/penguin/fire/station/'}
ADD REPLYlink modified 5 months ago • written 5 months ago by moldach68630

I think this has something to do with you mixing integers and strings in your keys (sample names). Make sure you explicitly state that 470 is a string when you build the dictionary and the Snakemake target.

ADD REPLYlink modified 5 months ago • written 5 months ago by Jeremy Leipzig19k

I've spent some time searching for this but I have no idea how to do such a thing in Python (I use R).

There was this example of how to do the opposite of what you suggested.:

The original string : {"Nikhil" : 1, "Akshat" : 2, "Akash" : 3}
The converted dictionary : {'Nikhil': 1, 'Akshat': 2, 'Akash': 3}

So I thought to myself, okay since:

>>> samples_set = zip(sample_names, sample_locations)
>>> sample_names
['BC1217', '470', 'MTG109']

I should just convert sample_names to " instead of ' but this SO post say's you cannot so I'm confused...

Could you please explain how I can do this?

ADD REPLYlink written 5 months ago by moldach68630

Actually before we get too sidetracked...

I've removed the 470 sample from the samples.txt just to simplify things - the error persists. Besides, the error was, and still is, pointing to BC1217.

Building DAG of jobs...
InputFunctionException in line 32 of /SNAKEMAKE/DEMO/Snakefile:
KeyError: 'BC1217'
Wildcards:
sample=BC1217
ADD REPLYlink modified 5 months ago • written 5 months ago by moldach68630

ok what i think is going on is the generator object returned by zip is getting spent without being reset, so it is empty the next time it's evaluated. just create a dict right away and use that samples_dict = dict(zip(sample_names, sample_locations))

ADD REPLYlink written 5 months ago by Jeremy Leipzig19k

Thanks Jeremy, I think there is some progress snow:

>>> import os
>>> import pandas as pd
>>>  samples_information = pd.read_csv("samples.txt", sep='\t', index_col=False)
>>> sample_names = list(samples_information['sample'])
>>> sample_locations = list(samples_information['location'])
>>> samples_set = dict(zip(sample_names, sample_locations))

>>> def getHome(sample):
...     return(list(os.path.join(dict(samples_set)[sample],"{0}_{1}.fastq.gz".format(sample,pair)) for pair in ['R1','R2']))
...
>>> samples_information = pd.read_csv("samples.txt", sep='\t', index_col=False)
>>> sample_names = list(samples_information['sample'])
>>> sample_locations = list(samples_information['location'])
>>> samples_set = dict(zip(sample_names, sample_locations))

>>> def getHome(sample):
...   return(list(os.path.join(dict(samples_set)[sample],"{0}_{1}.fastq.gz".format(sample,pair)) for pair in ['R1','R2']))
...
>>> samples_set
{'BC1217': '/foo/bar/pow/', '470': '/wam/bam/crash/', 'MTG109': '/penguin/fire/station/'}
>>> getHome('BC1217')
['/foo/bar/pow/BC1217_R1.fastq.gz', '/foo/bar/pow/BC1217_R2.fastq.gz']

It looks like the getHome() function is now behaving properly - it returns the path to both R1/R2.

Unfortunately still getting an error when trying to run snakemake.

import os
import pandas as pd
# getting the samples information (names, path to r1 & r2) from samples.txt
samples_information = pd.read_csv("samples.txt", sep='\t', index_col=False)
# get a list of the sample names
sample_names = list(samples_information['sample'])
sample_locations = list(samples_information['location'])
samples_set = zip(sample_names, sample_locations)

# Rules -----------------------------------------------------------------------

rule all:
    input:
        expand('{sample}_{pair}_fastqc.{ext}', sample=sample_names, pair=['R1', 'R2'], ext=['html', 'zip']),

def getHome(sample):
  return(list(os.path.join(dict(samples_set)[sample],"{0}_{1}.fastq.gz".format(sample,pair)) for pair in ['R1','R2']))


rule qc_before_trim:
    input:
        r1 = lambda wildcards: getHome(wildcards.sample)[0],
        r2 = lambda wildcards: getHome(wildcards.sample)[1]
    output:
        "{sample}_R1_fastqc.html",
        "{sample}_R1_fastqc.zip",
        "{sample}_R2_fastqc.html",
        "{sample}_R2_fastqc.zip"
    resources:
        mem = 1000,
        time = 30
    threads: 1
    message: """--- Quality check of raw data with FastQC before trimming."""
    shell: """
        module load fastqc/0.11.5;
fastqc -o ./ -f fastq {input.r1} &
        fastqc -o ./ -f fastq {input.r2}
        """

And the error:

snakemake -n -r
Building DAG of jobs...
InputFunctionException in line 32 of /SNAKEMAKE/DEMO/Snakefile:
KeyError: 'BC1217'
Wildcards:
sample=BC1217
ADD REPLYlink modified 5 months ago • written 5 months ago by moldach68630

i don't see you using a samples_dict

ADD REPLYlink written 5 months ago by Jeremy Leipzig19k

A rose by any other name

samples_set = dict(zip(sample_names, sample_locations))

So does the function then need to be changed?

>>> samples_set = dict(zip(sample_names, sample_locations))

>>> def getHome(sample):
...     return(list(os.path.join(dict(samples_set)[sample],"{0}_{1}.fastq.gz".format(sample,pair)) for pair in ['R1','R2']))
...

If so, how?

Changing to the following errors:

>>> samples_dict = dict(zip(sample_names, sample_locations))
>>> def getHome(sample):
...   return(list(os.path.join(samples_dict,"{0}_{1}.fastq.gz".format(sample,pair)) for pair in ['R1','R2']))
...
>>> getHome('BC1217')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 2, in getHome
  File "<stdin>", line 2, in <genexpr>
  File "/home/bin/snakemake/lib/python3.8/posixpath.py", line 76, in join
    a = os.fspath(a)
TypeError: expected str, bytes or os.PathLike object, not dict
ADD REPLYlink modified 5 months ago • written 5 months ago by moldach68630

samples_dict[sample]

ADD REPLYlink written 5 months ago by Jeremy Leipzig19k

Yes I had also tried that.

>>> def getHome(sample):
...   return(list(os.path.join(samples_dict[sample],"{0}_{1}.fastq.gz".format(sample,pair)) for pair in ['R1','R2']))
...

snakemake -n -r
Building DAG of jobs...
InputFunctionException in line 31 of /SNAKEMAKE/DEMO/Snakefile:
TypeError: 'zip' object is not subscriptable
Wildcards:
sample=BC1217
ADD REPLYlink modified 5 months ago • written 5 months ago by moldach68630

i'm running out of horizontal space, let's move this discussion to https://github.com/leipzig/biostars439754

ADD REPLYlink written 5 months ago by Jeremy Leipzig19k

Worked for me.

I've made a PR regarding how to include the directories that come before "{sample}_R1_fastqc.html":

How do I tell "{QC_DIR}/{QC_TOOL}/{sample}_R1_fastqc.html" comes from QC_DIR=dirs_dict["QC_DIR"], QC_TOOL=config["QC_TOOL"])?

For example, if there was no dynamically generated QC_TOOL from config=["QC_TOOL"] one could simple do: os.path.join(dirs_dict["QC_DIR"] + '/' + '/' + '{sample}_R1_fastqc.html')

However, I want to include information from the config.yaml but it doesn't accept from os.path.join(dirs_dict["QC_DIR"] + '/' + config["QC_TOOL"] + '/' + '{sample}_R1_fastqc.html'):

>>> os.path.join(dirs_dict["QC_DIR"] + '/' + config["QC_TOOL"] + '/' + '{sample}_R1_fastqc.html')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'config' is not defined

This is also redundant for both {pair} and {ext}. So what's the proper, succinct, way to do this?

ADD REPLYlink modified 5 months ago • written 5 months ago by moldach68630

I pushed another commit that uses what you propose

ADD REPLYlink written 5 months ago by Jeremy Leipzig19k

published at as a CodeOcean capsule: https://codeocean.com/capsule/4796507/tree/v1

ADD REPLYlink written 5 months ago by Jeremy Leipzig19k
3
gravatar for Jeremy Leipzig
5 months ago by
Philadelphia, PA
Jeremy Leipzig19k wrote:

OK lemme break this down into three four heuristics (conveniently numbered 1, 1, 1, and 2)

  1. Keep your implicit wildcard rules as simple as possible. Describe what they do - no more, no less. There is generally no need for expand in these as far as samples go. expand will collide with wildcards anyway which will force you to use double brackets to escape.
rule qc_before_trim:
  input:
    r1 = "{sample}_R1.fastq.gz"
    r2 = "{sample}_R2.fastq.gz"
  

that's it. there's no need to lock all these directories down when it's already specified as a parameter or in the file path itself. right?

  1. if you're going to use expand in target rules, well then really use them to their fullest capability. your target rule is way too redundant
expand('{QC_DIR}/{QC_TOOL}/before_trim/{sample}_{pair}_fastqc.{ext}', QC_DIR=dirs_dict["QC_DIR"], QC_TOOL=config["QC_TOOL"],sample=sample_names,pair=['R1','R2'],ext=['html','zip']),
  
  1. you don't need intermediate targets in a target rule, either they get created and you explicitly delete them with temp() or they stick around. either way they will be created. Intermediate rules are fine for debugging, but intermediate files don't belong in final targets. So how about trying one final target with a report file and everything that leads up to that file will be the universe of intermediates?

  2. expand is not the same as python zip, it will try to generate all combinations, so if you have only certain matching pairs of target directory and files that are in a static file just iterate over them with zip and you'll be good to go

UPDATE: added a CodeOcean solution here: https://codeocean.com/capsule/4796507/tree/v1

ADD COMMENTlink modified 5 months ago • written 5 months ago by Jeremy Leipzig19k

Your sample solution using dummy-data worked great!

Since we ran out of space above, to summarize:

Accept data from a samples.txt files via:

# getting the samples information (names, path to r1 & r2) from samples.txt
samples_information = pd.read_csv("samples.txt", sep='\t', index_col=False)
# get a list of the sample names
sample_names = list(samples_information['sample'])
sample_locations = list(samples_information['location'])
samples_dict = dict(zip(sample_names, sample_locations))
# get number of samples
len_samples = len(sample_names)

It's best to use a function as input for each rule(s). For example, for QC and trimming you would use the following as they are paired-end:

def getHome(sample):
  return(list(os.path.join(samples_dict[sample],"{0}_{1}.fastq.gz".format(sample,pair)) for pair in ['R1','R2']))

The length of the input (how many samples are being processed) is determined via len_samples = len(sample_names) and for trimming it would look like:

trim_dir = os.path.join(dirs_dict["TRIM_DIR"],config["TRIM_TOOL"])
trims_locations = [trim_dir] * len_samples
trims_dict = dict(zip(sample_names, trims_locations))

def getTrims(sample):
  return(list(os.path.join(trims_dict[sample],"{0}_{1}_trim_paired.fq.gz".format(sample,pair)) for pair in ['R1','R2']))

I've done the following for post-alignment (or cases where you only have a single file input) but I'm not sure if it's the best way to do it (nonetheless it works):

align_dir = os.path.join(dirs_dict["ALIGN_DIR"],config["ALIGN_TOOL"])
aligns_locations = [align_dir] * len_samples
aligns_dict = dict(zip(sample_names, aligns_locations))

def getBams(sample):
  return(list(os.path.join(aligns_dict[sample],"{0}.bam".format(sample,pair)) for pair in ['']))

You taught me I could use temp() in the output: of Snakemake rules that way I wouldn't have to specify them all in rule all. os.path.join() should be used instead of expand() in rules - I'm really surprised the official documentation focuses so much on expand() when IMO this is as important!

temp(os.path.join(dirs_dict["ANNOT_DIR"],config["ANNOT_TOOL"],"{sample}/intermediate-files/{sample}_sorted_dedupped_snp_varscan.tsv.del"))

And finally, the only place you should use expand() should be in the rule all.

rule all:
    input:
        expand('{QC_DIR}/{QC_TOOL}/before_trim/{sample}_{pair}_fastqc.{ext}', QC_DIR=dirs_dict["QC_DIR"], QC_TOOL=config["QC_TOOL"], sample=sample_names, pair=['R1', 'R2'], ext=['html', 'zip'])

Thank you so much for your week of support! Hopefully this helps others in the future as well.

ADD REPLYlink modified 5 months ago • written 5 months ago by moldach68630
1

Great! Yes expand is useful for generating all combinations of outputs for target rules, but should probably be avoided in wildcard rules. The combination of lambda wildcards and some kind of lookup function is useful in your case where you have an unpredictable directory structure. Glad this finally worked out and I hope you can help others on Biostars!

ADD REPLYlink written 5 months ago by Jeremy Leipzig19k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1705 users visited in the last hour