Question: Using wildcards to accept paired-end reads from Snakemake
2
gravatar for moldach686
5 days ago by
moldach68630
moldach68630 wrote:

I've ran into a problem trying to update my snakemake workflow to accept input from multiple directories.

Originally I had used glob_wildcards() _within_ my Snakefile like this:

# Full path to raw files 
SAMPLES, = glob_wildcards("{sample}_R1.fastq.gz")

rule all:
    '''
        The target rule for the workflow.
    '''
    input:
        expand('fastQC/before_trim/{sample}_R1_fastqc.html', sample=SAMPLES),
        expand('fastQC/before_trim/{sample}_R1_fastqc.zip', sample=SAMPLES),

This is not ideal for the obvious reasons that it is 1) Contained within the Snakefile and 2) will only work on files from one directory (in this case the $PWD)

Rather than a user messing around with an optimized Snakefile I would like them to use a samples.txt file that contains information for the file name and file location.

For example a samples.txt file that looks like this:

sample  location
BC1217  /home/moldach/projects/def-mtarailo/common/data/strains_ucalgary/fastq
470     /home/moldach/projects/def-mtarailo/tamaroi/data
MTG109  /home/moldach/projects/def-mtarailo/tamaroi/data/strains_ucalgary/fastq_MTG

In addition, to the samples.txt there is also a config.yaml which contains information on the reference files/locations and will _eventually_ contain choices for different rules (e.g. variant calling with Freebayes or DeepVariant) controlled with IF/ELSE statements - but that's a whole'nother post. It looks like this:

# Files
REF_GENOME: "/home/moldach/projects/def-mtarailo/common/indexes/WS265_wormbase/c_elegans.PRJNA13758.WS265.genomic.fa.gz"
GENOME_ANNOTATION: "/home/moldach/projects/def-mtarailo/common/indexes/WS265_wormbase/c_elegans.PRJNA13758.WS265.annotations.gff3"

# Tools
QC_TOOL: "fastQC"
TRIM_TOOL: "trimmomatic"
ALIGN_TOOL: "bwa"
MARKDUP_TOOL: "picard"
CALLING_TOOL: "varscan"
ANNOT_TOOL: "coovar"

I actually based this workflow on an example I found here but it doesn't work for me.

I'll post the most pertinent parts of the code here but the full Snakefile can be found at the gist

# Load Samples ----------------------------------------------------------------

import pandas as pd
# getting the samples information (names, path to r1 & r2) from samples.txt
samples_information = pd.read_csv("samples.txt", sep='\t', index_col=False)
# get a list of the sample names
sample_names = list(samples_information['sample'])
sample_locations = list(samples_information['location'])

rule qc_before_trim:
    input:
        r1 = expand('{FASTQ_DIR}/{sample}_R1.fastq.gz', FASTQ_DIR=sample_locations, sample=sample_names),
        r2 = expand('{FASTQ_DIR}/{sample}_R2.fastq.gz', FASTQ_DIR=sample_locations, sample=sample_names)
    output:
        expand('{QC_DIR}/{QC_TOOL}/before_trim/{sample}_R1_fastqc.html', QC_DIR=dirs_dict["QC_DIR"], QC_TOOL=config["QC_TOOL"], sample=sample_names),
    expand('{QC_DIR}/{QC_TOOL}/before_trim/{sample}_R1_fastqc.zip', QC_DIR=dirs_dict["QC_DIR"], QC_TOOL=config["QC_TOOL"], sample=sample_names),
    expand('{QC_DIR}/{QC_TOOL}/before_trim/{sample}_R2_fastqc.html', QC_DIR=dirs_dict["QC_DIR"], QC_TOOL=config["QC_TOOL"], sample=sample_names),
        expand('{QC_DIR}/{QC_TOOL}/before_trim/{sample}_R2_fastqc.zip', QC_DIR=dirs_dict["QC_DIR"], QC_TOOL=config["QC_TOOL"], sample=sample_names)
    log: expand('{LOG_DIR}/{sample}-{QC_TOOL}-qc-before-trim.log', QC_TOOL=config["QC_TOOL"], LOG_DIR=dirs_dict["LOG_DIR"], sample=sample_names)
    resources:
        mem = 1000,
        time = 30
    threads: 1
    params:
        dir = expand('{QC_DIR}/{QC_TOOL}/before_trim/', QC_DIR=dirs_dict["QC_DIR"], QC_TOOL=config["QC_TOOL"])
    message: """--- Quality check of raw data with FastQC before trimming."""
    shell:
        'module load fastqc/0.11.5;'
    'fastqc -o {params.dir} -f fastq {input.r1} &'
    'fastqc -o {params.dir} -f fastq {input.r2}'

error

I guess the problem is that expand() generates all of the combinations versus what I want is iteration over two lists. Jeremy Leipzig was kind enough to suggest the use of wildcards on Twitter.

Prior to trying this I had tried something I found from another source but it didn't work. It was a method that defined wildcards as input for each rule. Unfortunately, I cannot find the source at the moment so I'll try to take a better look in my bookmarks in a few minutes. However, I did find some code-snips to show that it looked something like this:

def get_raw_fastq(wildcards):
    ''' return a dict with the path to the raw fastq files'''
    r1 = samples_information[samples_information["sample"] == wildcards.sample]['r1']
    r2 = samples_information[samples_information["sample"] == wildcards.sample]['r2']
    return {'r1': r1, 'r2': r2}

def get_fastq(wildcards):
    ''' return the pair of compressed fastq files for a sample.

        This helper function returns the appropriate paths according to the
        config file.
    '''
    d = {}
        d['r1'] = list(samples_information[samples_information["sample"] == wildcards.sample]['r1'])
        d['r2'] = list(samples_information[samples_information["sample"] == wildcards.sample]['r2'])
    return d

I had a bit of trouble trying to grok how to change this code to make it work for my situation probably from my lack of familiarity with Python. I would appreciate if someone could help me solve this, thank you.

snakemake wgs • 208 views
ADD COMMENTlink modified 5 days ago • written 5 days ago by moldach68630
  1. The first thing I tried (before zip) was just changing the input of the rule to your suggestion.

Switch:

rule qc_before_trim:
    input:
        r1 = expand('{FASTQ_DIR}/{sample}_R1.fastq.gz', FASTQ_DIR=sample_locations, sample=sample_names),
        r2 = expand('{FASTQ_DIR}/{sample}_R2.fastq.gz', FASTQ_DIR=sample_locations, sample=sample_names)

into

rule qc_before_trim:
  input:
    r1 = "{sample}_R1.fastq.gz"
    r2 = "{sample}_R2.fastq.gz"

However, it's looking for the input in QC directory (which is the output of this rule and not the input)

(snakemake) [moldach@cedar5 DEMO]$ snakemake -n -r
Building DAG of jobs...
MissingInputException in line 53 of /scratch/moldach/MADDOG/SNAKEMAKE/DEMO/Snakefile:
Missing input files for rule qc_before_trim:
QC/fastQC/before_trim/BC1217_R2.fastq.gz
QC/fastQC/before_trim/BC1217_R1.fastq.gz
  1. If I don't need to use expand() in target rules I won't - I think your suggestion improves code readability - however, it doesn't seem to be working...

  2. When you mention intermediate targets I assume you're referring to the coovar rule?

expand('{ANNOT_DIR}/{ANNOT_TOOL}/{sample}/intermediate-files/{sample}_sorted_dedupped_snp_varscan.tsv.del', ANNOT_DIR=dirs_dict["ANNOT_DIR"], ANNOT_TOOL=config["ANNOT_TOOL"], sample=sample_names),

These are a temporary files created by coovar - I do not necessarily require them downstream.

Given what you've said in #1 & #3 (if this does work somehow), and what's in Step 6. of the Snakemake Advanced manual could I do the following to have them deleted?

temp("{sample}/intermediate-files/{sample}_sorted_dedupped_snp_varscan.tsv.del")
  1. Here is my attempt at using zip:
import pandas as pd
# getting the samples information (names, path to r1 & r2) from samples.txt
samples_information = pd.read_csv("samples.txt", sep='\t', index_col=False)
# get a list of the sample names
sample_names = list(samples_information['sample'])
sample_locations = list(samples_information['location'])
samples_set = list(map("".join, zip(map(str,sample_locations), sample_names)))
  

I can figure out how to include them as input using expand() then but I'm not sure how to do otherwise:

rule qc_before_trim:
    input:
        r1 = expand('{sample}_R1.fastq.gz', sample=samples_set),
        r2 = expand('{sample}_R2.fastq.gz', sample=samples_set)

So to summarize, your suggestions for eliminating redundancy in the expand() statements and using zip() are wonderful and seem to do the trick - thank you!

Making the code more readable by entirely doing away with expand() would be excellent but I seem to be having some issues with this still.

ADD REPLYlink modified 5 days ago • written 5 days ago by moldach68630

ok yes you can prepend your input with directories i.e r1 = os.path.join(FASTQ_DIR,"{sample}_R1.fastq.gz"). that's ok i would just avoid expand in wildcards rules.

now for the zip thing you are close. you can just use samples_set to create a lookup function def getHome(sample): return(os.path.join(dict(samples_set)[sample],sample))) and then set your directory in the input to mydir = lambda wildcards: getHome(wildcards.sample).

ADD REPLYlink modified 5 days ago • written 5 days ago by Jeremy Leipzig19k

Sorry but I'm still confused as to what I need to do.

Your suggestion is to use os.path.join() instead of expand() in wildcards rules, correct?

FASTQ_DIR hasn't be defined anywhere so I cannot just use that - I assumed you meant you need to set mydir instead?

Your def getHome(sample) function may have contained an extra ) at the end because I got an error. I've tried the following but get an error:

import pandas as pd
# getting the samples information (names, path to r1 & r2) from samples.txt
samples_information = pd.read_csv("samples.txt", sep='\t', index_col=False)
# get a list of the sample names
sample_names = list(samples_information['sample'])
sample_locations = list(samples_information['location'])
samples_set = list(map("".join, zip(map(str,sample_locations), sample_names)))

def getHome(sample):
 return(os.path.join(dict(samples_set)[sample],sample))

rule qc_before_trim:
    input:
        mydir = lambda wildcards: getHome(wildcards.sample),
        r1 = os.path.join(mydir, "{sample}_R1.fastq.gz"),
        r2 = os.path.join(mydir, "{sample}_R2.fastq.gz")

(snakemake) [moldach@cedar5 DEMO]$ snakemake -n -r NameError in line 96 of /scratch/moldach/MADDOG/SNAKEMAKE/DEMO/Snakefile: name 'mydir' is not defined File "/scratch/moldach/MADDOG/SNAKEMAKE/DEMO/Snakefile", line 96, in <module>

ADD REPLYlink written 3 days ago by moldach68630

ok there's many ways to skin a cat but this is one strategy: def getHome(sample): return(os.path.join(dict(samples_set)[sample],"{0}_{1}.fastq.gz".format(sample,pair)) for pair in ['R1','R2'])

r1 = lambda wildcards: getHome(wildcards.sample)[0],

r2 = lambda wildcards: getHome(wildcards.sample)[1]

ADD REPLYlink modified 1 day ago • written 1 day ago by Jeremy Leipzig19k

I'm getting the following error:

(snakemake) [moldach@cedar1 DEMO]$ snakemake -n -r Building DAG of jobs... InputFunctionException in line 56 of /scratch/moldach/MADDOG/SNAKEMAKE/DEMO/Snakef AttributeError: 'Wildcards' object has no attribute 'sample' Wildcards:

# Directories------------------------------------------------------------------

configfile: "config.yaml"

# Setting the names of all directories
dir_list = ["LOG_DIR", "BENCHMARK_DIR", "QC_DIR", "TRIM_DIR", "ALIGN_DIR", "MARKDUP_DIR", "CALLING_DIR", "ANNOT_DIR"]
dir_names = ["LOGS", "BENCHMARKS", "QC", "TRIMMING", "ALIGNMENT", "MARK_DUPLICATES", "VARIANT_CALLING", "ANNOTATION"]
dirs_dict = dict(zip(dir_list, dir_names))

# Load Samples ----------------------------------------------------------------

import pandas as pd
# getting the samples information (names, path to r1 & r2) from samples.txt
samples_information = pd.read_csv("samples.txt", sep='\t', index_col=False)
# get a list of the sample names
sample_names = list(samples_information['sample'])
sample_locations = list(samples_information['location'])
samples_set = list(map("".join, zip(map(str,sample_locations), sample_names)))

# Rules -----------------------------------------------------------------------

rule all:
    '''
        The target rule for the workflow.
        This lists all of the final products of the workflow

    '''
    input:
        expand('{QC_DIR}/{QC_TOOL}/before_trim/{sample}_{pair}_fastqc.{ext}', QC_DIR=dirs_dict["QC_DIR"], QC_TOOL=config["QC_TOOL"], sample=sample_names, pair=['R1', 'R2'], ext=['html', 'zip']),

def getHome(sample): 
  return(os.path.join(dict(samples_set)[sample],"{0}_{1}.fastq.gz".format(sample,pair)) for pair in ['R1','R2'])


rule qc_before_trim:
    input:
        r1 = lambda wildcards: getHome(wildcards.sample)[0],
        r2 = lambda wildcards: getHome(wildcards.sample)[1]
    output:
        expand('{QC_DIR}/{QC_TOOL}/before_trim/{sample}_{pair}_fastqc.{ext}', QC_DIR=dirs_dict["QC_DIR"], QC_TOOL=config["QC_TOOL"], sample=sample_names, pair=['R1', 'R2'], ext=['html', 'zip'])
    log: expand('{LOG_DIR}/{sample}-{QC_TOOL}-qc-before-trim.log', QC_TOOL=config["QC_TOOL"], LOG_DIR=dirs_dict["LOG_DIR"], sample=sample_names)
    resources:
        mem = 1000,
        time = 30
    threads: 1
    params:
        dir = expand('{QC_DIR}/{QC_TOOL}/before_trim/', QC_DIR=dirs_dict["QC_DIR"], QC_TOOL=config["QC_TOOL"])
    message: """--- Quality check of raw data with FastQC before trimming."""
    shell: """
        module load fastqc/0.11.5;
    fastqc -o {params.dir} -f fastq {input.r1} &
    fastqc -o {params.dir} -f fastq {input.r2}
        """
ADD REPLYlink written 1 day ago by moldach68630

don't use expand in wildcard rules

ADD REPLYlink written 1 day ago by Jeremy Leipzig19k

Do you mean in the output of the rule?

output:
        expand('{QC_DIR}/{QC_TOOL}/before_trim/{sample}_{pair}_fastqc.{ext}', QC_DIR=dirs_dict["QC_DIR"], QC_TOOL=config["QC_TOOL"], sample=sample_names, pair=['R1', 'R2'], ext=['html', 'zip']

If so does this apply to log: and params: as well?

I have no idea how not use expand() for this statement as QC_TOOL is defined in the config file QC_TOOL=config["QC_TOOL"]

I've even tried to simplify the rule all and the output to not use expand and as you can see I'm struggling:

rule all:
    input:
        expand('{sample}_{pair}_fastqc.{ext}', sample=sample_names, pair=['R1', 'R2'], ext=['html', 'zip']),

def getHome(sample):
  return(os.path.join(dict(samples_set)[sample],"{0}_{1}.fastq.gz".format(sample,pair)) for pair in ['R1','R2'])

rule qc_before_trim:
    input:
        r1 = lambda wildcards: getHome(wildcards.sample)[0],
        r2 = lambda wildcards: getHome(wildcards.sample)[1]
    output:
        ["{sample}_R1_fastqc.html".format(sample=sample) for sample in samples_set],
        ["{sample}_R1_fastqc.zip".format(sample=sample) for sample in samples_set],
        ["{sample}_R2_fastqc.html".format(sample=sample) for sample in samples_set],
        ["{sample}_R2_fastqc.zip".format(sample=sample) for sample in samples_set]
    resources:
        mem = 1000,
        time = 30
    threads: 1
    message: """--- Quality check of raw data with FastQC before trimming."""
    shell: """
        module load fastqc/0.11.5;
        fastqc -o ./ -f fastq {input.r1} &
        fastqc -o ./ -f fastq {input.r2}
        """


(snakemake) [moldach@cedar1 DEMO]$ snakemake -n -r
Building DAG of jobs...
MissingInputException in line 23 of /scratch/moldach/MADDOG/SNAKEMAKE/DEMO/Snakefile:
Missing input files for rule all:
MTG109_R2_fastqc.html
BC1217_R2_fastqc.zip
BC1217_R2_fastqc.html
MTG109_R1_fastqc.html
470_R2_fastqc.zip
470_R2_fastqc.html
BC1217_R1_fastqc.html
470_R1_fastqc.zip
MTG109_R1_fastqc.zip
470_R1_fastqc.html
BC1217_R1_fastqc.zip
MTG109_R2_fastqc.zip
ADD REPLYlink modified 1 day ago • written 1 day ago by moldach68630

Okay so switching to this now produces this error:

rule all:
    input:
        expand('{sample}_{pair}_fastqc.{ext}', sample=sample_names, pair=['R1', 'R2'], ext=['html', 'zip']),

def getHome(sample):
  return(os.path.join(dict(samples_set)[sample],"{0}_{1}.fastq.gz".format(sample,pair)) for pair in ['R1','R2'])

rule qc_before_trim:
    input:
        r1 = lambda wildcards: getHome(wildcards.sample)[0],
        r2 = lambda wildcards: getHome(wildcards.sample)[1]
    output:
        "{sample}_R1_fastqc.html",
        "{sample}_R1_fastqc.zip",
        "{sample}_R2_fastqc.html",
        "{sample}_R2_fastqc.zip"
    resources:
        mem = 1000,
        time = 30
    threads: 1
    message: """--- Quality check of raw data with FastQC before trimming."""
    shell: """
        module load fastqc/0.11.5;
        fastqc -o ./ -f fastq {input.r1} &
        fastqc -o ./ -f fastq {input.r2}
        """
(snakemake) [moldach@cedar1 DEMO]$ snakemake -n -r
Building DAG of jobs...
InputFunctionException in line 36 of /scratch/moldach/MADDOG/SNAKEMAKE/DEMO/Snakefile:
TypeError: 'generator' object is not subscriptable
Wildcards:
sample=BC1217
ADD REPLYlink modified 1 day ago • written 1 day ago by moldach68630

uhh test that getHome lookup function in isolation. dict(sample_set) should be a dictionary with the samples as keys and paths as values. the goal is to have a function that takes a sample and returns a list of 2 paths, one for each pair

ADD REPLYlink written 1 day ago by Jeremy Leipzig19k

ummm okay, so I'm not really sure how to troubleshoot/test this?

Python 3.8.0 (default, Nov 12 2019, 19:43:25)
[GCC 5.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> import pandas as pd
>>> # getting the samples information (names, path to r1 & r2) from samples.txt
... samples_information = pd.read_csv("samples.txt", sep='\t', index_col=False)
>>> # get a list of the sample names
... sample_names = list(samples_information['sample'])
>>> sample_locations = list(samples_information['location'])
>>> samples_set = list(map("".join, zip(map(str,sample_locations), sample_names)))
>>> samples_set
['/home/moldach/projects/def-mtarailo/common/data/strains_ucalgary/fastq/BC1217', '/home/moldach/pilo/tamaroi/data/470', '/home/moldach/projects/def-mtarailo/tamaroi/data/strains_ucalgary/fastq_MT
>>> def getHome(sample):
...   return(os.path.join(dict(samples_set)[sample],"{0}_{1}.fastq.gz".format(sample,pair)) for pa)
...
>>> getHome(sample)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'sample' is not defined
>>> getHome()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: getHome() missing 1 required positional argument: 'sample'
>>> getHome(MTG109)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'MTG109' is not defined
>>> getHome(MTG109, R1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'MTG109' is not defined
>>> return(os.path.join(dict(samples_set)[sample],"{0}_{1}.fastq.gz".format(sample,pair)) for pair
...
...
...
...
KeyboardInterrupt
>>>
>>> return(os.path.join(dict(samples_set)[sample],"{0}_{1}.fastq.gz".format(sample,pair)) for pair
...
...
KeyboardInterrupt
>>> dict(samples_set)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: dictionary update sequence element #0 has length 77; 2 is required
>>> dict(samples_set)[sample]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: dictionary update sequence element #0 has length 77; 2 is required
>>>

Not able to run the function getHome() as-is so I tried breaking it down into sections. Calling dict(samples_set) gives this error so perhaps samples_set is wrong?

I read somewhere that the issue could be because of feeding in a list instead of a tuple to dict():

>>> dict(['A', 'b'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: dictionary update sequence element #0 has length 1; 2 is required
>>> dict([('A', 'b')])
{'A': 'b'}

Hmmm....okay....

>>> test = ['A', 'B']
>>> test
['A', 'B']
>>> test2 = tuple(test)
>>> test2
('A', 'B')
>>> dict([test2])
{'A': 'B'}

okay now that I verified this works let's try on samples_set:

>>> samples_set
['/home/moldach/projects/def-mtarailo/common/data/strains_ucalgary/fastq/BC1217', '/home/moldach/projects/def-mtarailo/tamaroi/data/470', '/home/moldach/projects/def-mtarailo/tamaroi/data/strains_ucalgary/fastq_MTG/MTG109']
>>> samples_set2 = tuple(samples_set)
>>> dict([samples_set2])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: dictionary update sequence element #0 has length 3; 2 is required

Omg I have no idea what's going here....

Appreciate your help figuring this out

ADD REPLYlink modified 12 hours ago • written 13 hours ago by moldach68630

I believe one of the errors is that samples_set was not defined properly. Maybe it should look like this instead?

import os
import pandas as pd
# getting the samples information (names, path to r1 & r2) from samples.txt
samples_information = pd.read_csv("samples.txt", sep='\t', index_col=False)
# get a list of the sample names
sample_names = list(samples_information['sample'])
sample_locations = list(samples_information['location'])
samples_set = zip(sample_names, sample_locations)

Now I can try:

>>> dict(samples_set)
{'BC1217': '/home/moldach/projects/def-mtarailo/common/data/strains_ucalgary/fastq/', '470': '/home/moldach/projects/def-mtarailo/tamaroi/data/', 'MTG109': '/home/moldach/projects/def-mtarailo/tamaroi/data/strains_ucalgary/fastq_MTG/'}

So that seems to work (although I'm not sure if the order of sample/location is correct?

Not sure how to test the getHome() function further...

>>> dict(samples_set)
{'BC1217': '/home/moldach/projects/def-mtarailo/common/data/strains_ucalgary/fastq/', '470': '/home/moldach/projects/def-mtarailo/tamaroi/data/', 'MTG109': '/home/moldach/projects/def-mtarailo/tamaroi/data/strains_ucalgary/fastq_MTG/'}
>>> def getHome(sample):
...   return(os.path.join(dict(samples_set)[sample],"{0}_{1}.fastq.gz".format(sample,pair)) for pair in ['R1','R2'])
...
>>> getHome(0)
<generator object getHome.<locals>.<genexpr> at 0x7fde13695190>
>>> getHome(1)
<generator object getHome.<locals>.<genexpr> at 0x7fde136950b0>
>>> getHome(BC1217)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'BC1217' is not defined
>>> getHome()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: getHome() missing 1 required positional argument: 'sample'
>>> getHome(sample)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'sample' is not defined
>>> samples_set = zip(sample_locations, sample_names)
>>> getHome(sample)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'sample' is not defined
>>> getHome(BC1217)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'BC1217' is not defined
>>> getHome(0)
<generator object getHome.<locals>.<genexpr> at 0x7fde13695120>
>>> os.path.join(dict(samples_set)[sample]
...
... )
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'os' is not defined
>>> import os
>>> os.path.join(dict(samples_set)[sample])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'sample' is not defined
>>> os.path.join(dict(samples_set)[BC1217])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'BC1217' is not defined
ADD REPLYlink written 12 hours ago by moldach68630

ok i guess you have to explicitly make sure python runs the generator (notice the list cast)

def getHome(sample):
return(list(os.path.join(sample_dict[sample],"{0}_{1}.fastq.gz".format(sample,pair)) for pair in ['R1','R2']))

ADD REPLYlink modified 7 hours ago • written 7 hours ago by Jeremy Leipzig19k

Okay so you used sample_dict here which I suppose is the same as dict(samples_set). I noticed I had the order for the key wrong earlier in samples_set so I've fixed that now.

The following change to the snakemake file:

import os
import pandas as pd
# getting the samples information (names, path to r1 & r2) from samples.txt
samples_information = pd.read_csv("samples.txt", sep='\t', index_col=False)
# get a list of the sample names
sample_names = list(samples_information['sample'])
sample_locations = list(samples_information['location'])
samples_set = zip(sample_names, sample_locations)

# Rules -----------------------------------------------------------------------

rule all:
    input:
        expand('{sample}_{pair}_fastqc.{ext}', sample=sample_names, pair=['R1', 'R2'], ext=['html', 'zip']),

def getHome(sample):
  return(list(os.path.join(dict(samples_set)[sample],"{0}_{1}.fastq.gz".format(sample,pair)) for pair in ['R1','R2']))


rule qc_before_trim:
    input:
        r1 = lambda wildcards: getHome(wildcards.sample)[0],
        r2 = lambda wildcards: getHome(wildcards.sample)[1]
    output:
        "{sample}_R1_fastqc.html",
        "{sample}_R1_fastqc.zip",
        "{sample}_R2_fastqc.html",
        "{sample}_R2_fastqc.zip"
    resources:
        mem = 1000,
        time = 30
    threads: 1
    message: """--- Quality check of raw data with FastQC before trimming."""
    shell: """
        module load fastqc/0.11.5;
        fastqc -o ./ -f fastq {input.r1} &
        fastqc -o ./ -f fastq {input.r2}
        """

But I'm getting the error:

(snakemake) [moldach@cedar1 DEMO]$ snakemake -n -r
Building DAG of jobs...
InputFunctionException in line 37 of /scratch/moldach/MADDOG/SNAKEMAKE/DEMO/Snakefile:
KeyError: 'BC1217'
Wildcards:
sample=BC1217
ADD REPLYlink written 7 hours ago by moldach68630

ok what does dict(samples_set) look like? is BC1217 a key?

ADD REPLYlink written 6 hours ago by Jeremy Leipzig19k
>>> dict(samples_set)
{'BC1217': '/home/moldach/projects/def-mtarailo/common/data/strains_ucalgary/fastq/', '470': '/home/moldach/projects/def-mtarailo/tamaroi/data/', 'MTG109': '/home/moldach/projects/def-mtarailo/tamaroi/data/strains_ucalgary/fastq_MTG/'}
ADD REPLYlink written 5 hours ago by moldach68630

I think this has something to do with you mixing integers and strings in your keys (sample names). Make sure you explicitly state that 470 is a string when you build the dictionary and the Snakemake target.

ADD REPLYlink modified 5 hours ago • written 5 hours ago by Jeremy Leipzig19k

I've spent some time searching for this but I have no idea how to do such a thing in Python (I use R).

There was this example of how to do the opposite of what you suggested.:

The original string : {"Nikhil" : 1, "Akshat" : 2, "Akash" : 3}
The converted dictionary : {'Nikhil': 1, 'Akshat': 2, 'Akash': 3}

So I thought to myself, okay since:

>>> samples_set = zip(sample_names, sample_locations)
>>> sample_names
['BC1217', '470', 'MTG109']

I should just convert sample_names to " instead of ' but this SO post say's you cannot so I'm confused...

Could you please explain how I can do this?

ADD REPLYlink written 2 hours ago by moldach68630

Actually before we get _too_ sidetracked...

I've removed the 470 sample from the samples.txt just to simplify things - the error persists. Besides, the error was, and still is, pointing to BC1217.

Building DAG of jobs...
InputFunctionException in line 32 of /scratch/moldach/MADDOG/SNAKEMAKE/DEMO/Snakefile:
KeyError: 'BC1217'
Wildcards:
sample=BC1217
ADD REPLYlink written 2 hours ago by moldach68630
2
gravatar for Jeremy Leipzig
5 days ago by
Philadelphia, PA
Jeremy Leipzig19k wrote:

OK lemme break this down into three four heuristics (conveniently numbered 1, 1, 1, and 2)

  1. Keep your implicit wildcard rules as simple as possible. Describe what they do - no more, no less. There is generally no need for expand in these as far as samples go. expand will collide with wildcards anyway which will force you to use double brackets to escape.
rule qc_before_trim:
  input:
    r1 = "{sample}_R1.fastq.gz"
    r2 = "{sample}_R2.fastq.gz"
  

that's it. there's no need to lock all these directories down when it's already specified as a parameter or in the file path itself. right?

  1. if you're going to use expand in target rules, well then really use them to their fullest capability. your target rule is way too redundant
expand('{QC_DIR}/{QC_TOOL}/before_trim/{sample}_{pair}_fastqc.{ext}', QC_DIR=dirs_dict["QC_DIR"], QC_TOOL=config["QC_TOOL"],sample=sample_names,pair=['R1','R2'],ext=['html','zip']),
  
  1. you don't need intermediate targets in a target rule, either they get created and you explicitly delete them with temp() or they stick around. either way they will be created. Intermediate rules are fine for debugging, but intermediate files don't belong in final targets. So how about trying one final target with a report file and everything that leads up to that file will be the universe of intermediates?

  2. expand is not the same as python zip, it will try to generate all combinations, so if you have only certain matching pairs of target directory and files that are in a static file just iterate over them with zip and you'll be good to go

ADD COMMENTlink modified 5 days ago • written 5 days ago by Jeremy Leipzig19k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1236 users visited in the last hour