Question: Snakemake - parameter file treated as a wildcard
0
gravatar for kzkedzierska
10 months ago by
kzkedzierska160
University of Oxford
kzkedzierska160 wrote:

I have written a pipeline in Snakemake. It's an ATAC-seq pipeline (bioinformatics pipeline to analyze genomics data from a specific experiment). Basically, until merging alignment step I use {sample_id} wildcard, to later switch to {sample} wildcard (merging two or more sample_ids into one sample).

working DAG here (for simplicity only one sample shown; orange and blue {sample_id}s are merged into one green {sample}

Tha all rule looks as follows:

configfile: "config.yaml"
SAMPLES_DICT = dict()

with open(config['SAMPLE_SHEET'], "r+") as fil:
    next(fil)
    for lin in fil.readlines():
        row = lin.strip("\n").split("\t")
        sample_id = row[0]
        sample_name = row[1]
        if sample_name in SAMPLES_DICT.keys():
            SAMPLES_DICT[sample_name].append(sample_id)
        else:
            SAMPLES_DICT[sample_name] = [sample_id]

SAMPLES = list(SAMPLES_DICT.keys())
SAMPLE_IDS = [sample_id for sample in SAMPLES_DICT.values() for sample_id in sample]
rule all:
    input:
        # Filtered alignment
        expand(os.path.join(config['FILTERED'], '{sample}.bam{ext}'),
               sample = SAMPLES,
               ext = ['', '.bai']),

        # multiqc report
        "multiqc_report.html"

    message:
        '\n#################### ATAC-seq pipeline #####################\n'
        'Running all necessary rules to produce complete output.\n'
        '############################################################'

_I know it's too messy, I should only leave the necessary bits, but here my understanding of snakemake fails cause I don't know what I have to keep and what I should delete._ edited: removed unnecessary input in the rule all.

This is working, to my knowledge exactly as I want.

However, I added a rule:

rule hmmratac:
    input:
        bam = os.path.join(config['FILTERED'], '{sample}.bam'),
        index = os.path.join(config['FILTERED'], '{sample}.bam.bai')
    output:
        model = os.path.join(config['HMMRATAC'], '{sample}.model'),
        gappedPeak = os.path.join(config['HMMRATAC'], '{sample}_peaks.gappedPeak'),
        summits = os.path.join(config['HMMRATAC'], '{sample}_summits.bed'),
        states = os.path.join(config['HMMRATAC'], '{sample}.bedgraph'),
        logs = os.path.join(config['HMMRATAC'], '{sample}.log'),
        sample_name = '{sample}'
    log:
        os.path.join(config['LOGS'], 'hmmratac', '{sample}.log')
    params:
        genomes = config['GENOMES'],
        blacklisted = config['BLACKLIST']
    resources:
        mem_mb = 32000
    message:
        '\n######################### Peak calling ########################\n'
        'Peak calling for {output.sample_name}\n.'
        '############################################################'
    shell:
        'HMMRATAC -Xms2g -Xmx{resources.mem_mb}m '
        '--bam {input.bam} --index {input.index} '
        '--genome {params.genome} --blacklist {params.blacklisted} '
        '--output {output.sample_name} --bedgraph true &> {log}'

And into the rule all, after filtering, before multiqc, I added:

    # Peak calling
    expand(os.path.join(config['HMMRATAC'], '{sample}.model'),
           sample = SAMPLES),

Relevant config.yaml fragments:

# Path to blacklisted regions
BLACKLIST: "/mnt/data/.../hg38.blacklist.bed"

# Path to chromosome sizes
GENOMES: "/mnt/data/.../hg38_sizes.genome"

# Path to filtered alignment
FILTERED: "alignment/filtered"

# Path to peaks
HMMRATAC: "peaks/hmmratac"

This is the error* I get (It goes on for every input and output of the rule). *Technically it's a warning but it halts execution of snakemake so I am calling it an error.

File path alignment/filtered//mnt/data/.../hg38.blacklist.bed.bam contains double '/'. This is likely unintended. It can also lead to inconsistent results of the file-matching approach used by Snakemake.
WARNING:snakemake.logging:File path alignment/filtered//mnt/data/.../hg38.blacklist.bed.bam contains double '/'. This is likely unintended. It can also lead to inconsistent results of the file-matching approach used by Snakemake.

It isn't actually ... - I just didn't feel safe providing an absolute path here.

For a couple of days, I have struggled with this error. Looked through the documentation, listened to the introduction. I understand that the above description is far from perfect (it is huge bc I don't even know how to work it down to provide minimal reproducible example...) but I am desperate and hope you can be patient with me.

Any suggestions as to how to google it, where to look for an error would be much appreciated.

I've posted it on StackOverflow but was told there's more chance of actually getting a response here.

pipelines snakemake atac-seq • 271 views
ADD COMMENTlink modified 10 months ago • written 10 months ago by kzkedzierska160

somehow Snakemake is thinking your BLACKLIST path is relative instead of absolute. When you run Snakemake are you using some working directory argument? If you are desperate maybe symlink your /mnt/data/ directory to your working directory and stop using absolute paths.

ADD REPLYlink written 10 months ago by Jeremy Leipzig19k
0
gravatar for kzkedzierska
10 months ago by
kzkedzierska160
University of Oxford
kzkedzierska160 wrote:

Ok.

Moving the sample_name = {sample} from output to params.

Why? Don't know exactly, but since snakemake operates on dynamic files, having "a file" in a way of just wildcard probably interfered with that.

ADD COMMENTlink written 10 months ago by kzkedzierska160
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1051 users visited in the last hour