Question

Snakemake - parameter file treated as a wildcard

0

Entering edit mode

4.2 years ago

kzkedzierska ▴ 290

I have written a pipeline in Snakemake. It's an ATAC-seq pipeline (bioinformatics pipeline to analyze genomics data from a specific experiment). Basically, until merging alignment step I use {sample_id} wildcard, to later switch to {sample} wildcard (merging two or more sample_ids into one sample).

working DAG here (for simplicity only one sample shown; orange and blue {sample_id}s are merged into one green {sample}

Tha all rule looks as follows:

configfile: "config.yaml"
SAMPLES_DICT = dict()

with open(config['SAMPLE_SHEET'], "r+") as fil:
    next(fil)
    for lin in fil.readlines():
        row = lin.strip("\n").split("\t")
        sample_id = row[0]
        sample_name = row[1]
        if sample_name in SAMPLES_DICT.keys():
            SAMPLES_DICT[sample_name].append(sample_id)
        else:
            SAMPLES_DICT[sample_name] = [sample_id]

SAMPLES = list(SAMPLES_DICT.keys())
SAMPLE_IDS = [sample_id for sample in SAMPLES_DICT.values() for sample_id in sample]
rule all:
    input:
        # Filtered alignment
        expand(os.path.join(config['FILTERED'], '{sample}.bam{ext}'),
               sample = SAMPLES,
               ext = ['', '.bai']),

        # multiqc report
        "multiqc_report.html"

    message:
        '\n#################### ATAC-seq pipeline #####################\n'
        'Running all necessary rules to produce complete output.\n'
        '############################################################'

_I know it's too messy, I should only leave the necessary bits, but here my understanding of snakemake fails cause I don't know what I have to keep and what I should delete._ edited: removed unnecessary input in the rule all.

This is working, to my knowledge exactly as I want.

However, I added a rule:

rule hmmratac:
    input:
        bam = os.path.join(config['FILTERED'], '{sample}.bam'),
        index = os.path.join(config['FILTERED'], '{sample}.bam.bai')
    output:
        model = os.path.join(config['HMMRATAC'], '{sample}.model'),
        gappedPeak = os.path.join(config['HMMRATAC'], '{sample}_peaks.gappedPeak'),
        summits = os.path.join(config['HMMRATAC'], '{sample}_summits.bed'),
        states = os.path.join(config['HMMRATAC'], '{sample}.bedgraph'),
        logs = os.path.join(config['HMMRATAC'], '{sample}.log'),
        sample_name = '{sample}'
    log:
        os.path.join(config['LOGS'], 'hmmratac', '{sample}.log')
    params:
        genomes = config['GENOMES'],
        blacklisted = config['BLACKLIST']
    resources:
        mem_mb = 32000
    message:
        '\n######################### Peak calling ########################\n'
        'Peak calling for {output.sample_name}\n.'
        '############################################################'
    shell:
        'HMMRATAC -Xms2g -Xmx{resources.mem_mb}m '
        '--bam {input.bam} --index {input.index} '
        '--genome {params.genome} --blacklist {params.blacklisted} '
        '--output {output.sample_name} --bedgraph true &> {log}'

And into the rule all, after filtering, before multiqc, I added:

    # Peak calling
    expand(os.path.join(config['HMMRATAC'], '{sample}.model'),
           sample = SAMPLES),

Relevant config.yaml fragments:

# Path to blacklisted regions
BLACKLIST: "/mnt/data/.../hg38.blacklist.bed"

# Path to chromosome sizes
GENOMES: "/mnt/data/.../hg38_sizes.genome"

# Path to filtered alignment
FILTERED: "alignment/filtered"

# Path to peaks
HMMRATAC: "peaks/hmmratac"

This is the error* I get (It goes on for every input and output of the rule). *Technically it's a warning but it halts execution of snakemake so I am calling it an error.

File path alignment/filtered//mnt/data/.../hg38.blacklist.bed.bam contains double '/'. This is likely unintended. It can also lead to inconsistent results of the file-matching approach used by Snakemake.
WARNING:snakemake.logging:File path alignment/filtered//mnt/data/.../hg38.blacklist.bed.bam contains double '/'. This is likely unintended. It can also lead to inconsistent results of the file-matching approach used by Snakemake.

It isn't actually ... - I just didn't feel safe providing an absolute path here.

For a couple of days, I have struggled with this error. Looked through the documentation, listened to the introduction. I understand that the above description is far from perfect (it is huge bc I don't even know how to work it down to provide minimal reproducible example...) but I am desperate and hope you can be patient with me.

Any suggestions as to how to google it, where to look for an error would be much appreciated.

I've posted it on StackOverflow but was told there's more chance of actually getting a response here.

snakemake ATAC-seq pipelines • 1.1k views

ADD COMMENT • link 4.2 years ago by kzkedzierska ▴ 290

0

Entering edit mode

somehow Snakemake is thinking your BLACKLIST path is relative instead of absolute. When you run Snakemake are you using some working directory argument? If you are desperate maybe symlink your /mnt/data/ directory to your working directory and stop using absolute paths.

ADD REPLY • link 4.2 years ago by Jeremy Leipzig 22k

score 0 · Answer 1 · 2020-02-06

0

Entering edit mode

4.2 years ago

kzkedzierska ▴ 290

Ok.

Moving the sample_name = {sample} from output to params.

Why? Don't know exactly, but since snakemake operates on dynamic files, having "a file" in a way of just wildcard probably interfered with that.

ADD COMMENT • link 4.2 years ago by kzkedzierska ▴ 290