Question

snakemake input pattern

5

Entering edit mode

7.4 years ago

shexuan523392 ▴ 50

I am getting started with Snakemake and I have a very basic question which I couldnt find the answer in snakemake tutorial. I want to creat a pipe that could accept multiple input files, for example, several samples with different lanes. I can create the pipe that deal with just one sample, Snakefile and config.yaml are as follows :

# Snakefile
configfile: "config.yaml"

SAMPLES = []
samp_lane = []
for samp,lane in config["samples"].items():
    SAMPLES.append(samp)
    samp_lane = lane


rawdir = config["raw_dir"]
Samples=config["samples"]
lanes=config["samples"]

rule all:
    input:
        expand("mapping/{sample}/{sample}_L{lane}.bam", sample=SAMPLES,lane=samp_lane)

rule bwa_map:
    input:
        R1 = config["raw_dir"]+"/{sample}_L{lane}_R1.fq.gz",
        R2 = config["raw_dir"]+"/{sample}_L{lane}_R2.fq.gz"
    output:
        "mapping/{sample}/{sample}_L{lane}.bam"
    log:
        "log/bwa_map.log"
    params:
        ref = config["refgenome"]
    threads: 8
    shell:
        "echo {input.R1} {input.R2}  > {output} 2>{log}"

here is config file

samples:
    KPGP-00001: [1,2,3,4,5,6] 
  #  KPGP-00002: [1,2]

raw_dir: "/home/sxuan/WGS/rawdata" 

refgenome: "hg19"

This could successfully deal with one sample input，but failed while add another sample "KPGP-0002". How to modify the code to process multi samples with consistent pattern? Sorry for my english and thanks in advance.

software error next-gen Assembly alignment SNP • 10k views

ADD COMMENT • link updated 7.4 years ago by dariober 15k • written 7.4 years ago by shexuan523392 ▴ 50

0

Entering edit mode

SAMPLES = []
samp_lane = []
for samp,lane in config["samples"].items():
    SAMPLES.append(samp)
    samp_lane = lane

I'm guessing that should be samp_lane.append(samp) in the for loop?

Either way, the wildcards probably won't work as you expect them to. It might be considerably easier if you just merge your input files so you have a single file for each sample.

ADD REPLY • link 7.4 years ago by dr_bantz ▴ 110

score 4 · Answer 1 · 2018-01-31

If I understand your question correctly, you are dealing with two unrelated problems. One is how to pass multiple pairs of fastq files (from the same sample but from different lanes) to bwa. The other problem is how to make snakemake accept multiple files in input linked to a single wildcard in output.

For the first issues, you can concatenate fastq files as @dr_bantz suggests but this could mean creating copies of possibly massive files. I've been dealing with this issue recently and I wrote a script to concatenate and interleave multiple pairs of fastq files which can then be piped to bwa (or anything that accepts interleaved files from stdin). Script is here catInterleaveFastq.

For the second problem I think something along these lines would work, probably it could be even simplified (see snakemake Functions as Input Files):

def get_fq1(wildcards):
    # code that returns a list of fastq files for read 1 based on *wildcards.sample* e.g.
    return sorted(glob.glob(wildcards.sample + '*_R1_001.fastq.gz'))

def get_fq2(wildcards):
    # code that returns a list of fastq files for read 2 based on *wildcards.sample*, e.g.
    return sorted(glob.glob(wildcards.sample + '*_R2_001.fastq.gz'))

rule bwa_map:
    input:
        fq1= get_fq1,
        fq2= get_fq2,
    output:
        bam= '{sample}.bam'
    shell:
        """
        catInterleaveFastq.sh -1 {input.fq1} \
                              -2 {input.fq2} \
        | bwa mem -p genome.fa - \
        | ... possibly pipe to "samtools sort"  etc \
        > {output}
        """

Hope this helps...

EDIT: Add -p to bwa command to enable detection of interleaved input.