Question: snakemake input pattern
2
gravatar for shexuan523392
17 months ago by
shexuan52339220 wrote:

I am getting started with Snakemake and I have a very basic question which I couldnt find the answer in snakemake tutorial. I want to creat a pipe that could accept multiple input files, for example, several samples with different lanes. I can create the pipe that deal with just one sample, Snakefile and config.yaml are as follows :

# Snakefile
configfile: "config.yaml"

SAMPLES = []
samp_lane = []
for samp,lane in config["samples"].items():
    SAMPLES.append(samp)
    samp_lane = lane


rawdir = config["raw_dir"]
Samples=config["samples"]
lanes=config["samples"]

rule all:
    input:
        expand("mapping/{sample}/{sample}_L{lane}.bam", sample=SAMPLES,lane=samp_lane)

rule bwa_map:
    input:
        R1 = config["raw_dir"]+"/{sample}_L{lane}_R1.fq.gz",
        R2 = config["raw_dir"]+"/{sample}_L{lane}_R2.fq.gz"
    output:
        "mapping/{sample}/{sample}_L{lane}.bam"
    log:
        "log/bwa_map.log"
    params:
        ref = config["refgenome"]
    threads: 8
    shell:
        "echo {input.R1} {input.R2}  > {output} 2>{log}"

here is config file

samples:
    KPGP-00001: [1,2,3,4,5,6] 
  #  KPGP-00002: [1,2]

raw_dir: "/home/sxuan/WGS/rawdata" 

refgenome: "hg19"

This could successfully deal with one sample input,but failed while add another sample "KPGP-0002". How to modify the code to process multi samples with consistent pattern? Sorry for my english and thanks in advance.

• 3.2k views
ADD COMMENTlink modified 17 months ago by dariober10k • written 17 months ago by shexuan52339220
SAMPLES = []
samp_lane = []
for samp,lane in config["samples"].items():
    SAMPLES.append(samp)
    samp_lane = lane

I'm guessing that should be samp_lane.append(samp) in the for loop?

Either way, the wildcards probably won't work as you expect them to. It might be considerably easier if you just merge your input files so you have a single file for each sample.

ADD REPLYlink modified 17 months ago • written 17 months ago by dr_bantz80
2
gravatar for dariober
17 months ago by
dariober10k
WCIP | Glasgow | UK
dariober10k wrote:

If I understand your question correctly, you are dealing with two unrelated problems. One is how to pass multiple pairs of fastq files (from the same sample but from different lanes) to bwa. The other problem is how to make snakemake accept multiple files in input linked to a single wildcard in output.

For the first issues, you can concatenate fastq files as @dr_bantz suggests but this could mean creating copies of possibly massive files. I've been dealing with this issue recently and I wrote a script to concatenate and interleave multiple pairs of fastq files which can then be piped to bwa (or anything that accepts interleaved files from stdin). Script is here catInterleaveFastq.

For the second problem I think something along these lines would work, probably it could be even simplified (see snakemake Functions as Input Files):

def get_fq1(wildcards):
    # code that returns a list of fastq files for read 1 based on *wildcards.sample* e.g.
    return sorted(glob.glob(wildcards.sample + '*_R1_001.fastq.gz'))

def get_fq2(wildcards):
    # code that returns a list of fastq files for read 2 based on *wildcards.sample*, e.g.
    return sorted(glob.glob(wildcards.sample + '*_R2_001.fastq.gz'))

rule bwa_map:
    input:
        fq1= get_fq1,
        fq2= get_fq2,
    output:
        bam= '{sample}.bam'
    shell:
        """
        catInterleaveFastq.sh -1 {input.fq1} \
                              -2 {input.fq2} \
        | bwa mem -p genome.fa - \
        | ... possibly pipe to "samtools sort"  etc \
        > {output}
        """

Hope this helps...

EDIT: Add -p to bwa command to enable detection of interleaved input.

ADD COMMENTlink modified 17 months ago • written 17 months ago by dariober10k

Thanks a lot! Merging all lane data into one lane is good idea and easy to implement, but i guess there may exist a problem -- if we concat all fq into one file, mapping may cost much more time. And I just think of another interesting idea -- I wrote a scripts to classify all samples into a single directory with sample name, then I could write a shell scripts to execute a Snakefile for each sample. And the final directory tree may look like this:

wgs_test/
├── mapping
│   ├── samp1
│   │   └── sam1.bam
│   └── samp2
│       └── samp2.bam
└── rawdata
    ├── samp1
    │   ├── samp1_R1.fq
    │   └── samp1_R2.fq
    └── samp2
        ├── samp2_R1.fq
        └── samp2_R2.fq

And I am going to make it...

ADD REPLYlink modified 17 months ago • written 17 months ago by shexuan52339220
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1628 users visited in the last hour