Question: Merging raw Illumina FASTQ files with snakemake
gravatar for Jokhe
22 months ago by
Jokhe90 wrote:


I have sequenced DNA samples with Illumina NextSeq which produces 8 FASTQ files per sample. As a part of my snakemake pipeline I would like to merge these files into two larger FASTQ files (forward/reverse) and continue working with these files. I have tried to figure out how this could be done with 'snakemake', so far with no success. My starting point and goal is similar to following example:

#Input files for tumor sample:

# Input files for normal (reference) sample

 #Desired output would be:

# Original files can be deleted after merging is complete.

I have tried to construct two 'snakemake' rules that could complete the job but I just can't figure out how I should choose the wildcards. Especially difficult parts in input files are directory names (e.g. -131216-42520895 produced by Illumina BaseSpace) and sample ID in the FASTQ name (e.g. S1 and S2 in examples).

DIRECTORY, SAMPLES, IDS = glob_wildcards("{directory}/{sample}_S{id}_L001_R1_001.fastq.gz")
DIRECTORY, PATIENTS, IDS = glob_wildcards("{directory}/{patient}normal_S{id}_L001_R1_001.fastq.gz"

rule all:
input: expand('{patient}/{sample}_R1.fastq.gz', patient=PATIENTS, sample=SAMPLES),
       expand('{patient}/{sample}_R2.fastq.gz', patient=PATIENTS, sample=SAMPLES)

rule mergeFastqR1:
input: ?
output: {patient}/{sample}_R1.fastq.gz
shell: 'cat {input} > {output}'

rule mergeFastqR2:
input: ?
output: {patient}/{sample}_R1.fastq.gz
shell: 'cat {input} > {output}'

The number of wildcards is unequal in different parts of rules which causes problems when running the pipeline. I think that the use of wildcards is the most difficult thing in 'snakemake' especially if you don't have strong background in Python. I can perform the flollowing tas with bash but I would be more than grateful if someone could suggest how this job could be done properly with snakemake. If anyone has time to help and figure out some sort of solution, I own you my gratitude.

Thank you in advance!

ADD COMMENTlink written 22 months ago by Jokhe90

Try not listing an input, just an output and then use cat {wildcard.sample}_S*_R1.fastq.gz > {output} in the shell.

N.B., bash will expand globs (*) alphabetically. If you ever do something like this outside of bash, then double check that that's being done!

ADD REPLYlink modified 22 months ago • written 22 months ago by Devon Ryan90k

Thanks Devon, it works like a charm! It was so easy solution that I didn't even think about it.

ADD REPLYlink written 22 months ago by Jokhe90

Could you be more specific ? What should we put as input then ? How will the mergeFastqR1 rule looks like ? thanks

ADD REPLYlink written 15 months ago by Nicolas Rosewick7.6k
rule mergeFastqR1:
    output: "{patient}/{patient}{diseaseState}_R1.fastq.gz"
    shell: """
        mkdir -p {wildcards.patient}
        cat {wildcards.patient}{wildcards.diseaseState}*/*_R1_001.fastq.gz > {output}

Or something along those lines

ADD REPLYlink written 15 months ago by Devon Ryan90k

Except when you have a samples with _S in the name - it will merge them. e.g. if you have samples X and X_S - them samples will end up being combined.

ADD REPLYlink written 14 months ago by pauloneill270

Sure, but since such a case wasn't present there was no reason to include it in the solution.

ADD REPLYlink written 14 months ago by Devon Ryan90k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 673 users visited in the last hour