Align split-reads with bowtie-2 using snakemake
2.2 years ago
Ivan ▴ 60

I want to use snakemake to map split-read (paired-end) files to a reference genome with bowtie2, and I'd like that rule to be integrated into the general workflow. For that purpose, I first defined a rule to build the bowtie2 index:

rule build_bowtie_index:
    input:
        referenceGenomeFasta
    output:
        expand("{name}.{index}.bt2", index=range(1,5), name = referenceGenomeName),
        expand("{name}.rev.{index}.bt2", index=range(1,3), name = referenceGenomeName)
    params:
        btindex = referenceGenomeName
    shell:
        "bowtie2-build {input} {params.btindex}"

referenceGenomeFasta is the reference genome fasta file, while referenceGenomeName is the same path without the .fna suffix. The goal is to take a reference genome file foo.fna and produce the index files foo.1.bt2, foo.2.bt2, ..., foo.rev.1.bt2, foo.rev.2.bt2, because bowtie2 expects the index argument (the -x flag) to be the basename of those files up to the final dot. So I figured I'd store referenceGenomeName as a global configurable variable.
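
In case it helps, this is roughly how those globals can be derived at the top of the Snakefile (a minimal sketch; the file name foo.fna is only an illustration, the variable names are the ones used above):

import os

# path to the reference fasta; adjust to the actual file
referenceGenomeFasta = "foo.fna"
# index basename for bowtie2 -x: the same path with the .fna suffix stripped
referenceGenomeName = os.path.splitext(referenceGenomeFasta)[0]   # -> "foo"

If the snakemake version is recent enough, the two expand() calls in the output of build_bowtie_index can also be written more compactly as multiext(referenceGenomeName, ".1.bt2", ".2.bt2", ".3.bt2", ".4.bt2", ".rev.1.bt2", ".rev.2.bt2").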

Now I have a set of files containing split reads, named sampleFoo_1.fastq, sampleFoo_2.fastq, sampleBar_1.fastq, sampleBar_2.fastq, etc. Back when I used bash scripts for mapping, the command per sample was:

bowtie2 -x {myindex} -1 {sample}_1.fastq -2 {sample}_2.fastq | samtools view -bS > {sample}.aligned.bam

I also used parallel processing via Sun Grid Engine to speed up the computation, as I have a lot of samples. Now I want to move the workflow into snakemake. My rule for bowtie2 mapping must take my samples plus the outputs of the build_bowtie_index rule as input. The whole thing looks like this:

SAMPLES = ["Foo", "Bar"]
rule align_via_bowtie2: 
    input:
        firstPair=expand("{rawSampleFolder}/{samples}_1.fastq",samples=SAMPLES, rawSampleFolder=rawSampleFolder),
        secondPair=expand("{rawSampleFolder}/{samples}_2.fastq",samples=SAMPLES, rawSampleFolder=rawSampleFolder),
        btindex1=expand("{name}.{index}.bt2", index=range(1,5), name = referenceGenomeName),
        btindex2=expand("{name}.rev.{index}.bt2", index=range(1,3), name = referenceGenomeName)
    params:
        btindex = referenceGenomeName
    output:
        expand("{proccSampleFolder}/{samples}.aligned.bam", samples=SAMPLES, proccSampleFolder=proccSampleFolder)
    shell:
        "bowtie2 -x {params.btindex} -1 {input.firstPair} -2 {input.secondPair} | samtools view -bS -> {output}"

When I run this with SAMPLES = ["Foo"], it works perfectly, and what gets executed is:

 bowtie2 -x ref_genome -1 raw_samples/Foo_1.fastq -2 raw_samples/Foo_2.fastq | samtools view -bS -> proccessed_samples/Foo.aligned.bam

But when I run the script with SAMPLES=["Foo","Bar"], what snakemake executes is:

 bowtie2 -x ref_genome -1 raw_samples/Foo_1.fastq raw_samples/Bar_1.fastq -2 raw_samples/Foo_2.fastq raw_samples/Bar_2.fastq | samtools view -bS -> proccessed_samples/Foo.aligned.bam proccessed_samples/Bar.aligned.bam

In other words, instead of mapping Foo first and Bar second as two separate jobs, it passes both samples to a single bowtie2 call, and I get an error. What I want is for each sample (each pair of split-read files) to be processed as its own job. What am I doing wrong?

2.2 years ago
Ivan ▴ 60

After some tinkering, I found a working solution. The fix was to use a {sample} wildcard in the mapping rule (and the rules downstream of it) and to expand() over SAMPLES only in rule all, so snakemake creates a separate job for each sample instead of passing all samples to a single bowtie2 call. I'm typing it below:

SAMPLES = ["Foo", "Bar"]  # a defined Python list of sample names
referenceGenomeName="ref_genome"

rule all:
    input:
        expand("processed_files/{sample}.bam.bai",sample=SAMPLES),
rule align_via_bowtie2:
    input:
        "raw_samples/{sample}_1.fastq",
        "raw_samples/{sample}_2.fastq",
        expand("{name}.{index}.bt2", index=range(1,5), name = referenceGenomeName),
        expand("{name}.rev.{index}.bt2", index=range(1,3), name = referenceGenomeName)
    params:
        btindex = referenceGenomeName
    output:
        temp("processed_files/{sample}.aligned.bam")
    shell:
        "bowtie2 -x {params.btindex} -1 {input[0]} -2 {input[1]} | samtools view -bS -> {output}"
rule samtools_fixmate:
    input:
        "processed_files/{sample}.aligned.bam"
    output:
        temp("processed_files/{sample}.fixmated.bam")
    shell:
        "samtools fixmate -m -O {input} {output}"
rule samtools_sort:
    input:
        "processed_files/{sample}.fixmated.bam"
    output:
        temp("processed_files/{sample}.sorted.bam")
    shell:
        "samtools sort {input} -o {output}"
rule samtools_markduplicates:
    input:
        "processed_files/{sample}.sorted.bam"
    output:
        "processed_files/{sample}.markduped.bam"
    shell:
        "samtools markdup {input} {output}"
rule samtools_index:
    input:
        "processed_files/{sample}.markduped.bam"
    output:
        "processed_files/{sample}.bam.bai",
    shell:
        """
        samtools index {input}
        """

So far, only the first rule has actually been exercised - I got the correct files in the processed_files directory. The rest is just further processing. I've used the following command to start parallel jobs on a cluster:

snakemake -j 5 -p --cluster "qsub -V -cwd -now y"
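
Before submitting, a dry run is a quick way to confirm that snakemake now plans one bowtie2 job per sample (-n does a dry run, -p prints the shell commands that would be executed):

snakemake -n -p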