Question

Samtools index not working in Snakemake

0

Entering edit mode

20 months ago

Begonia_pavonina ▴ 210

I am setting up a Snakemake pipeline for sequencing reads alignment and variants calling. But the samtools index rule is not activated, and the subsequent haplotype caller rule fail.

I think it is because the samtools index rule is not perceived as necessary to execute the output of rule all by Snakemake, but not sure about it.

Has anyone experienced the same issue?

import os
import snakemake.io
import glob
import pandas as pd


###### Config file and sample sheets #####
configfile: "config/config.yaml"

# Load the samples table:
table=pd.read_csv("config/table_reads.tsv", delim_whitespace=True,  header=None, index_col=False)

# Use the samples table to make lists of samples names/lists:
SAMPLES=table.loc[:, 0].to_list()
READS=["1","2"]

# The first rule (here rule all) specifies the files that you would like to create during your snakemake workflow.
rule all:
    input:
        "calls/all.vcf"
        #expand("mapped/{sample}.bam", sample=SAMPLES)


# ALIGNMENT:

rule fastqc:
    input:
        expand("reads/{sample}.{read}.fastq.gz", sample=SAMPLES, read=READS)


    output:
        html="qc/fastqc/{sample}.html",
        zip="qc/fastqc/{sample}_fastqc.zip" # the suffix _fastqc.zip is necessary for multiqc to find the file. If not using multiqc, you are free to choose an arbitrary filename
    params:
        extra = "--quiet"
    log:
        "logs/fastqc/{sample}.log"
    threads: 1
    resources:
        mem_mb = 1024
    wrapper:
        "v2.6.0/bio/fastqc"

rule bwa_index:
    input:
        config["reference"]
        #"{genome}.fasta", 
    output:
        idx=multiext("{genome}", ".amb", ".ann", ".bwt", ".pac", ".sa"),
    log:
        "logs/bwa_index/{genome}.log",
    params:
        algorithm="bwtsw",
    wrapper:
        "v2.6.0/bio/bwa/index"

# https://snakemake-wrappers.readthedocs.io/en/stable/wrappers/bwa/mem.html
rule bwa_mem:
        input:
                reads=["reads/{sample}.1.fastq.gz", "reads/{sample}.2.fastq.gz"],
                idx=multiext("genome", ".amb", ".ann", ".bwt", ".pac", ".sa"),
        output:
                "mapped/{sample}.bam",
        log:
                "logs/bwa_mem/{sample}.log",
        params:
                extra=r"-R '@RG\tID:{sample}\tSM:{sample}'",
                sorting="none",  # Can be 'none', 'samtools' or 'picard'.
                sort_order="queryname",  # Can be 'queryname' or 'coordinate'.
                sort_extra="",  # Extra args for samtools/picard.
        threads: 8
        wrapper:
                "v2.9.1/bio/bwa/mem"


# CALLING VARIANTS

# https://snakemake-wrappers.readthedocs.io/en/stable/wrappers/samtools/sort.html
rule samtools_sort:
    input:
        "mapped/{sample}.bam",
    output:
        "mapped/{sample}.sorted.bam",
    log:
        "{sample}.log",
    params:
        extra="-m 4G",
    threads: 8
    wrapper:
        "v2.10.0/bio/samtools/sort"


# https://snakemake-wrappers.readthedocs.io/en/stable/wrappers/samtools/
rule samtools_index:
    input:
        "mapped/{sample}.sorted.bam",
    output:
        "mapped/{sample}.sorted.bam.bai",
    log:
        "logs/samtools_index/{sample}.log",
    params:
        extra="",  # optional params string
    threads: 4  # This value - 1 will be sent to -@
    wrapper:
        "v2.10.0/bio/samtools/index"


# https://snakemake-wrappers.readthedocs.io/en/stable/wrappers/gatk/haplotypecaller.html
rule haplotype_caller_gvcf:
    input:
        # single or list of bam files
        bam="mapped/{sample}.sorted.bam",
        ref=config["reference"],
        # known="dbsnp.vcf"  # optional
    output:
        gvcf="calls/{sample}.g.vcf",
    #       bam="{sample}.assemb_haplo.bam",
    log:
        "logs/gatk/haplotypecaller/{sample}.log",
    params:
        extra="",  # optional
        java_opts="",  # optional
    threads: 4
    resources:
        mem_mb=1024,
    wrapper:
        "v2.6.0/bio/gatk/haplotypecaller"



# https://snakemake-wrappers.readthedocs.io/en/stable/wrappers/gatk/genomicsdbimport.html
rule genomics_db_import:
    input:
        gvcfs=expand("calls/{sample}.g.vcf", sample=SAMPLES),
    output:
        db=directory("db"),
    log:
        "logs/gatk/genomicsdbimport.log",
    params:
        intervals=config["reference"],
        db_action="create",  # optional
        extra="",  # optional
        java_opts="",  # optional
    threads: 2
    resources:
        mem_mb=lambda wildcards, input: max([input.size_mb * 1.6, 200]),
    wrapper:
        "v2.6.0/bio/gatk/genomicsdbimport"


# https://snakemake-wrappers.readthedocs.io/en/stable/wrappers/gatk/genotypegvcfs.html
rule genotype_gvcfs:
    input:
        gvcf="db",  # combined gvcf over multiple samples
    # N.B. gvcf or genomicsdb must be specified
    # in the latter case, this is a GenomicsDB data store
        ref=config["reference"],
    output:
        vcf="calls/all.vcf",
    log:
        "logs/gatk/genotypegvcfs.log"
    params:
        extra="",  # optional
        java_opts="", # optional
    resources:
        mem_mb=1024
    wrapper:
        "v2.6.0/bio/gatk/genotypegvcfs"

snakemake samtools • 1.3k views

ADD COMMENT • link updated 20 months ago by rpolicastro 13k • written 20 months ago by Begonia_pavonina ▴ 210

score 2 · Answer 1 · 2023-11-10

2

Entering edit mode

20 months ago

rpolicastro 13k

In order for a rule to run at least one output of the rule needs to be specified either as an input to another rule, or an input in the all rule. Since you already use the bam you index in the haplotype_caller_gvcf rule, supply the bai index as an input to that rule also. If that breaks the wrapper specify it as an input to the all rule instead.

ADD COMMENT • link 20 months ago by rpolicastro 13k

1

Entering edit mode

It seems that your first solution works well, and does not break the wrapper. I have added idx="mapped/{sample}.sorted.bam.bai" to the inputs of the haplotype caller, and it works well. Thank you for the information!

ADD REPLY • link 20 months ago by Begonia_pavonina ▴ 210