beginner snakemake scatter gather help
1
1
Entering edit mode
5.2 years ago
Richard ▴ 590

Hi folks,

I usually use bash scripts to get things done, and I an going to try my hand at setting up some snakemake scripts to see if I can improve some automation and tracking for some currently ad-hoc analyses.

I've gone through the examples on this page: https://snakemake.readthedocs.io/en/stable/tutorial/tutorial.html

But I'm still unclear about how to handle variables to the point where I can do some scatter-gather work. For example, here is a rough outline of what I do in bash, but would like to do in snakemake.

# split the genome into chunks and variant call in parallel
bcftools/vcfutils.pl splitchr -l 1000000 GRCh37-lite.fa.fai | xargs -i echo "gatk-4.0.10.0/gatk --java-options "-Xmx4g" HaplotypeCaller -R GRCh37-lite.fa -I ./GM24385.bam -O ./.{}.vcf.gz --pcr-indel-model NONE -L {}" | parallel -j 20 --eta
# gather the VCF files back into one big VCF
vcftools_0.1.12a/bin/vcf-concat $( ls .*vcf.gz ) > concat.vcf
# sort the VCF
cat concat.vcf | vcftools_0.1.12a/bin/vcf-sort > concat.sorted.vcf

Are there any resources/examples out there showing how to do this or something similar in snakemake?

thanks, Richard

snakemake • 3.5k views
ADD COMMENT
1
Entering edit mode

The ability to do this is quite new, have a look here for the relevant documentation.

ADD REPLY
0
Entering edit mode

Based on the tutorial page suggested above I got this far, but am now getting "SyntaxError in line 13 of Snakefile: invalid syntax" Errors. I suspect it is due to my use of quotes in the shell command. I'm not really sure I'm on the right track so please send along any suggestions.

thanks RIchard

  1 rule all:
  2   input:
  3     "outDir/{sample}/{sample}.concat.sorted.vcf"
  4
  5 checkpoint variantCall:
  6   input:
  7     "input/{sample}.bam"
  8   output:
  9     rawCallsDir=directory("outDir/{sample}")
 10   threads: 96
 11   shell:
 12     "export PATH=jre1.8.0_66/bin/:$PATH;"
 13     "samtools-0.1.16/bcftools/vcfutils.pl splitchr -l 1000000 GRCh37-lite.fa.fai | xargs -i echo "gatk-4.0.10.0/gatk --java-options \"    -Xmx4g\" HaplotypeCaller -R GRCh37-lite.fa -I {input} -O outDir/{sample}/{}.vcf.gz --pcr-indel-model NONE -L {}" | parallel -j 20 --eta
 14
 15 def getVcfList(wildcards):
 16   checkpoint_output = checkpoints.variantCall.get(**wildcards).output[0]
 17   return expand("outDir/{sample}/{sample}.{i}.vcf.gz",
 18   sample=wildcards.sample,
 19   i=glob_wildcards(os/path.join(checkoint_output, "{i}.vcf.gz")).i)
 20
 21 rule vcfConcat:
 22   input:
 23     getVcfList
 24   output:
 25     "outDir/{sample}.concat.vcf"
 26   shell:
 27     "export PERL5LIB=vcftools_0.1.12a/lib/perl5/site_perl/; export PATH=/bin/tabix-0.2.6/:$PATH;"
 28     "vcftools_0.1.12a/bin/vcf-concat {input}  > {output}"
ADD REPLY
1
Entering edit mode

yes if you need to use a lot of quotes then just put your whole shell in a triple-quote block """

ADD REPLY
0
Entering edit mode

Great! That seemed to help. Now it looks like I'm struggling to understand the wildcards. In the code below I want to process each BAM files in the folder "input/" to create 1 final VCF per BAM file.

My current error is "RuleException in line 7 of /projects/rcorbettprj2/GATK4.0.10_snakemake/Snakefile: NameError: The name 'sample' is unknown in this context. Please make sure that you defined that variable. Also note that braces not used for variable access have to be escaped by repeating them, i.e. {{print $1}} "

My SAMPLES variable contains the list of BAM file names that I expect, but I can't seem to pass that variable into

Thanks for everyone's help with this. I was a bit surprised while trying to learn this that there weren't more examples available.

  1 SAMPLES, = glob_wildcards("input/{sample}.bam")
  2
  3
  4 rule all:
  5   input:  expand("outDir/{sample}.concat.sorted.vcf", sample=SAMPLES)
  6
  7 checkpoint variantCall:
  8   input:
  9     "input/{sample}.bam"
 10   output:
 11     rawCallsDir=directory("outDir/{sample}")
 12   threads: 96
 13   shell:
 14     """
 15     export PATH=linux-x86_64/jre1.8.0_66/bin/:$PATH;
 16     samtools-0.1.16/bcftools/vcfutils.pl splitchr -l 1000000 GRCh37-lite.fa.fai | xargs -i echo "gatk-4.0.10.0/gatk --java-options \"-    Xmx4g\" HaplotypeCaller -R GRCh37-lite.fa -I {input} -O outDir/{sample}/{}.vcf.gz --pcr-indel-model NONE -L {}" | parallel -j 20 --eta
 17     """
 18
 19 def getVcfList(wildcards):
 20   checkpoint_output = checkpoints.variantCall.get(**wildcards).output[0]
 21   return expand("outDir/{sample}.{i}.vcf.gz",
 22   sample=wildcards.sample,
 23   i=glob_wildcards(os/path.join(checkoint_output, "{i}.vcf.gz")).i)
 24
 25 rule vcfConcat:
 26   input:
 27     getVcfList
 28   output:
 29     "outDir/{sample}.concat.vcf"
 30   shell:
 31     "export PERL5LIB=vcftools_0.1.12a/lib/perl5/site_perl/; export PATH=tabix-0.2.6/:$PATH;"
 32     "vcftools_0.1.12a/bin/vcf-concat {input}  > {output}"
 33
 34 rule vcfSort:
 35   input:
 36     "outDir/{sample}.concat.vcf"
 37   output:
 38     "outDir/{sample}.concat.sorted.vcf"
 39   shell:
 40     "export PERL5LIB=vcftools_0.1.12a/lib/perl5/site_perl/; export PATH=tabix-0.2.6/:$PATH;"
 41     "cat {intput} | vcftools_0.1.12a/bin/vcf-sort > {output}"
 42
 43
ADD REPLY
1
Entering edit mode

rawCallsDir=directory("outDir/{sample}") is not going to work (set this to a real file), nor is outDir/{sample}/{}.vcf.gz since the second braces need to be escaped so Snakemake knows not to touch them

ADD REPLY
0
Entering edit mode

Thanks Jeremy,

For the directory command, I was hoping to create a folder to hold the smaller VCF files per sample. That way it keeps all the files for each sample separate. For example, I have the following in "input" bam1.bam bam2.bam bam3.bam

For each of the bam files I wanted to create a folder in which I will create ~3000 small VCF files. The VCF files in each folder will then be merged to create one merged VCF file per starting BAM file. I was hoping to use separate folders for the temporary small files so that if another sample gets added later, the merges won't get triggered for the files that were created before. Perhaps snakemake is smart enough to take care of this with the checkpoint command?

ADD REPLY
0
Entering edit mode

Snake is smart enough if you let Snakemake keep track of the intermediates. I don't see the point of using parallel with Snakemake - that's what -j is for.

ADD REPLY
0
Entering edit mode
5.2 years ago
Ya • 0

Snakemake basically just checks whether the input file and output file exist or not. The i/o thing is decided by the bash part of a single rule. The wild card is a variable that contains a string and works as a string in the bash part of a single rule in Snakefile. Of course, we can also assign a list of strings to a wild card.

ADD COMMENT

Login before adding your answer.

Traffic: 2456 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6