Question

Attempting to set up snakemake on a compute cluster


4.1 years ago

mknulst ▴ 20

Hi everyone! I read the excellent guide on Biostars about snakemake and thought I might find some help here; it has been hard finding anyone who has even heard of snakemake.

I am posting here because I am hoping someone with a little more knowledge about conda and snakemake can help me out. I am doing a bachelor's project involving analyses of bacterial WGS. The research group has been talking to another research group which has developed BacDist, and they are very interested in me getting it to work. Unfortunately, due to the COVID situation, the other research group is on lockdown and all of them are infection specialists, so I think I can expect very little instruction from there until further notice. I haven't done anything with conda or snakemake before. Nevertheless, I like a challenge, so I got to work setting up a conda environment with all of the listed dependencies, and that was no biggie.

Now to the errors: when attempting a dry run I get a ModuleNotFoundError at line 4 of the Snakefile, where SeqIO should be imported from Bio, I assume from Bioperl. When I make an environment using Biopython instead of Bioperl, this line works, but other errors happen, and since the listed dependencies only include Bioperl, not Biopython, I assume I am forced to use Bioperl.

I have used conda to install Bioperl. It would be helpful also with input on how to quickly check for the availability of SeqIO and Bio without running the entire script, and what path it should be associated with.

I tried using cpanm instead, but this leads to a bunch of errors because it cannot install dependencies. I also see that it tries to install these into my home directory instead of the conda environment directory, so I might be missing something there. Anyway, I don't know if this is the path I want to head down, so I am hoping for some feedback from here before I keep flicking switches and end up having to redo everything (again).

I wonder a little bit why this snakemake script doesn't include an environment.yml? It seems like that would make it a lot more portable.

I will be very grateful for any and everything you can tell me!

conda bioperl snakemake wgs bacdist • 1.7k views

updated 4.1 years ago by malteherold ▴ 60 • written 4.1 years ago by mknulst ▴ 20


In this context, Bio and SeqIO are biopython.

It seems that SeqIO is only needed to check whether the GenBank file for the reference is correct. If you make sure to provide only a valid gbk file, you might as well just comment out the import and the check_for_plasmids() function plus its call (lines 47-53).

4.1 years ago by cschu181 ★ 2.8k

score 1 · Answer 1 · 2020-04-02

Biopython is imported at the start of the Snakefile and has to be available before you run snakemake.
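To check quickly whether Bio and SeqIO are importable in the active environment without running the whole workflow, something along these lines should work. It is a minimal sketch using only the standard library; the helper name module_location is made up for illustration. It prints the file each module would be loaded from, which also shows whether it lives inside the conda environment or somewhere else.

```python
import importlib.util


def module_location(name):
    """Return the file a module would be loaded from, or None if it is not installed."""
    try:
        spec = importlib.util.find_spec(name)
    except ModuleNotFoundError:
        # Raised for dotted names (e.g. "Bio.SeqIO") when the parent package is missing.
        return None
    if spec is None:
        return None
    return spec.origin


if __name__ == "__main__":
    # Run this with the python of the environment you start snakemake from.
    for name in ("Bio", "Bio.SeqIO"):
        print(name, "->", module_location(name) or "NOT FOUND")
```

If both lines print a path under the environment's directory, the import in line 4 of the Snakefile should succeed in that environment.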

In the standard Snakefile, modules are loaded for each rule separately. If these modules are installed on the cluster that you are working on, this should be fine. If some of the modules are not installed or you run into trouble later, you can check whether other versions are available (e.g. module spider minimap2). If different rules require different environments (e.g. conda) or modules, this can also be solved as described here: https://snakemake.readthedocs.io/en/stable/snakefiles/deployment.html#using-environment-modules

> I have used conda to install Bioperl. It would be helpful also with input on how to quickly check for the availability of SeqIO and Bio without running the entire script, and what path it should be associated with.

You can quickly check whether the Bioperl installation works like this: https://stackoverflow.com/questions/1039107/how-can-i-check-if-a-perl-module-is-installed-on-my-system-from-the-command-line

If you create your own conda environment (or several, depending on snippy), you should probably use this Snakefile: https://github.com/MigleSur/BacDist/blob/master/Snakefile_non_computerome and modify it as needed.

A small tip for debugging snakemake workflows: use the -p option to print the shell commands so you can test them individually.

From my testing of the tool:


conda create -n bacdist

conda activate bacdist

conda install -c bioconda -c conda-forge biopython perl-bioperl bwa readseq samclip bedtools freebayes vcflib perl-vcftools-vcf minimap2 seqtk snp-sites snippy vt bcftools samtools raxml snpsift snp-dists clonalframeml

# add config.yaml

mv Snakefile Snakefile_tt
mv Snakefile_non_computerome Snakefile

# add 2 samples to fastq/ (I later ran into problems with raxml saying "too few species", so I added a few)
# named sample1_R1.fastq.gz ..

snakemake -j4 -p

I ran into a problem with snpsift:


snpsift=/services/tools/snpeff/4.3r/snpEff/SnpSift.jar
java -jar $snpsift filter "( GEN[*].GT='0/0' )" BacDist/test/vcf_calls/merged_E110047_raw.vcf > /BacDist/test/vcf_calls/E110047_diff.vcf

Error: Unable to access jarfile /services/tools/snpeff/4.3r/snpEff/SnpSift.jar

snpsift was missing from the dependencies, and the path to the .jar file was hardcoded in the Snakefile. You have to change this path in the Snakefile (L139, L193, L252). If you installed snpsift with conda:

add snpsift=$WHEREYOUINSTALLEDCONDA/anaconda2/envs/bacdist/share/snpsift-4.3.1t-1/SnpSift.jar
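Rather than typing the versioned directory by hand, the jar can be looked up inside the environment. A minimal sketch, assuming the share/snpsift-<version>/SnpSift.jar layout shown in the path above and that the environment is active so CONDA_PREFIX is set; find_snpsift_jar is a hypothetical helper, not part of BacDist:

```python
import glob
import os


def find_snpsift_jar(env_prefix):
    """Return the path to SnpSift.jar under a conda environment, or None.

    Assumes the layout <env_prefix>/share/snpsift-<version>/SnpSift.jar.
    """
    pattern = os.path.join(env_prefix, "share", "snpsift-*", "SnpSift.jar")
    matches = sorted(glob.glob(pattern))
    # If several versions are present, take the last one in sorted order.
    return matches[-1] if matches else None


if __name__ == "__main__":
    # CONDA_PREFIX is set by "conda activate"; fall back to an empty string.
    prefix = os.environ.get("CONDA_PREFIX", "")
    print(find_snpsift_jar(prefix) or "SnpSift.jar not found under " + repr(prefix))
```

The printed path is what would go into the snpsift= assignments in the Snakefile.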

There was a similar problem with snp-dists, which was initially missing from the dependencies. Change the param in L329 and L524, or in the shell part.

clonalframeml can also be installed with conda.

Some other issues came up during installation, e.g. you can't use conda install vcftools; you need perl-vcftools-vcf instead.
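Since several tools were missing from the dependency list, it can save time to check that the executables are on PATH before starting snakemake. A minimal sketch with the standard library; missing_tools is a hypothetical helper, and the executable names in the list are assumptions that may differ from what the Snakefile actually calls:

```python
import shutil


def missing_tools(names):
    """Return the names from `names` that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    # Executable names are assumptions; adjust them to what the Snakefile calls.
    tools = ["bwa", "samtools", "bcftools", "minimap2", "seqtk", "snp-sites", "snp-dists"]
    print("missing:", missing_tools(tools) or "none")
```

Run it inside the activated bacdist environment; anything listed as missing needs to be installed or have its path fixed in the Snakefile.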

For the summary statistics I ran into a problem with awk:


awk: program limit exceeded: maximum number of fields size=32767
    FILENAME="-" FNR=1 NR=1

This might be Ubuntu-specific (https://stackoverflow.com/questions/24292787/awk-program-limit-exceeded-maximum-number-of-fields-size-32767) and can be fixed by installing gawk.

All in all it runs through. Hope this helps, let me know if you have questions.