Snakemake and conda directory structure
4 weeks ago
hpapoli ▴ 150

Hello,

Snakemake suggests the following structure for the project directory:

[Image: Snakemake's suggested project directory structure]

I have a few questions when using snakemake together with conda.

For a project called project A, I create a directory with the suggested structure in the image above. My questions are:

  1. If I have a conda environment called conda_project_A, where should I create it? Specifically, where should its envs and pkgs directories go if I don't want them under .conda in my home directory? Should they live in the project A directory, in a hidden directory called .conda_project_A that contains envs and pkgs?

  2. In the example, the envs directory only contains yaml files. So, for example, I create the conda_project_A environment from a yaml file in envs called conda_project_A.yaml, which lists Python and snakemake as dependencies. Then, for each tool I want to use, I add a new yaml file under envs, such as fastqc.yaml or bwa.yaml, and install them all into the conda_project_A environment. Is that the right way?

I am just looking for best practices in terms of reproducibility so I'd appreciate any advice. Thanks!

management workflow snakemake conda • 567 views
4 weeks ago
Michael 55k

For each tool you want to run in your workflow, create a dedicated named conda environment, e.g. micromamba create -n samtools, then activate it, install the tool, and run micromamba env export > envs/samtools.yaml. Place the file in the envs folder of the workflow where you want to use it. Add a conda:"envs/samtools.yaml" directive to every rule that needs it. Then run snakemake with --use-conda.

I would not use a yaml file in which only the top-level packages are listed by hand; use the fully resolved environment export instead.
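A rough sketch of those steps, in case it helps. It assumes the layout from the question with the Snakefile under workflow/. The project name project_A, the helper export_env, the channels, the rule name samtools_sort and the results/ paths are all made up for illustration. The conda: path is resolved relative to the Snakefile, which is why the yaml sits in workflow/envs.

```shell
# Sketch only: project name, channels and file paths are assumptions.
set -eu

PROJECT=project_A                      # hypothetical project root
mkdir -p "$PROJECT/workflow/envs"

# Create a named environment for one tool and export the fully resolved
# specification into the workflow's envs folder.
# Call as: export_env samtools
export_env() {
    tool="$1"
    micromamba create -y -n "$tool" -c conda-forge -c bioconda "$tool"
    micromamba env export -n "$tool" > "$PROJECT/workflow/envs/$tool.yaml"
}

# Minimal Snakefile with a single rule that points at the exported file.
cat > "$PROJECT/workflow/Snakefile" <<'EOF'
rule samtools_sort:
    input:
        "results/mapped/{sample}.bam"
    output:
        "results/sorted/{sample}.bam"
    conda:
        "envs/samtools.yaml"
    shell:
        "samtools sort -o {output} {input}"
EOF
```

After calling export_env samtools once, running snakemake --use-conda --cores 1 with a target under results/sorted/ from inside project_A should make Snakemake build the rule's environment from the yaml file before executing the rule.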


+1, but I remain unconvinced that you want a conda env for each tool, or even for each rule. Doesn't that create a mess of tiny environments? And if you want to experiment and run steps outside snakemake, you have to keep switching envs. So far I have been happy with this strategy:

  • Create a fresh conda env when you start a new project

  • In a requirements.txt file add dependencies with their full version number as you go along. E.g.

samtools =1.19
hisat2 =2.2.1
...
  • When you need a new dependency, update that file and run mamba install --file requirements.txt. Occasionally you may need to delete the env and recreate it from requirements.txt

  • Only when you hit a real incompatibility, create a dedicated env.
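The strategy above could be scripted roughly like this. It is a sketch under assumptions: the environment name project_A, the three helper functions and the channels are invented for illustration, and the pins are the two from the example.

```shell
# Sketch only: environment name, helper names and channels are assumptions.
set -eu

ENV_NAME=project_A    # hypothetical environment name

# Pinned dependencies, one per line; extend this file as the project grows.
cat > requirements.txt <<'EOF'
samtools=1.19
hisat2=2.2.1
EOF

# First-time setup: create the project environment from the file.
create_env() {
    mamba create -y -n "$ENV_NAME" -c conda-forge -c bioconda --file requirements.txt
}

# After editing requirements.txt: install whatever is new into the same env.
update_env() {
    mamba install -y -n "$ENV_NAME" -c conda-forge -c bioconda --file requirements.txt
}

# If the env ends up in a bad state: delete it and rebuild from the file.
rebuild_env() {
    mamba env remove -y -n "$ENV_NAME"
    create_env
}
```

Because requirements.txt stays under version control with the rest of the project, the env can be thrown away and rebuilt at any time.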

Other things: use mamba (I have more or less forgotten about conda), and it is better to leave the base env alone.


How large or restricted the environment(s) need to be is very situational and a matter of taste. It may be possible to use a single environment for the whole workflow if the tools are compatible in their dependencies, e.g. samtools and hisat. If instead you have many tools with diverging dependencies (e.g. different Python versions), or have to install non-conda packages into them, starting a new environment for each is better. The downside is that this wastes a lot of space and time, which is a general downside of conda environments, so it is important to strike a balance. Relative to the time and space needed for the data and the analysis, it is still a negligible factor for me.

Also, the OP asked whether software could be shared across environments in a single rule. I think that is not advisable, even though you can find out more about your conda environments from inside Snakemake using the $CONDA_... variables. Remember also that the conda environments Snakemake creates are placed inside the workflow's working directory, and they do not follow the naming pattern of the environments in your home directory; their names are generated and vary. Trying to access these names in the workflow code, or trying to use software from the conda install in your home directory, will render the workflow non-portable.
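To see what that looks like on disk, here is a small helper. It assumes Snakemake's default location for rule environments, .snakemake/conda under the working directory, which no longer applies if a different location was set with --conda-prefix. The function name list_snakemake_envs is made up.

```shell
# Sketch only: assumes the default .snakemake/conda location.
set -eu

# Print the environment directories Snakemake has created for a workflow.
# Usage: list_snakemake_envs [workdir]   (defaults to the current directory)
list_snakemake_envs() {
    workdir="${1:-.}"
    envdir="$workdir/.snakemake/conda"
    if [ -d "$envdir" ]; then
        # Each environment is a directory with a generated name;
        # -type d skips the yaml copies stored next to them.
        find "$envdir" -mindepth 1 -maxdepth 1 -type d
    else
        echo "no environments found under $envdir" >&2
    fi
}
```

The directory names it prints are generated by Snakemake and can change whenever the yaml file changes, so they are useful for inspection but should not be hard-coded in a rule.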


I see, this is very helpful. Where do you place the environments? For example, where do you create the samtools environment here? Does Snakemake itself have its own environment? And finally, say I am running some Python code in a Snakemake rule: do you then use the Python from another environment, or from an environment of its own?

Thank you!


Another question: I see you've used micromamba here. Do you recommend it over conda and mamba?


You should almost always make a completely fresh environment and never try to cross-use programs from different environments (the environments have random folder names, so you won't succeed with that anyway). You can use the same environment for multiple rules, or even for the whole workflow, if that is possible. Snakemake itself needs to be available either in your base environment, in a separate environment that you execute your workflow from, or installed some other way. The environments you create for rules do not need to contain snakemake. I am using micromamba because it is the fastest at solving the environment. In snakemake, mamba is now the default and I would leave it at that; micromamba is not supported by it yet. The resulting environments should be identical whichever of the two creates them.
