Question

Python and SLURM environment: clarifications needed

1

4.6 years ago

Rox ★ 1.4k

Hello everyone,

I don't really know how to put this, but I've recently been trying to improve my coding and tool-making. I was mostly doing a lot of things in bash because it was the easiest way, but I know I should review my habits to make my work more portable.

This was my first year working in a cluster environment (SLURM), and I think I still don't get all the concepts behind it. I was fine with just running simple "sbatch", "srun" or "sarray" commands, but I would like to understand more about how it works.

A concrete example: I want to create a Python script that calls a tool (minimap2) installed on our cluster. This tool can only be used after a "module load" of the appropriate module (in this case bioinfo/minimap2-2.5).

So I gave it a shot with subprocess.call and subprocess.run (which, as I understood it, are sort of the same but not from the same Python version?) to perform that module load.

But it kept failing, and I couldn't find anything about it on the net. I feel like I'm doing something aberrant...

I'm not even sure if what I am trying to do is relevant (slurm module load from within Python?). Can anyone correct me and maybe explain what I'm doing wrong? Maybe you have a website or book to recommend for understanding this topic better. I feel very nooby, but I still want to improve.

Thanks for your help, and sorry if this is not a proper Biostar question,

Roxane

python slurm • 4.5k views

ADD COMMENT • link updated 4.5 years ago by Biostar 20 • written 4.6 years ago by Rox ★ 1.4k

1

Below are all good answers but I wonder why it shouldn't be possible to use exec() to do this.

EDIT: After a quick search, I found the doc for the module command which has this (which is not quite what I was thinking about but still uses exec()):

import os
exec(open('/usr/share/Modules/init/python.py').read())
module('load', 'modulefile', 'modulefile', '...')

ADD REPLY • link 4.6 years ago by Jean-Karim Heriche 27k

0

4.6 years ago

h.mon 35k

I would argue bash is the most portable solution, provided you don't scatter local customizations all over your scripts. It is not the most efficient, and for this reason (as steve said) several workflow managers have been developed.

I'm not even sure if what I am trying to do is relevant (slurm module load from within python ?).

As for your Python code: you can hardcode the minimap2 path into your script, which would work but only on your cluster, or you could create a command-line option for passing the path to minimap2. Or, of course, your script could use whichever minimap2 is first on PATH, and you just have to load the minimap2 module before running your script. I think this is the most natural solution.
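A minimal sketch of the last two options combined, assuming Python 3 and nothing beyond the standard library. The function names `find_tool` and `main` and the `--minimap2` option are made up for illustration; they are not part of minimap2 or of any cluster setup.

```python
import argparse
import shutil
import sys


def find_tool(name, explicit_path=None):
    """Return the executable to call, or None if it cannot be found.

    If explicit_path is given it is checked as-is; otherwise the first
    executable called `name` on PATH is used (for example the one that a
    prior `module load` put there).
    """
    return shutil.which(explicit_path or name)


def main(argv=None):
    parser = argparse.ArgumentParser(description="Hypothetical wrapper around minimap2")
    parser.add_argument(
        "--minimap2",
        default=None,
        help="path to the minimap2 executable (default: first one on PATH)",
    )
    args = parser.parse_args(argv)

    exe = find_tool("minimap2", args.minimap2)
    if exe is None:
        # Nothing on PATH and no usable path given: ask the user to fix the environment.
        sys.exit("minimap2 not found: load its module first or pass --minimap2")
    print(f"using {exe}")
    return exe
```

The script itself never mentions modules. On the cluster you would run `module load bioinfo/minimap2-2.5` in your shell or sbatch script and then start the Python script, calling `main()` under the usual `if __name__ == "__main__":` guard; elsewhere you would pass `--minimap2` or install the tool some other way.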

ADD COMMENT • link 4.6 years ago by h.mon 35k

0

Thank you for your answer!

Yeah, the solution you proposed is the one I thought about at first, but I thought it might not be efficient. I just wanted to try some other ways, as I was mostly doing everything in bash calling Python scripts...

Is it really fine to use this kind of solution? I struggle so much with snakemake; I think I would need a proper course on it, with a teacher to answer my questions.

ADD REPLY • link 4.6 years ago by Rox ★ 1.4k

0

It's definitely fine if it satisfies all your needs. As your pipelines get more complicated and need more and more customization, you will have to make your own decision on whether to keep it simple with bash/Python scripts or to offload the complexity onto a framework.

ADD REPLY • link 4.6 years ago by steve ★ 3.5k

0

4.6 years ago

ssb.pranav ▴ 10

If I get your question right, I don't think Python supports loading a module from SLURM. As given in the answers above by h.mon and steve, I feel the most convenient and easy way is to either use a workflow manager or simply create an sbatch script, load the required modules and then run Python. Maybe you could also create your own environment and then run the sbatch script.

I normally create my own env and follow the steps above. It works.

ADD COMMENT • link 4.6 years ago by ssb.pranav ▴ 10

0

Thanks for your comments! Yeah, indeed there are simpler ways to do it. I was just trying to change my habits and improve a bit!

ADD REPLY • link 4.6 years ago by Rox ★ 1.4k

score 3 · Accepted Answer · 2019-08-26

3

4.6 years ago

steve ★ 3.5k

The reason you are having trouble loading a module on the HPC with subprocess.call, etc., is that each subprocess is executed in a new, independent process. So even if you did a subprocess.call for something like "module load xyz", the loaded module would not persist into your following calls to use that program. If you really wanted to do that, you would have to do something like subprocess.call('module load xyz; run_my_program', shell=True), where shell=True is needed so that the whole string is handed to a shell rather than treated as a single program name.
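Here is a small sketch of that behaviour that runs without any module system, assuming Python 3.7+ on a POSIX machine. An exported variable stands in for whatever "module load" would change; the variable name DEMO_TOOL_HOME and the helper `sh` are invented for the demo.

```python
import os
import subprocess


def sh(command):
    """Run `command` in a fresh shell and return its stdout, stripped."""
    # Start from the parent environment minus the demo variable, so the
    # result does not depend on what happens to be set already.
    env = {k: v for k, v in os.environ.items() if k != "DEMO_TOOL_HOME"}
    result = subprocess.run(
        command, shell=True, env=env,
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


# Two separate calls: the export happens in a shell that exits right away,
# so the second shell never sees the variable.
sh("export DEMO_TOOL_HOME=/opt/demo")
separate = sh('echo "${DEMO_TOOL_HOME:-unset}"')

# One call: both steps share the same shell, so the change is still visible.
chained = sh('export DEMO_TOOL_HOME=/opt/demo; echo "${DEMO_TOOL_HOME:-unset}"')

print(separate)  # unset
print(chained)   # /opt/demo
```

On the cluster the chained string would look like "module load bioinfo/minimap2-2.5; minimap2 ...". One caveat, which is an assumption to check on your own system: `module` is usually a shell function set up by the login scripts, so the plain shell that subprocess starts may not define it at all, which could explain an error on the "module load" call itself.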

Honestly, your best bet is to use a workflow manager like Snakemake or Nextflow. You should not couple your scripts directly to the HPC environment you are using, because they will then become unusable in other environments. For example, with Nextflow you can easily create a pipeline to run your script on your samples, then configure Nextflow to submit each task as a batch job on SLURM and to load the correct modules first. Something like this would be the most portable, since it allows you to update the execution configuration for a new system without having to modify any aspect of your pipeline's tasks.

ADD COMMENT • link 4.6 years ago by steve ★ 3.5k

0

Thank you for your comment!

Indeed, I thought about it afterwards: the module would just be loaded within the subprocess. But as I could not even load the module (I had a weird error which I forgot to capture), I didn't think about it further at first.

I've been giving snakemake a try. I did the tutorial and I thought it was really nice, but when I tried to implement my own thing with it, I realized I had a lot of trouble understanding the input and output logic. I spent 3h trying to understand what I was doing wrong, and I kept getting the error "Wildcards in input files cannot be determined from output files", but maybe that's another Biostar post!

ADD REPLY • link 4.6 years ago by Rox ★ 1.4k

0

I also find working out workflow logic in snakemake confusing. But there are lots of other workflow managers. We use cgatcore/ruffus, which has a much more explicit way of linking tasks together (you literally specify the name of one task as the input to the next), but there is also Nextflow, Toil, and many, many more.

ADD REPLY • link 4.6 years ago by i.sudbery 19k

0

Yes, I've heard of some of the ones you mentioned, such as Nextflow. I think I have a general problem with understanding new ways of analyzing; I'm used to giving my parameters to a script as options, for example. The snakemake logic really makes me think of the OCaml language, which really farts with my brain because I'm conditioned to think from the computer's point of view, not in terms of the result I want...

ADD REPLY • link 4.6 years ago by Rox ★ 1.4k

1

You are describing the difference between explicit and implicit dependency definition, as laid out in Leipzig et al 2017. I also find the "implicit" way of doing things hard to get my head around.

ADD REPLY • link 4.6 years ago by i.sudbery 19k

0

If you are interested in checking out Nextflow, there is plenty of documentation, including example pipelines, tutorials, example patterns, and demos (disclaimer: mine).

ADD REPLY • link 4.6 years ago by steve ★ 3.5k