Best practices to submit a multiple node job in cluster
cmanu • 5.4 years ago

Hi guys

I'm quite new to bioinformatics. Right now I'm trying to do my first run on my university's cluster, which uses SLURM to manage the queue. After submitting a job for 4 nodes I noticed that the run was using only 1% CPU and that I was not getting any output files in my working directory. After some googling I realized that I had not defined any scratch directory, so I've adapted my submission; it now looks something like this:

#!/bin/bash
#SBATCH -n 48
#SBATCH --mem=0
#SBATCH -o %j.o
#SBATCH -e %j.e
# Run for 7 days
#SBATCH -t 07-00:00:00
#SBATCH --exclusive
#SBATCH --job-name=2py

echo "Starting at `date`"
echo "Running on hosts: $SLURM_NODELIST"
echo "Running on $SLURM_NNODES nodes."
echo "Running on $SLURM_NPROCS processors."
echo "Current working directory is `pwd`"

SCRATCHDIR=/scratch/$USER/$SLURM_JOB_ID
mkdir -p $SCRATCHDIR

module load CP2K/6.1-foss-2019a
mpirun srun cp2k.popt -i cp2k.inp -o cp2k.out

cp -r $SCRATCHDIR .
rm -rf $SCRATCHDIR

echo "Program finished with exit code $? at: `date`"

Do you guys have any advice on how to improve it? Also, this will now copy all the files to my working directory, right? If you have any resources that could help me learn about this, that would be very helpful.

Thanks

GenoMax • 5.4 years ago

It is difficult to provide a useful answer to the question in its present form, but I will take a stab.

After submitting a job for 4 nodes I noticed that the run was using only 1% CPU and that I was not getting any output files in my working directory.

We don't know what program you are running, and while you seem to be using mpirun, it may or may not be appropriate for that program. Not all programs benefit from being given many cores (especially if they are not capable of threaded/parallel execution). They may also have serial steps where only one core is doing the necessary work.
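If your cluster has Slurm accounting enabled, you can check how much CPU time a job actually consumed once it has been running for a while. The job ID below is just a placeholder, and seff is a contributed helper that may not be installed at every site:

sacct -j 12345 --format=JobID,Elapsed,TotalCPU,AllocCPUS,MaxRSS   # accounting view of CPU use
seff 12345                                                        # per-job efficiency summary, if available

If TotalCPU is far below Elapsed times AllocCPUS, the job is not making use of the cores it reserved.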

#SBATCH -n 48

Since you have also asked for exclusive access, do your cluster nodes have 48 cores on each?
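If you are not sure how the nodes are laid out, sinfo can report cores and memory per node, and you can then size the request to match the hardware. The 12-cores-per-node figure below is purely illustrative:

sinfo -o "%n %c %m %P"         # hostname, CPUs, memory (MB), partition for each node

#SBATCH --nodes=4
#SBATCH --ntasks-per-node=12   # illustrative; match this to the real core count per node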

After some googling, I noticed that I did not define any scratchdir

You should not strictly have to define this. If the program you are running expects a scratch dir, that is one thing; otherwise programs should automatically use /tmp for that purpose.
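If you do decide to use a dedicated scratch area, a common pattern is to key it to the job ID, run the program from there, and copy results back before cleaning up. The /scratch path below is an assumption; use whatever location your site provides:

SCRATCHDIR=/scratch/$USER/$SLURM_JOB_ID      # unique per job
mkdir -p "$SCRATCHDIR"
cd "$SCRATCHDIR"                             # run the program here so its output lands in scratch
# ... run your program ...
cp -r "$SCRATCHDIR"/. "$SLURM_SUBMIT_DIR"/   # copy results back to where sbatch was called
rm -rf "$SCRATCHDIR"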

#SBATCH --mem=0

I hope that was an error since you appear to be assigning no memory to your job at all.
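Whatever your scheduler actually does with --mem=0, an explicit request leaves no room for surprises. The value below is only an illustration and should be sized to what the program really needs:

#SBATCH --mem-per-cpu=4G   # illustrative figure; adjust to the program's actual footprint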

Since job scheduler setups are site-specific, it would be best to first check whether any local documentation is available. Talk with your cluster admins/help desk to see what you can find.

cmanu:

Thanks for the answer. So I've removed the --mem=0 and the --exclusive tags. The program that I'm using is CP2K, which runs (or can run) using MPI. I thought that it would be helpful to define a scratchdir because, when running on several nodes, I'm not able to see the output files being generated.

GenoMax:

Are you certain the following method of parallel job submission is correct?

mpirun srun cp2k.popt -i cp2k.inp -o cp2k.out

The OpenMPI site seems to indicate a slightly different way.

Try

mpirun cp2k.popt -i cp2k.inp -o cp2k.out

then submit the script file by doing

sbatch your_script_file
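Putting that together with the module line from your script, the relevant part of the batch file would look roughly like this (a sketch; the module name is copied from your post and may differ on other systems):

module load CP2K/6.1-foss-2019a
mpirun cp2k.popt -i cp2k.inp -o cp2k.out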

Some like to use srun in their sbatch scripts; I prefer not to. But GenoMax is right anyway, it should look like this:

srun mpirun ....