Question: Errors attempting to start OMA job on SLURM-managed cluster
Adamc wrote:

Hi,

I'm trying to run OMA on a cluster that is managed with SLURM. I've used this cluster before for a lot of other things, but I'm not very familiar with job arrays. Although I'm doing things right as far as I can tell, OMA keeps giving only:

"ERROR: Cannot determine total number of jobs in jobarray. requires a range from 1-x"

An example of one of the SLURM scripts I've tried:

#!/bin/sh
#SBATCH --partition=standard
#SBATCH --time=12:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=5
#SBATCH --mem-per-cpu=2gb
#SBATCH --job-name="OMA_test"
#SBATCH --output=output_%A_%a.txt
#SBATCH --error=error_%A_%a.txt
#SBATCH --array=1-5
~/OMA/bin/OMA -s

It initializes fine if I don't use SLURM, but I'm working with multiple genomes, so parallelization would be helpful. I also hit an edge case when running in our debug queue: I specified 10 jobs in the array, but the maximum number of simultaneous jobs in debug is 5, so it seemed to group them hierarchically and started executing five tasks properly. Does this mean that I perhaps need to spawn a "parent" job that then submits the array jobs? I couldn't keep it running in debug because it would time out after an hour, so I'd probably have to restart it a bunch of times to get the whole job to complete.

The SLURM example on the documentation page (http://omabrowser.org/standalone/) doesn't seem to imply that any "parent" job is necessary, and I didn't come across any way of manually specifying the number of array jobs to OMA.

Anyone with some insight on this? Thanks.


Have you tried running with sbatch --array=1-5 ... instead? Just don't put the array size in the script.
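
For illustration, the idea would be roughly this (just a sketch; the script name is a placeholder and the #SBATCH options mirror the example above):

#!/bin/sh
# run_oma.sh -- a hypothetical wrapper script, with no #SBATCH --array line inside
#SBATCH --partition=standard
#SBATCH --time=12:00:00
#SBATCH --nodes=1
#SBATCH --mem-per-cpu=2gb
#SBATCH --job-name="OMA_test"
#SBATCH --output=output_%A_%a.txt
#SBATCH --error=error_%A_%a.txt
~/OMA/bin/OMA -s

and then submit it with the array range given on the command line instead:

sbatch --array=1-5 run_oma.sh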

— Devon Ryan

I didn't think there would be a functional difference, just a different way of specifying the parameters (in the script vs. on the command line). I guess I'll give it a shot though.

EDIT: I just tried exactly the same lines as in the SLURM example on the OMA site, except with 1-5 for the array and with a partition specified, and still got the error about not being able to determine the number of jobs in the array.

EDIT 2: I just noticed that there are indeed command-line args for OMA (not just the options specified in the parameter file), and one of them allows specifying a number of threads on a single node. While it would be nice to have job arrays working, since those are more flexible to schedule, at least I can try to grab a few individual nodes and run this in parallel using -n. When I get the job arrays figured out I'll come back here and add an answer (unless someone else posts one first :) )
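
For reference, a single-node run along those lines might look roughly like this (a sketch only; I'm assuming -n takes the number of parallel processes on the node, and the partition, time, and memory values are placeholders):

#!/bin/bash
#SBATCH --partition=standard
#SBATCH --time=12:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=5
#SBATCH --mem-per-cpu=2gb
# run 5 OMA processes on this single node instead of using a job array
~/OMA/bin/OMA -n 5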

— Adamc
adrian.altenhoff wrote:

Apparently not all versions or configurations of slurm provide the same scontrol output. Here is a temporary hack so that you can keep working: you set an environment variable yourself with the number of jobs the array will have, and the small code change below makes sure that OMA picks up this information correctly.

In your jobscript or shell (I'm assuming you are using bash), put:

export NR_OF_JOBS_IN_ARRAY=5

and replace the value with whatever you use in the job-array definition; if it's 1-5, then 5 is correct.
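
For example, a jobscript could look something like this (a sketch; the partition, resources, and path are placeholders):

#!/bin/bash
#SBATCH --partition=standard
#SBATCH --nodes=1
#SBATCH --mem-per-cpu=2gb
#SBATCH --array=1-5
# must match the upper bound of the --array range above
export NR_OF_JOBS_IN_ARRAY=5
~/OMA/bin/OMA -s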

Now we need to modify one file of OMA. Navigate to where you installed OMA (or, if you didn't install it, into the unpacked tarball), and open the file "lib/Platforms" in your editor. We will insert a command on line 71, between the following lines:

jobidx := parse(jobidx);                                                                                                                                                                                   
jobId := getenv('SLURM_ARRAY_JOB_ID');

We add an extra line, so that the file now looks like this:

jobidx := parse(jobidx);                                                                                                                                                                                   
return(ParallelInfo(parse(getenv('NR_OF_JOBS_IN_ARRAY')), jobidx));
jobId := getenv('SLURM_ARRAY_JOB_ID');

Save the file and retry running OMA. I hope this will do for the moment. Adrian


Fantastic, it appears to be working now. It's much easier to schedule individual single-CPU jobs than entire nodes! Thank you for looking into it (and for trying to support multiple HPC systems to begin with).

— Adamc
adrian.altenhoff wrote:

Hi Adamc

I'm one of the developers of OMA. The -n option is indeed mostly intended for non-cluster machines that have multiple cores but no scheduler. The job array is definitely the way to go in your case: if you start jobs with -n on multiple nodes, chances are high that you will get collisions on the file system.

Unfortunately, I myself don't have a lot of experience with slurm clusters and the different configurations that are used, but I'm happy to help investigate what breaks the job-array detection on your cluster. Essentially, OMA tries to figure out the job-array parameters in lib/Platforms, where it uses environment variables and scontrol calls. I suggest you run the following script, which mimics the way OMA looks for the job-array parameters:

#!/bin/bash
sleep 3
if [ -z "$SLURM_ARRAY_TASK_ID" ]; then
    echo "SLURM_ARRAY_TASK_ID not set"
fi
if [ -z "$SLURM_ARRAY_JOB_ID" ]; then
    echo "SLURM_ARRAY_JOB_ID not set"
fi

info=$(scontrol show jobid -dd $SLURM_ARRAY_JOB_ID | grep ArrayTaskId)
echo $info

The last line should print a line that is of the form

ArrayTaskId=<start_nr>-<end_nr>

If it doesn't, could you please let me know what scontrol show jobid -dd $SLURM_ARRAY_JOB_ID returns.
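
For context, the total number of jobs is essentially derived from the upper end of that range. A bash sketch of that step might look like this (not the exact code in lib/Platforms, and it assumes the ArrayTaskId value really is a start-end range):

# grab the first ArrayTaskId=<start>-<end> entry and keep everything after the last '-'
range=$(scontrol show jobid -dd $SLURM_ARRAY_JOB_ID | grep -o 'ArrayTaskId=[0-9-]*' | head -n1)
total_jobs=${range##*-}
echo "total jobs in array: $total_jobs"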


Hi, thanks for the response. I just tried running this with a simple array of 2 jobs; here's what I get:

cat slurm-7376383_1.out

TASK:
1
JOB:
7376383
JobId=7376383 ArrayJobId=7376383 ArrayTaskId=2 JobName=OMA_test info=$(scontrol show jobid -dd $SLURM_ARRAY_JOB_ID | grep ArrayTaskId)
JobId=7376384 ArrayJobId=7376383 ArrayTaskId=1 JobName=OMA_test info=$(scontrol show jobid -dd $SLURM_ARRAY_JOB_ID | grep ArrayTaskId)

cat slurm-7376383_2.out
TASK:
2
JOB:
7376383
JobId=7376383 ArrayJobId=7376383 ArrayTaskId=2 JobName=OMA_test info=$(scontrol show jobid -dd $SLURM_ARRAY_JOB_ID | grep ArrayTaskId)
JobId=7376384 ArrayJobId=7376383 ArrayTaskId=1 JobName=OMA_test info=$(scontrol show jobid -dd $SLURM_ARRAY_JOB_ID | grep ArrayTaskId)

And here were the parameters for the batch:

#!/bin/bash
#SBATCH --partition=debug
#SBATCH --time=00:05:00
#SBATCH -N1
#SBATCH --mem-per-cpu=2gb
#SBATCH --job-name="OMA_test"
#SBATCH --array=1-2

I added lines that echo the job and task IDs as a sanity check.

— Adamc

Hmm, it looks like the reported output of scontrol show is different from what it used to be. Maybe this is simply because the array is very short. Could you verify by running the same thing on a job array of 1-100?

— adrian.altenhoff
Adamc wrote:
#!/bin/bash

#SBATCH --partition=standard
#SBATCH --time=00:01:00
#SBATCH -N1
#SBATCH --mem-per-cpu=1gb
#SBATCH --job-name="OMA_test"
#SBATCH --array=1-100

sleep 3
if [ -z "$SLURM_ARRAY_TASK_ID" ]; then
    echo "SLURM_ARRAY_TASK_ID not set"
fi
if [ -z "$SLURM_ARRAY_JOB_ID" ]; then
    echo "SLURM_ARRAY_JOB_ID not set"
fi

echo "TASK:"
echo $SLURM_ARRAY_TASK_ID
echo "JOB:"
echo $SLURM_ARRAY_JOB_ID

info=$(scontrol show jobid -dd $SLURM_ARRAY_JOB_ID | grep ArrayTaskId)
echo $info

This generated output files with indices from 1 to 100 appended to their names, and each of them contained essentially the same information. Here's an excerpt from one of them:

TASK: 10
JOB: 7386776
JobId=7386776 ArrayJobId=7386776 ArrayTaskId=100 JobName=OMA_test info=$(scontrol show jobid -dd $SLURM_ARRAY_JOB_ID | grep ArrayTaskId)
JobId=7386875 ArrayJobId=7386776 ArrayTaskId=99 JobName=OMA_test info=$(scontrol show jobid -dd $SLURM_ARRAY_JOB_ID | grep ArrayTaskId)
JobId=7386874 ArrayJobId=7386776 ArrayTaskId=98 JobName=OMA_test info=$(scontrol show jobid -dd $SLURM_ARRAY_JOB_ID | grep ArrayTaskId)
JobId=7386873 ArrayJobId=7386776 ArrayTaskId=97 JobName=OMA_test info=$(scontrol show jobid -dd $SLURM_ARRAY_JOB_ID | grep ArrayTaskId)
JobId=7386872 ArrayJobId=7386776 ArrayTaskId=96 JobName=OMA_test info=$(scontrol show jobid -dd $SLURM_ARRAY_JOB_ID | grep ArrayTaskId)
JobId=7386871 ArrayJobId=7386776 ArrayTaskId=95 JobName=OMA_test info=$(scontrol show jobid -dd $SLURM_ARRAY_JOB_ID | grep ArrayTaskId)
JobId=7386870 ArrayJobId=7386776 ArrayTaskId=94 JobName=OMA_test info=$(scontrol show jobid -dd $SLURM_ARRAY_JOB_ID | grep ArrayTaskId)
JobId=7386869 ArrayJobId=7386776 ArrayTaskId=93 JobName=OMA_test info=$(scontrol show jobid -dd $SLURM_ARRAY_JOB_ID | grep ArrayTaskId)

So it seems like the behavior is the same as in the previous case. Did I get the script right? I haven't played around with scontrol much, since I figured many scontrol operations require admin-level rights. By the way, this is all with slurm version 14.11.9.


Hi Adam, indeed, it all looks the same as above. I'll need to get back to slurm and check out how this could be done in a more stable fashion. As a temporary fix, see my other answer. Cheers

— adrian.altenhoff