Question: Snakemake on a Slurm cluster - jobs not updating/submitting after checkpoints? (Error submitting jobscript (exit code 1):)
Berghopper wrote (4 months ago):

Dear Biostars,

I have a fairly complicated pipeline that I need to run on a Slurm cluster, but I can't get it to work.

The pipeline works for smaller runs, but as soon as I add more input files for more rigorous testing, it no longer finishes correctly.

I don't have a minimal example yet, as it's the end of my workday; I will add one if this question isn't easily resolved.

So, what happens is the following:

I submit my main Snakemake "daemon" job with sbatch ../slurm_eating_snakemake.sh, i.e. the following script:

#!/usr/bin/env bash

# Jobname
#SBATCH --job-name=SNEKHEAD
#
# Project
#SBATCH --account=nn3556k
#
# Wall clock limit
#SBATCH --time=24:00:00
#
# Max memory usage:
#SBATCH --mem-per-cpu=16G

## set up job environment
source /usit/abel/u1/caspercp/Software/snek/bin/activate
module purge   # clear any inherited modules
#set -o errexit # exit on errors (turned off, so all jobs are cancelled in event of crash)

## copy input files
cp -R /usit/abel/u1/caspercp/nobackup/DATA/ $SCRATCH
cp -R /usit/abel/u1/caspercp/lncrna_thesis_prj/src/snakemake_pipeline/ $SCRATCH
#cp -R $SUBMITDIR\/OUTPUTS/ $SCRATCH

## Do some work:
cd $SCRATCH\/snakemake_pipeline
echo $(date) >> ../bash_tims.txt
# run pipeline
snakemake --snakefile start.snakefile -pr --runtime-profile ../timings.txt --cluster "sbatch -A nn3556k --time=24:00:00 --mem-per-cpu=4G -d after:"$SLURM_JOB_ID -j 349 --restart-times 1
echo $(date) >> ../bash_tims.txt

## Make sure the results are copied back to the submit directory:
cp -R $SCRATCH\/OUTPUTS/ $SUBMITDIR
cp -R $SCRATCH\/snakemake_pipeline/.snakemake/ $SUBMITDIR
mkdir $SUBMITDIR\/child_logs/
cp $SCRATCH\/snakemake_pipeline/slurm-*.out $SUBMITDIR\/child_logs/
cp $SCRATCH\/OUTPUTS/output.zip $SUBMITDIR
cp $SCRATCH\/timings.txt $SUBMITDIR
cp $SCRATCH\/bash_tims.txt $SUBMITDIR

# CANCEL ALL JOBS IN EVENT OF CRASH (or on exit, but it should not matter at that point.)
scancel -u caspercp

I am using the Abel cluster, in case you want to know specifics: https://www.uio.no/english/services/it/research/hpc/abel/

This is where I feel the whole thing falls apart. For some reason, once the checkpoints have finished, Snakemake can't submit new jobs. I get the following error (subset of the Snakemake output):

[Thu Mar 14 17:46:27 2019]
checkpoint split_up_genes_each_sample_lnc:
    input: ../OUTPUTS/prepped_datasets/expression_table_GSEA_Stopsack-HALLMARK_IL6_JAK_STAT3_SIGNALING.txt
    output: ../OUTPUTS/control_txts/custom_anno/expression_table_GSEA_Stopsack-HALLMARK_IL6_JAK_STAT3_SIGNALING-human-BP/
    jobid: 835
    reason: Missing output files: ../OUTPUTS/control_txts/custom_anno/expression_table_GSEA_Stopsack-HALLMARK_IL6_JAK_STAT3_SIGNALING-human-BP/; Input files updated by another job: ../OUTPUTS/prepped_datasets/expression_table_GSEA_Stopsack-HALLMARK_IL6_JAK_STAT3_SIGNALING.txt
    wildcards: expset=expression_table_GSEA_Stopsack, geneset=HALLMARK_IL6_JAK_STAT3_SIGNALING, organism=human, ontology=BP
Downstream jobs will be updated after completion.

Error submitting jobscript (exit code 1):

Updating job 655.
[Thu Mar 14 17:46:43 2019]
Finished job 896.
95 of 1018 steps (9%) done
Updating job 539.
[Thu Mar 14 17:47:24 2019]
Finished job 780.
96 of 1022 steps (9%) done
Updating job 643.
.......
[Thu Mar 14 17:51:35 2019]
Finished job 964.
203 of 1451 steps (14%) done
Updating job 677.
[Thu Mar 14 17:51:46 2019]
Finished job 918.
204 of 1455 steps (14%) done
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /work/jobs/26276509.d/snakemake_pipeline/.snakemake/log/2019-03-14T172923.764021.snakemake.log

Roughly speaking, the checkpoint splits an output txt file with genes (a gene set file) into separate files called {gene}.txt for each sample, so that I can feed them to my analysis algorithms.

But I am really confused by the error "Error submitting jobscript (exit code 1):"; it doesn't give a clear direction for troubleshooting.
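One way to get more detail than the bare exit code might be to point --cluster at a small wrapper instead of calling sbatch directly, so that whatever sbatch prints on stderr is kept in a log. The sketch below is only an assumption about how this could be done and is not part of the original pipeline: the function name submit_logged and the variables SBATCH_CMD, SUBMIT_LOG, MAX_ATTEMPTS and RETRY_DELAY are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around sbatch: keeps stderr and the exit code of every
# failed submission in a log file and retries a few times before giving up.
# SBATCH_CMD, SUBMIT_LOG, MAX_ATTEMPTS and RETRY_DELAY are illustrative names.

submit_logged() {
    local log="${SUBMIT_LOG:-sbatch_errors.log}"
    local max_attempts="${MAX_ATTEMPTS:-3}"
    local delay="${RETRY_DELAY:-30}"
    local attempt out status=1

    for ((attempt = 1; attempt <= max_attempts; attempt++)); do
        # stdout (the "Submitted batch job N" line) is captured and passed on;
        # stderr is appended to the log so the actual error message survives.
        if out=$("${SBATCH_CMD:-sbatch}" "$@" 2>>"$log"); then
            printf '%s\n' "$out"
            return 0
        else
            status=$?
        fi
        echo "$(date) attempt $attempt/$max_attempts failed (exit $status): $*" >>"$log"
        # wait before retrying, except after the last attempt
        if [ "$attempt" -lt "$max_attempts" ]; then
            sleep "$delay"
        fi
    done
    return "$status"
}
```

Saved as an executable script that ends with submit_logged "$@", this could replace sbatch in the --cluster string (for example --cluster "./submit_logged.sh -A nn3556k --time=24:00:00 --mem-per-cpu=4G"); Snakemake appends the path of the generated jobscript as the last argument. The log should then show whether the submission was rejected because of a limit, a timeout, or something else.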

Thanks in advance for any input!

Extra info:

  • The pipeline runs fine outside of the cluster.
  • I suspect I have to group my jobs in a specific way, although I am not sure.

I am using the following snakemake setup:

(snek) -bash-4.1$ pip freeze --local
appdirs==1.4.3
attrs==19.1.0
certifi==2019.3.9
chardet==3.0.4
ConfigArgParse==0.14.0
Cython==0.29.6
datrie==0.7.1
docutils==0.14
gitdb2==2.0.5
GitPython==2.1.11
idna==2.8
jsonschema==3.0.1
numpy==1.16.2
pandas==0.24.1
pyrsistent==0.14.11
python-dateutil==2.8.0
pytz==2018.9
PyYAML==3.13
ratelimiter==1.2.0.post0
requests==2.21.0
six==1.12.0
smmap2==2.0.5
snakemake==5.4.3
urllib3==1.24.1
wrapt==1.11.1
yappi==1.0
modified 4 months ago • written 4 months ago by Berghopper

What happens if you remove -d after:$SLURM_JOB_ID? Usually it's most convenient to let Snakemake handle starting jobs.

written 4 months ago by Devon Ryan

I don't know yet. I added this mostly as a safety feature in case the main "daemon" job terminates while the child jobs are still pending. I'll try removing it and report back.

I also added another comment; you may actually be on the right track.

modified 4 months ago • written 4 months ago by Berghopper

I think I resolved my issue, see my latest comment.

written 4 months ago by Berghopper

Bump,

It was actually not a mistake in the documentation...

written 4 months ago by Berghopper

Berghopper wrote (4 months ago):

The reason this was happening was Slurm's job scheduler... Sadly, UiO's documentation states that you can use at most ~400 jobs (https://web.archive.org/web/20190314224204/https://www.uio.no/english/services/it/research/hpc/abel/help/user-guide/queue-system.html#General_Job_Limitations). When I went and measured it, though, the maximum was only 40!

This creates a bit of a problem for pipelines that rely on running rules for their parallelization; I might have to tweak things a bit...

Edit: This was not a mistake in the documentation. I contacted the Slurm admins and they verified that you actually CAN run 400 jobs on the cluster. This makes me wonder whether this is a Slurm or Snakemake bug...
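For anyone wanting to repeat the measurement, counting one's own pending and running jobs while the pipeline is active is one possible approach. The function below is a minimal sketch, not the method actually used above: count_queued_jobs and the SQUEUE_CMD override are hypothetical names, and the configured per-user limits themselves would have to be checked separately with the cluster admins or the accounting tools.

```shell
#!/usr/bin/env bash
# Count the jobs a user currently has pending or running in Slurm.
# SQUEUE_CMD is an illustrative override so the function can be tried
# on a machine without Slurm installed.

count_queued_jobs() {
    local user="${1:-${USER:-}}"
    # -h: no header, -u: restrict to one user,
    # -t: only these job states, -o '%i': print the job ID only
    "${SQUEUE_CMD:-squeue}" -h -u "$user" -t PENDING,RUNNING -o '%i' \
        | wc -l | tr -d ' '
}
```

Called in a loop next to the running pipeline, for example while sleep 60; do echo "$(date) $(count_queued_jobs)"; done, it records how many jobs were actually in the queue around the time a submission failed.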

modified 4 months ago • written 4 months ago by Berghopper

As OP, can you mark your own answer as solved? Your question, though very well written and with many appreciated details, is quite a bit of text to read, and it took me a moment to realize it's solved.

written 4 months ago by Carambakaracho

Actually, it is sadly still not solved. I have contacted the Slurm cluster admins and they state that I CAN use 400 jobs after all. So either this is a Slurm bug or Snakemake is failing to handle this many jobs.

Either way, as a workaround for now I'll probably write a separate intermediary script that handles detailed multithreading per node.

written 4 months ago by Berghopper
Berghopper wrote (4 months ago):

Ok, update time:

What I've noticed is that if I dial down the -j option (the number of jobs submitted at the same time), the pipeline is a lot more stable. I don't know why this is, but I imagine it has more to do with Slurm than with Snakemake...
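As a rough illustration of what "dialing down" could look like, the sketch below assembles the Snakemake call from the job script with a smaller -j and an explicit cap on the submission rate. The values 40 and 1 are placeholders, not tested settings, and build_snakemake_cmd is a hypothetical helper. --max-jobs-per-second is, to my knowledge, available in Snakemake 5.x, but it is worth confirming with snakemake --help on the installed version.

```shell
#!/usr/bin/env bash
# Assemble the snakemake call with a lower -j and a cap on how fast jobs are
# handed to sbatch. The -d after: dependency from the original call is left
# out here, as suggested earlier in the thread; add it back if it is needed.

build_snakemake_cmd() {
    local max_jobs="${1:-40}"        # placeholder upper bound for -j
    local submits_per_sec="${2:-1}"  # placeholder cap on submissions per second
    local cmd=(
        snakemake --snakefile start.snakefile -pr
        --runtime-profile ../timings.txt
        --cluster "sbatch -A nn3556k --time=24:00:00 --mem-per-cpu=4G"
        -j "$max_jobs"
        --max-jobs-per-second "$submits_per_sec"
        --restart-times 1
    )
    # print every argument shell-quoted, so the command can be inspected first
    printf '%q ' "${cmd[@]}"
    printf '\n'
}
```

Printing the command first makes it easy to check the quoting of the --cluster string; inside the job script it could then be run with eval "$(build_snakemake_cmd 40 1)".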

written 4 months ago by Berghopper