SLURM exit code 1 (general failure) with no error printed in the log file
20 months ago
Matteo ▴ 10

Hi everyone,

I have been running ELAI on an HPC cluster, which I have done successfully in the past, but now SLURM reports the jobs as failed (exit code = 1). However, the log file of the submitted job looks fine and no error is printed (see below). The output file looks fine as well and has an appropriate size, comparable to those obtained from past analyses. Does anyone know what might cause such a discrepancy between the SLURM report and the log file? Is it safe to rely on the output files that have been generated? Thanks in advance for the help!

Matteo

Log file output:

 ## COMMAND: /home/vonholdt/VONHOLDT/BIN/elai/elai-lin -g ref_extoni_for_elai_chr05.recode.geno.txt -p 10 -g ref_pusillus_for_elai_chr05.recode.geno.txt -p 11 -g chrysopus_to_infer_for_elai_chr05.recode.geno.txt -p 1 -pos chrysopus_to_infer_for_elai_chr05.recode.pos.txt -s 30 -o chr05_run3_mg15 -C 2 -c 10 -mg 15
## randseed = 1661469556
## warning: number of position files = 1
## warning: position files contain 1157298 records.
## warning: File 0 has 26 ind's and 1157298 SNPs
## warning: File 1 has 7 ind's and 1157298 SNPs
## warning: File 2 has 103 ind's and 1157298 SNPs
### m_morgan = 0.482452
### total number of individuals 136
## number of panel individuals = 33
## number of cohort = 103
## number genotype files = 3
## number phenotype files = 3
## number of diploid = 136
## number of haploid = 0
## number of individuals = 136
## number of snp = 1157298
 ### estimated total genetic distance 0.482452
 ### constrained upper layer switches 7.23678
 ### constrained lower layer switches 482.452
 ### constrained ancillary switches 1
### 0 0  -80954696.237  -80954696.237    500.507       7.397       0.935
### 0 1  -42135163.282  38819532.955     766.024       8.044       0.750
### 0 2  -29799695.683  12335467.599    2875.696      34.144     197.866
### ta is nan in beta update 
ta is nan in beta update 
ta is nan in beta update 
ta is nan in beta update 
ta is nan in beta update 
ta is nan in beta update 
ta is nan in beta update 
ta is nan in beta update 
ta is nan in beta update 
ta is nan in beta update 
ta is nan in beta update 
ta is nan in beta update 
ta is nan in beta update 
ta is nan in beta update 
ta is nan in beta update 
ta is nan in beta update 
ta is nan in beta update 
ta is nan in beta update 
ta is nan in beta update 
ta is nan in beta update 
trk is nan in rk upate 

...

0 28     -109855861.541 8652.939    36316.739     24.034     672.992
### ta is nan in beta update 
ta is nan in beta update 
ta is nan in beta update 
ta is nan in beta update 
ta is nan in beta update 
ta is nan in beta update 
ta is nan in beta update 
ta is nan in beta update 
ta is nan in beta update 
ta is nan in beta update 
ta is nan in beta update 
ta is nan in beta update 
ta is nan in beta update 
ta is nan in beta update 
ta is nan in beta update 
ta is nan in beta update 
ta is nan in beta update 
ta is nan in beta update 
ta is nan in beta update 
ta is nan in beta update 
trk is nan in rk upate 
0 29     -110010505.507 -154643.966 36622.642     24.250     673.884
## EM seconds used = 65041
## ELAI generate following files in the output directory.
## chr05_run3_mg15.snpdata.txt
## random seed = 1661469556

To me, this looks like the error output is getting lost: you are capturing output from each slave script, but not necessarily from the master script. I could be wrong, though; it is hard to tell.

How do the relevant portions of the master script look with respect to error handling, stderr, and stdout? Are you using the master script to kick off many slave processes? It may be that the log files are generated correctly for each of these, but not for the master. Can you comment on this?
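
One way to narrow this down is to look at the exit code SLURM recorded for each job step rather than only the overall report. A minimal sketch, assuming `sacct` accounting is enabled on your cluster; `JOBID` is a placeholder and `parse_exit_code` is a hypothetical helper name. The `ExitCode` field that `sacct` reports has the form `<exit status>:<signal>`, and the helper just splits it:

```shell
#!/bin/bash
# Split a SLURM ExitCode field ("<exit status>:<signal>") into its two parts.
# parse_exit_code is a hypothetical helper name, not part of SLURM.
parse_exit_code() {
    local field="$1"
    local status="${field%%:*}"   # text before the colon
    local signal="${field##*:}"   # text after the colon
    echo "exit=${status} signal=${signal}"
}

# On the cluster you would list the per-step codes with something like
# (JOBID is a placeholder for your job's ID):
#   sacct -j "$JOBID" --format=JobID,JobName,State,ExitCode

parse_exit_code "1:0"   # prints: exit=1 signal=0
```

If only the batch step shows a non-zero status, the failure is more likely in the wrapper script than in a process it launched.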

See also the example below for ideas.


example:

#SBATCH --job-name=${VAR}_JOB
#SBATCH --ntasks=1                              # Number of processes
#SBATCH --mem-per-cpu=8000                      # Minimum memory per allocated CPU, in MB (for a per-node limit, use --mem= instead)
#SBATCH -t 2-02:00:00                           # Runtime in D-HH:MM:SS
#SBATCH --share                                 # Allow sharing nodes with other jobs (--oversubscribe in newer SLURM releases)
#SBATCH --partition=medium                      # express(2h), short(12h), medium(2d2h), long(6d6h), interactive(2h)
#
#SBATCH --mail-user=${USER}@${emailExtension}
#SBATCH --mail-type=ALL                         # e.g. BEGIN, END, FAIL, ALL
#
#SBATCH --error=${LOG_FILE}.%j.%N.err.txt       # stderr of the job <- is this where the error ends up?
#SBATCH --output=${LOG_FILE}.%j.%N.out.txt      # stdout of the job <- or here?
#
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=${USER}@${emailExtension}   # Modify to your email address.

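Consider also this snippet, which I used in an old Sun Grid Engine parallel submission script. It captures the return code of the preceding command and passes it on as the job's exit status: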

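# Capture the exit status of the command that just ran
RC=$?
echo "$(date) RC = $RC"
# Write the marker file only if that command succeeded
if [ "$RC" -eq 0 ]; then
    date > "$DONE_FILE"
fi
# Pass the status on so the scheduler records it
exit "$RC"
EOT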
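
The same idea can be wrapped in a small function so that the status is captured immediately after the command it belongs to. This is only a sketch: `run_and_record` and the marker-file argument are hypothetical names, and how you call ELAI inside it depends on your own script.

```shell
#!/bin/bash
# Run a command, log its exit status to stderr, and create a marker file
# only on success. run_and_record is a hypothetical name for this sketch.
run_and_record() {
    local done_file="$1"
    shift
    "$@"                 # run the real command with its arguments
    local rc=$?          # capture the status before anything else runs
    echo "$(date) RC = ${rc}" >&2
    if [ "$rc" -eq 0 ]; then
        date > "$done_file"
    fi
    return "$rc"
}
```

If the last lines of the batch script are a call to `run_and_record` followed by `exit $?`, the status SLURM reports is the status of the wrapped command, and the RC line in the stderr log shows which value that was.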