I'm running STRUCTURE v 2.3.4 on an HPC cluster. I've run it successfully many times before, but I've run into a recurring problem that I can't fix.
I used StrAuto to set up command lists to run replicates and send the outputs to specific directories, and my run scripts look like this:
#!/bin/sh
## 2024-08-21
## using: Humber_PGDstr_fixed, under admixture model where alpha is allowed to vary
## burnin = 1,000,000
## iterations = 500,000
#SBATCH --account=<name>
#SBATCH --time=28-00:00:00
#SBATCH --nodes=1
#SBATCH --mem=8000
#SBATCH --ntasks-per-node=1
module load StdEnv/2020
module load gcc/9.3.0
module load nixpkgs/16.09
module load python/2.7.14
module load intel/2018.3
module load structure/2.3.4
set -eu
cat commands_run01 | parallel -j 8
mv k1 k2 k3 k4 k5 k6 k7 k8 k9 k10 results_f/
mkdir harvester_input
cp results_f/k*/*_f harvester_input
echo 'Your structure run has finished.'
# Run structureHarvester
./structureHarvester.py --dir harvester_input --out harvester --evanno --clumpp
echo 'structureHarvester run has finished.'
#Clean up harvester input files.
zip Humber_PGDstr_fixed_Harvester_Upload.zip harvester_input/*
mv Humber_PGDstr_fixed_Harvester_Upload.zip harvester/
rm -rf harvester_input
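For reference, the commands_run01 file is generated by StrAuto; its lines look roughly like the following (illustrative only — the real file has one line per replicate per K, and the seeds, params file names, and output paths come from the StrAuto config):
structure -m mainparams -e extraparams -K 1 -o k1/Humber_PGDstr_fixed_k1_rep1 -D 1234
structure -m mainparams -e extraparams -K 2 -o k2/Humber_PGDstr_fixed_k2_rep1 -D 5678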
Most of the time (but not always) the runs fail roughly 15 days in, with the Slurm output reporting: Segmentation fault (core dumped). When I check with seff, it returns:
Job ID: 44471783
Cluster: <name>
User/Group: <name>
State: FAILED (exit code 10)
Cores: 1
CPU Utilized: 15-11:51:11
CPU Efficiency: 99.49% of 15-13:46:12 core-walltime
Job Wall-clock time: 15-13:46:12
Memory Utilized: 136.53 MB
Memory Efficiency: 1.71% of 7.81 GB
I've looked around for what "exit code 10" means in Slurm, but I can't find anything beyond "some error".
I'm guessing this is likely a memory problem at the writing step, but I can't figure out which part of my instructions is incorrect, or why it fails sometimes but not always (given that all my run scripts are functionally the same).
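In case it helps, I can also pull the peak memory from the accounting records with sacct (fields straight from the sacct man page, using the job ID from the seff output above):
sacct -j 44471783 --format=JobID,State,ExitCode,MaxRSS,Elapsed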
Any ideas on what is going wrong here?
Have you looked at the logs for STRUCTURE? It appears that the jobs are failing because of an error there.
Since you are using a job scheduler, why are you using parallel? It looks like you are only using a single node but not specifying how many CPU cores. How many are you allowed to use by default?
Thanks for your reply. Yes, I checked the logs for STRUCTURE; there are no errors there, it just stops writing at some point.
I'm using parallel because we have so many replicates to run that it is generally more efficient. (I also set up the scripts using StrAuto, a helper tool that automatically configures things to run in parallel.)
I tried specifying the number of cores before, but I must have been doing it incorrectly because it failed within a couple of days. I'm not sure if there is a default, but I know I can request a 48-core node for big jobs. It's a bit of a dance, because asking for a lot (e.g., a whole node) can tank my research group's scheduling priority.
Any ideas on how to request memory more sensibly in the job script? I'm guessing that's the issue...
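For context, this is the sort of change to the script header I was imagining, matching the core count to the parallel -j 8 and switching to a per-core memory request, though I'm not sure it's the right lever (the numbers are guesses):
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=1G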
I doubt that including parallel in the mix is going to be more efficient than letting a proper job scheduler run the jobs.
It feels like there is some interaction between SLURM/parallel/STRUCTURE that is causing the problems you are having.
Based on this, it would appear that memory is not the issue, but again I am not sure how parallel is figuring into this mix.
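If you want to take parallel out of the picture, one option is a Slurm job array so the scheduler launches the replicates itself. This is only a sketch based on the paths in your script (the params file names, K range, and module list are placeholders you would need to adapt):
#!/bin/sh
#SBATCH --account=<name>
#SBATCH --time=28-00:00:00
#SBATCH --mem=1G
#SBATCH --cpus-per-task=1
#SBATCH --array=1-10
module load StdEnv/2020
module load structure/2.3.4
# One array task per value of K; STRUCTURE itself runs on a single core.
mkdir -p "k${SLURM_ARRAY_TASK_ID}"
structure -m mainparams -e extraparams -K "$SLURM_ARRAY_TASK_ID" \
  -o "k${SLURM_ARRAY_TASK_ID}/Humber_PGDstr_fixed_k${SLURM_ARRAY_TASK_ID}_rep1"
If you keep parallel instead, at least request a matching core count (e.g. --cpus-per-task=8) and pass -j "$SLURM_CPUS_PER_TASK" so the two stay in sync.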