How to set up parallel jobs on an HPC cluster when analyzing sequencing data
3 months ago
QX ▴ 60

Hi all,

I have a lot of sequencing data and would like to speed up the analysis on our HPC cluster. I am unsure which of these approaches to use:

  1. using GNU parallel with many jobs
  2. using job arrays in sbatch
  3. submitting across multiple nodes
  4. submitting with multiple tasks
  5. optimizing CPU and memory usage

Does anyone have a (general) idea of how to approach this so that I can make the best use of HPC resources?

Best,

HPC parallel • 629 views

What I usually do is run Snakemake or Nextflow. This way you can easily submit each command to the HPC scheduler as its own job. With SLURM, for instance, each command would run on a single node with the number of cores and the amount of memory you have specified.


Hi, can you share the Snakemake or Nextflow file that you mentioned? I will look into it. However, I am wondering how you know how many resources to request for a particular dataset, say a single file of 1 GB, 10 GB, or 100 GB, or 10,000 files of only 1 MB each?


There is no file to share. Both are workflow managers: you need to write a script describing your workflow, which one of those tools then handles.


Thank you!


This depends on the HPC configuration (e.g. how much RAM and/or how many cores are available).


Most importantly, it depends on how the scripts and workflows are written. If this is a for loop, then there is no easy parallelization. If it is a function that iterates over files, a job array may help, or even parallel. @OP, you need to show how you wrote your scripts. Please post a short, representative example so people get an idea.
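
Whether you end up with an array or with parallel, a common first step is to turn the loop into an explicit list of independent work items. Here is a minimal sketch of that, assuming paired-end files named <prefix>_R1.fastq.gz and <prefix>_R2.fastq.gz; the function name list_sample_prefixes and the variable fastq_dir are made up for illustration.

```shell
#!/bin/bash
# Print one sample prefix per line for every complete R1/R2 pair in a directory.
# Assumption: files follow the <prefix>_R1.fastq.gz / <prefix>_R2.fastq.gz scheme.
list_sample_prefixes() {
    local dir=$1 r1 prefix
    for r1 in "$dir"/*_R1.fastq.gz; do
        [ -e "$r1" ] || continue              # the glob matched nothing
        prefix=${r1%_R1.fastq.gz}             # strip the R1 suffix
        if [ -e "${prefix}_R2.fastq.gz" ]; then
            printf '%s\n' "$prefix"
        else
            echo "Skipping $r1: no matching R2 file" >&2
        fi
    done
}

# Hypothetical usage (fastq_dir is an assumed variable):
#   list_sample_prefixes "$fastq_dir" > samples.txt
```

The resulting list can then be piped into parallel or indexed by a job array, so each sample is processed independently of the others.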


For example, Trim Galore:

   #!/bin/bash
   #SBATCH --job-name=run_trimgalore
   #SBATCH --output=%x_%j.out
   #SBATCH --mail-type="ALL"
   #SBATCH --partition="all"
   #SBATCH --time=48:00:00
   #SBATCH --nodes=1
   #SBATCH --ntasks=2
   #SBATCH --cpus-per-task=16
   #SBATCH --mem=100G

   # $trimgalore, $cutadapt and $trim_dir are set elsewhere. The list of sample
   # prefixes that parallel substitutes for {} was not included in the post.
   parallel -j "$SLURM_TASKS_PER_NODE" "$trimgalore --quality 30 --illumina \
   --gzip -o $trim_dir --path_to_cutadapt $cutadapt --cores $SLURM_CPUS_PER_TASK \
   --paired {}_R1.fastq.gz {}_R2.fastq.gz"

Or, for submitting multiple sbatch jobs:

#!/bin/bash
#SBATCH --job-name=run_filter_atac_master
#SBATCH --output=%x_%j.out
#SBATCH --mail-type="ALL"
#SBATCH --partition="all"
#SBATCH --time=00:10:00
#SBATCH --nodes=1
#SBATCH --cpus-per-task=4
#SBATCH --ntasks=1
#SBATCH --mem=4G

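# bam_dir and script_dir are expected to be set before this loop and to end in "/".
for bam in "${bam_dir}"*sorted.bam
do
    echo "Submitting bamfile: $bam"
    size=$(ls -s --block-size=1048576 "$bam" | cut -d' ' -f1)  # approximate file size in MB
    total=$((size / 40 + 30))                # estimated run time in minutes, plus a 30-minute margin
    hour=$((total / 60))                     # hours
    min=$((total % 60))                      # remaining minutes, always below 60
    printf -v limit '%d:%02d:00' "$hour" "$min"
    echo "Bamfile size: $size MB"
    echo "Bamfile time: $limit"
    # customize the job name and time limit; take the basename first so that
    # dots in the directory path do not truncate the sample name:
    sbatch --job-name="f_atac_$(basename "$bam" | cut -d. -f1)" --time="$limit" "${script_dir}filtering.slurm" "$bam"
    echo "Finished submitting bamfile: $bam"
done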

For the filtering script, the sbatch settings are:

#!/bin/bash
#SBATCH --output=%x_%j.out
#SBATCH --mail-type="ALL"
#SBATCH --partition="all"
#SBATCH --nodes=1
#SBATCH --cpus-per-task=16
#SBATCH --ntasks=1
#SBATCH --mem=16G

When you have access to a proper job scheduler why are you using parallel? Just submit multiple jobs (one for each sample). It is inefficient to use a for loop inside a single SLURM job. Look into job arrays instead.
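
To make the job-array suggestion concrete, here is a minimal sketch. It assumes a hypothetical file samples.txt with one sample prefix per line and that trim_galore is on the PATH; the resource values and the array range are placeholders, not recommendations. SLURM starts one copy of the script per array index, and each copy picks its own sample from the list.

```shell
#!/bin/bash
#SBATCH --job-name=trim_array
#SBATCH --output=%x_%A_%a.out   # %A = array job ID, %a = array task index
#SBATCH --array=1-10            # assumption: samples.txt has 10 lines
#SBATCH --cpus-per-task=4       # placeholder value
#SBATCH --mem=8G                # placeholder value

# Return line N (1-based) of a sample list file.
sample_for_task() {
    local list=$1 index=$2
    sed -n "${index}p" "$list"
}

# Only do the work when SLURM has set an array index for this task.
if [ -n "${SLURM_ARRAY_TASK_ID:-}" ]; then
    sample=$(sample_for_task samples.txt "$SLURM_ARRAY_TASK_ID")
    trim_galore --paired --cores "$SLURM_CPUS_PER_TASK" \
        "${sample}_R1.fastq.gz" "${sample}_R2.fastq.gz"
fi
```

Compared with a for loop inside one job, each array task is scheduled on its own, so samples start as soon as resources free up. Writing the range as --array=1-10%4 would cap the number of tasks running at the same time at four.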


Thanks for your suggestion. I will check out job arrays! By the way, can you clarify the difference between parallel and a job scheduler?


There is no way to answer this without knowing the exact steps/pipeline and the input sizes. Just to give you an obvious example: indexing a position-sorted BAM file is really fast and does not require much RAM or temporary disk space. On the other hand, mapping reads to e.g. a mammalian genome needs RAM, benefits a lot from multiple cores, etc.

In summary: use a workflow manager (Nextflow?) as already suggested and dedicate different numbers of CPUs and amounts of RAM to different steps. If possible, do some test/benchmark runs on different queues to identify the most problematic or most time-consuming steps.
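
One way to turn a benchmark run into a memory request, sketched under assumptions: after a test job finishes, sacct can report its peak memory (MaxRSS) and elapsed time, and you can add some headroom on top of the peak. The helper name suggest_mem_mb and the 20% default margin are arbitrary choices for illustration, not a rule.

```shell
#!/bin/bash
# Inspect a finished test job first (replace <jobid> with a real job ID):
#   sacct -j <jobid> --format=JobID,Elapsed,MaxRSS,ReqMem,AllocCPUS,State

# Suggest a --mem value in MB from a peak memory reading in KB.
# Accepts a plain number or a value with a trailing K, e.g. 2345678K.
# The second argument is the headroom in percent (assumed default: 20).
suggest_mem_mb() {
    local peak_kb=${1%K} headroom_pct=${2:-20}
    # Add the headroom, then round up to whole megabytes.
    echo $(( (peak_kb * (100 + headroom_pct) / 100 + 1023) / 1024 ))
}

# Hypothetical usage:
#   sbatch --mem="$(suggest_mem_mb 2345678K)M" filtering.slurm sample.bam
```

Benchmarking a small, a medium, and a large input this way gives a rough idea of how memory and run time scale with file size for each step.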

All this means little on a cluster with several queues shared by a number of users. What runs great on queue A one day may be stuck for days if you submit the same job tomorrow. This can be worked around (brainstorming mode on) by, e.g., creating two different versions of nextflow.config and launching the Nextflow pipeline as follows:

  • use config.1 if queueA is not clogged
  • use config.2 otherwise
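
Still in brainstorming mode, the choice between the two configs could be sketched like this. The partition name queueA, the threshold of 50 pending jobs, and the pipeline file main.nf are assumptions for illustration, and the number of pending jobs is only a crude proxy for how clogged a queue is.

```shell
#!/bin/bash
# Choose a config file from the number of pending jobs and a threshold.
pick_config() {
    local pending=$1 threshold=$2
    if [ "$pending" -lt "$threshold" ]; then
        echo "config.1"     # queue looks free
    else
        echo "config.2"     # queue looks clogged
    fi
}

# Count pending jobs in a partition (needs squeue on the PATH).
pending_jobs() {
    squeue -h -p "$1" -t PENDING | wc -l | tr -d ' '
}

# Hypothetical usage:
#   cfg=$(pick_config "$(pending_jobs queueA)" 50)
#   nextflow run main.nf -c "$cfg"
```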

You should also check with your local IT support, as they would be your best resource. We don't know how your cluster is set up, what limits there are on the resources your account can use at one time, or how the cluster is configured.
