3 months ago
QX
Hi all,
I have a lot of sequencing data and would like to speed up the analysis on an HPC cluster. I am struggling to choose between different approaches:
- using parallel with many jobs
- using job arrays with sbatch
- submitting across multiple nodes
- submitting multiple tasks
- optimizing CPU and memory usage
Does anyone have a (general) idea of how I can approach this problem so that I can optimize HPC resource usage?
Best,
What I usually do is run Snakemake or Nextflow. This way you can easily submit each single command to the HPC. For instance, with SLURM, each command would run on a single node with the number of cores and the amount of memory you have specified.
Hi, can you share the Snakemake or Nextflow file that you mentioned? I will look into it. However, I am wondering how you know how many resources to request for a particular dataset, say 1 GB, 10 GB, or 100 GB of data, or 10,000 files of only 1 MB each?
There is no file to share. Both are workflow managers; you need to write a script for your workflow, which one of those tools will then handle.
Thank you!
This depends on the HPC configuration (e.g. how much RAM and/or how many cores are available).
Most importantly, it depends on how the scripts and workflows are written. If this is a for loop, then there is no easy parallelization. If it is a function that iterates over files, maybe a job array can help, or even parallel. @OP, you need to show how you wrote your scripts. Please post a short, representative example so people get an idea.
For example, Trimgalore
or, to submit multiple sbatch jobs
for the filtering script, the sbatch settings:
When you have access to a proper job scheduler, why are you using parallel? Just submit multiple jobs (one for each sample). It is inefficient to use a for loop inside a single SLURM job. Look into job arrays instead.
Thanks for your suggestion. I will check out job arrays! By the way, can you clarify what the difference is between parallel and a job scheduler?
There is no way to answer this without knowing the exact steps/pipeline and the input sizes. Just to give you an obvious example: indexing a position-sorted BAM file is really fast and does not require tons of RAM or temporary disk space. On the other hand, mapping reads to, e.g., a mammalian genome needs RAM, benefits a lot from multiple cores, etc.
In summary: use a workflow manager (Nextflow?) as already suggested and dedicate different numbers of CPUs and amounts of RAM to different steps. If possible, do some test/benchmark runs using different queues to identify the most problematic or most time-consuming steps.
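One way to do the test runs suggested above on SLURM is to ask the accounting database what a finished job actually used, and then size the real request from that. A minimal sketch, assuming job accounting is enabled on your cluster; the function name job_usage and the job ID are placeholders, not anything from this thread:

```shell
# Hypothetical helper: summarize what a finished SLURM job actually used.
# Assumes the cluster has job accounting enabled so that sacct returns data.
job_usage() {
    # $1 = job ID as printed by sbatch (placeholder in the example below)
    sacct -j "$1" \
        --format=JobID,JobName%20,State,Elapsed,AllocCPUS,TotalCPU,ReqMem,MaxRSS
}

# Example (placeholder job ID):
#   job_usage 12345
```

Comparing MaxRSS with ReqMem shows whether the memory request was too generous, and comparing TotalCPU with Elapsed multiplied by AllocCPUS hints at whether the extra cores were really used.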
All this means little on a cluster with several queues shared by a number of users. What runs great on queue A one day may be stuck for days if you submit the same job tomorrow. This can be fixed (brainstorming mode on) by, e.g., creating, say, two different versions of nextflow.config and launching the Nextflow pipeline:
You should also check with your local IT support, as they would be your best resource. We don't know how your cluster is set up, what limits there are on the resources your account can use at one time, or how the cluster is configured.
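For the job-array suggestion made earlier in the thread, a rough sketch of what such a script could look like is below. Everything specific in it is an assumption for illustration: the file samples.txt (one sample name per line), the _R1/_R2 FASTQ naming, the resource numbers, and the helper name nth_line. The OP's real commands were not posted, so adapt it rather than copy it.

```shell
#!/bin/bash
#SBATCH --job-name=trim_array
#SBATCH --array=1-3              # one task per line of samples.txt (assumed 3 lines)
#SBATCH --cpus-per-task=4        # illustrative numbers, tune after a test run
#SBATCH --mem=8G
#SBATCH --time=02:00:00
#SBATCH --output=trim_%A_%a.out  # %A = array job ID, %a = array task index

# Hypothetical sample list: one sample name per line.
SAMPLE_LIST="${SAMPLE_LIST:-samples.txt}"

# Print line number $1 (1-based) of file $2.
nth_line() {
    sed -n "${1}p" "$2"
}

# Only do the work when running as an array task under SLURM.
if [ -n "${SLURM_ARRAY_TASK_ID:-}" ]; then
    SAMPLE="$(nth_line "$SLURM_ARRAY_TASK_ID" "$SAMPLE_LIST")"
    echo "task ${SLURM_ARRAY_TASK_ID}: processing ${SAMPLE}"
    # Assumed paired-end naming; replace with your own command.
    trim_galore --paired --cores "${SLURM_CPUS_PER_TASK:-1}" \
        "${SAMPLE}_R1.fastq.gz" "${SAMPLE}_R2.fastq.gz"
fi
```

Saved under a name of your choice and submitted once with sbatch, this lets the scheduler start each array task as its own job wherever resources are free. That is the main contrast with parallel, which only runs several commands at once inside the single allocation it was started in.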