Question

Parallelizing tasks according to CPU capacity

4 months ago

davidmaimoun ▴ 50

Hello, on my Ubuntu machine I have 40 CPUs. For my bioinformatics analysis, for instance to run an assembly with SPAdes, I would like to parallelize tasks using "&", but I don't know how to do that. What will happen when my computer reaches its full capacity? Let's say I have 4 samples: should I give 10 CPUs to each?

For now, I am running my samples in a for loop and giving the full CPU capacity to each sample, without parallelization, but I think there is a cleverer way to optimize it.

spades.py -1 {sample}_R1.fastq -2 {sample}_R2.fastq -t 40 -o {out_dir}

Thank you!

parallelization cpu • 850 views

ADD COMMENT • link 4 months ago by davidmaimoun ▴ 50

I would like to parallelize tasks using "&"

You would be better off using a workflow manager such as Snakemake or Nextflow.

ADD REPLY • link 4 months ago by Pierre Lindenbaum 161k

Thank you for your answer. You are right, but I am building an app in Streamlit that allows the user to see the progress of each sample and process. If I ran the whole workflow in Nextflow, I wouldn't be able to display the progress.

ADD REPLY • link 4 months ago by davidmaimoun ▴ 50

In my ubuntu distribution I have 40 cpu.

and

spades.py ..... -t 40

If you have 40 cores and you are already using them all for one job, you can't "parallelize" any further. If you start multiple jobs that each use the same 40 cores, you will simply end up with contention issues and an overall poor experience.

You could use 10 cores each and start 4 jobs in parallel, but depending on the capability of your hardware there could be bottlenecks with input/output etc., with the end result being the same as above.

Sometimes it may be worth running jobs serially, allowing each job to complete while using the available resources to their full potential.
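
The choice between one 40-thread job at a time and four 10-thread jobs comes down to keeping (number of jobs × threads per job) at or below the core count. A minimal sketch of that arithmetic in bash; the function name `threads_per_job` is made up for illustration, and `nproc` (GNU coreutils) is assumed to be available for reading the core count:

```shell
#!/bin/bash
# Hypothetical helper: split a fixed CPU budget evenly across concurrent jobs.
# usage: threads_per_job TOTAL_CPUS N_JOBS
threads_per_job() {
    local total=$1 n_jobs=$2
    local t=$(( total / n_jobs ))   # integer division; leftover cores stay idle
    if (( t < 1 )); then
        t=1                         # never ask a tool for 0 threads
    fi
    echo "$t"
}

# Cores visible to this shell (falls back to 1 if nproc is missing).
total_cpus=$(nproc 2>/dev/null || echo 1)

echo "4 parallel jobs on this machine ($total_cpus cores): $(threads_per_job "$total_cpus" 4) threads each"
echo "4 parallel jobs on 40 cores: $(threads_per_job 40 4) threads each"
```

Whether 4 × 10 actually finishes sooner than 1 × 40 run four times depends on the tool and on disk and memory limits, so it is worth timing both on a small subset first.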

ADD REPLY • link 4 months ago by GenoMax 141k

Understood! Thank you very much for the help

ADD REPLY • link 4 months ago by davidmaimoun ▴ 50

score 1 · Answer 1 · 2023-12-25

4 months ago

mixiaoluo88 ▴ 10

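Here is a simple shell script for running tasks in parallel:

#!/bin/bash

# Maximum number of tasks allowed to run at the same time
num_threads=4

for i in $(cat files.txt); do
    (
        # copy your code here
        echo "starting task $i.."
        sleep $(( (RANDOM % 3) + 1 ))
    ) &

    # Once the limit is reached, wait for any one task to finish
    # before starting the next (wait -n requires bash 4.3 or later)
    if [[ $(jobs -r -p | wc -l) -ge $num_threads ]]; then
        wait -n
    fi

done

# Wait for the remaining tasks
wait

echo "all tasks done"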
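
As a rough sketch of how this pattern could be combined with a per-job thread budget and one log file per sample (which a front end could read to show progress), assuming bash 4.3 or later: the sample names, the `run_sample` function, and the log directory are all placeholders, and the real `spades.py` call is only shown in a comment so that the script runs as is.

```shell
#!/bin/bash
# Sketch: at most $max_jobs samples at a time, each given an equal share of the cores.
total_cpus=40                 # value from the question; adjust to your machine
max_jobs=4
threads=$(( total_cpus / max_jobs ))

# Placeholder for the real work. Replace the echo/sleep lines with e.g.
#   spades.py -1 "${sample}_R1.fastq" -2 "${sample}_R2.fastq" -t "$threads" -o "$out_dir/$sample"
run_sample() {
    local sample=$1
    echo "start $sample with $threads threads"
    sleep 1
    echo "done $sample"
}

log_dir=$(mktemp -d)          # one log per sample, so progress can be followed per sample

for sample in s1 s2 s3 s4 s5 s6; do       # made-up sample names
    run_sample "$sample" > "$log_dir/$sample.log" 2>&1 &

    # If the limit is reached, block until any one job exits.
    if (( $(jobs -r -p | wc -l) >= max_jobs )); then
        wait -n
    fi
done

wait                          # wait for the remaining jobs

cat "$log_dir"/*.log
```

Each sample's log can be followed while it runs (for example with `tail -f`), which is one way to surface progress without a workflow manager.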

ADD COMMENT • link 4 months ago by mixiaoluo88 ▴ 10

Very useful thank you!

ADD REPLY • link 4 months ago by davidmaimoun ▴ 50

score 1 · Answer 2 · 2023-12-26

Hi @davidmaimoun,

If the running job already uses the full capacity of the CPUs, you do not need any parallelization; starting more jobs on top of it only makes them compete for the same cores and memory, and jobs may be terminated if the machine runs out of memory. But if all CPUs are allocated to a single job and it does not fully utilize them, you might need a task scheduler, which is a bit complicated (at least for me!). Instead, I used GNU Parallel, which is easy to use. In my case, I was looking for a way to speed up variant calling with GATK HaplotypeCaller on human exome data. HaplotypeCaller does provide a parameter to set the number of threads for a single task, --native-pair-hmm-threads, but it only used about 25% of each core, which was not acceptable for me. Here is the GNU Parallel command that works for my needs:

cat ListOfBAMs.txt | parallel -j 100% java -jar gatk-package-4.2.6.1-local.jar HaplotypeCaller --native-pair-hmm-threads 128 -I {} -O {.}.vcf -R "$ref" -L "$bed"

Make sure to read the manual for a better understanding. Check out the --memfree and --memsuspend parameters to avoid jobs being terminated silently.
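
For the SPAdes case from the original question, the same idea might look like the sketch below. It assumes GNU Parallel is installed; `samples.txt` (one sample name per line), the `_out` directory suffix, and the 8G threshold are placeholders, not values from this thread. The script only builds and prints the command so it can be inspected before anything is run.

```shell
#!/bin/bash
total_cpus=40                 # value from the question; adjust to your machine
n_jobs=4                      # samples processed at the same time
threads=$(( total_cpus / n_jobs ))

# -j "$n_jobs"   : run at most n_jobs samples concurrently
# --memfree 8G   : only start another job while at least 8 GB of RAM is free (placeholder value)
# {}             : replaced by each line read from samples.txt (hypothetical file)
cmd=(parallel -j "$n_jobs" --memfree 8G
     spades.py -1 '{}_R1.fastq' -2 '{}_R2.fastq' -t "$threads" -o '{}_out'
     :::: samples.txt)

# Print the command for inspection; run it with "${cmd[@]}" once it looks right.
printf '%s\n' "${cmd[*]}"
```

Adding `--dry-run` to the `parallel` options is another way to see the per-sample commands without executing them.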