Parabricks: Number of GPUs requested (2) is more than number of GPUs (0) in the system., exiting.
4
0
Entering edit mode
3 days ago

CROSS-POSTED: https://forums.developer.nvidia.com/t/4-5-0-1-haplotype-caller-number-of-gpus-requested-2-is-more-than-number-of-gpus-0-in-the-system-exitin/344148

Hi all, I'm trying to run nvidia/parabricks on our cluster. I'm currently using an apptainer image of 'pb'. I was able to run fastq2bam without any problem, but when I use 'haplotypecaller' I get the following error:

[PB Error 2025-Sep-05 18:17:13][src/haplotype_vc.cpp:843] Number of GPUs requested (2) is more than number of GPUs (0) in the system., exiting.

The command was:

nvidia-smi 1>&2

pbrun haplotypecaller \
    --num-gpus 2 \
    --ref Homo_sapiens_assembly38.fasta \
    --in-bam "name.cram" \
    --gvcf \
    --out-variants "name.g.vcf.gz" \
    --tmp-dir TMP \
    --logfile name.hc.log

the stderr is:

INFO:    underlay of /etc/localtime required more than 50 (79) bind mounts
INFO:    underlay of /usr/bin/nvidia-smi required more than 50 (374) bind mounts
Fri Sep  5 18:17:12 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.51.03              Driver Version: 575.51.03      CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-PCIE-40GB          On  |   00000000:21:00.0 Off |                    0 |
| N/A   30C    P0             33W /  250W |       0MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-PCIE-40GB          On  |   00000000:81:00.0 Off |                    0 |
| N/A   30C    P0             33W /  250W |       0MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
[PB Info 2025-Sep-05 18:17:13] ------------------------------------------------------------------------------
[PB Info 2025-Sep-05 18:17:13] ||                 Parabricks accelerated Genomics Pipeline                 ||
[PB Info 2025-Sep-05 18:17:13] ||                              Version 4.5.0-1                             ||
[PB Info 2025-Sep-05 18:17:13] ||                         GPU-GATK4 HaplotypeCaller                        ||
[PB Info 2025-Sep-05 18:17:13] ------------------------------------------------------------------------------
[PB Error 2025-Sep-05 18:17:13][src/haplotype_vc.cpp:843] Number of GPUs requested (2) is more than number of GPUs (0) in the system., exiting.

I don't know much about working with GPUs/nvidia, and I don't understand the output of nvidia-smi ("Disabled"?). Can you please tell me what I'm doing wrong?

Pierre

haplotypecaller parabricks gpu • 6.8k views
ADD COMMENT
0
Entering edit mode

Are you running this under a job scheduler? Is there a separate partition for the GPUs / are they accessible to the scheduler?

ADD REPLY
0
Entering edit mode

GenoMax I'm using the 'GPU' queue of my cluster (SLURM). The very same config was used with another parabricks subtool and I had no problem.

ADD REPLY
0
Entering edit mode

I faced a similar issue with Parabricks version 4.3.0; using --htvc-low-memory resolved the problem.
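
For reference, with the command from the original post that would just mean adding the flag (it takes no argument, as far as I recall):

pbrun haplotypecaller \
    --htvc-low-memory \
    --num-gpus 2 \
    --ref Homo_sapiens_assembly38.fasta \
    --in-bam "name.cram" \
    --gvcf \
    --out-variants "name.g.vcf.gz" \
    --tmp-dir TMP \
    --logfile name.hc.log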

ADD REPLY
0
Entering edit mode

That option is indicated for GPUs with 16 GB of memory. Was that the case, or was this option needed to fix the error in the original post even though you had a >16 GB GPU?

ADD REPLY
0
Entering edit mode

The GPUs have 24 GB of memory each, but it only worked with the flag.

ADD REPLY
0
Entering edit mode

It doesn't work with --htvc-low-memory (same error with 4.5.0-1).

ADD REPLY
1
Entering edit mode

Is the --nv flag for apptainer there?
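
For reference, a minimal sketch of what that looks like when calling the image directly (the .sif filename is a placeholder for whatever your converted image is called):

apptainer exec --nv parabricks-4.5.0-1.sif nvidia-smi
apptainer exec --nv parabricks-4.5.0-1.sif pbrun haplotypecaller [...]

Without --nv, the NVIDIA driver libraries and device files are not made available inside the container.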

ADD REPLY
6
Entering edit mode
17 hours ago

OK, I'm embarrassed, but it was an error on my side. I use Nextflow to run Parabricks. For fq2bam there was the directive:

label "process_gpus"

while, after a copy+paste, pb_haplotypecaller still had

label "process_single"

There was a manual clusterOptions for pb_haplotypecaller that misled me: I was really sure the process was using the GPUs.

Oops
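
For anyone who lands here with the same symptom: a quick sanity check at the top of the process script would have caught this earlier (standard SLURM/CUDA environment variables; just a sketch, not Parabricks-specific):

# what did SLURM actually allocate to this job?
echo "SLURM_JOB_GPUS=${SLURM_JOB_GPUS:-unset}"
echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-unset}"
# what does the driver see from inside the job/container?
nvidia-smi -L || echo "no GPU visible here"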

ADD COMMENT
2
Entering edit mode
19 hours ago
DBScan ▴ 530

Since you were able to run fastq2bam without any issue, I would first try to run HaplotypeCaller on the exact same node you ran fastq2bam on, to exclude the possibility that the nodes are not exactly the same. If you still get the error, I would suspect a bug in Parabricks.

Have you tried running DeepVariant? It would be interesting to see if you run into the same issue. If you get the same error, I would try running the non-Parabricks DeepVariant and see if the issue persists.

Edit: Do you still have the logs of your successful fastq2bam run?
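
If it helps, a rough way to re-run on the same node with SLURM (the job ID and node name below are placeholders):

# find the node the successful fq2bam job ran on
sacct -j 123456 --format=JobID,JobName,NodeList
# pin the HaplotypeCaller job to that node
sbatch --nodelist=gpu-node-01 haplotypecaller.sbatch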

ADD COMMENT
1
Entering edit mode

Thanks, that put me on the way to understanding what happened (wrong config on my side).

ADD REPLY
3
Entering edit mode
3 days ago
Mensur Dlakic ★ 29k

How often do we get a chance to help Pierre Lindenbaum after all the help Pierre has provided? I feel like we have to make a serious effort here.

On a personal computer, nvidia-smi shows that status as N/A. On our cluster it shows Disabled, just like what you see when probing the GPUs directly. Yet all those GPUs function perfectly fine when a job is submitted via SLURM. Which is to say, I wouldn't worry about that Disabled message.

My first suggestion: make sure to run the job on a node that has GPUs, assuming that there are some CPU-only nodes.

Next, load all the CUDA/cuDNN modules in your job file before running the program. For me it would be something like this:

module load CUDA/11.4.1
module load cuDNN/8.2.2.26-CUDA-11.4.1

Next, make sure to explicitly state how many GPUs are required for your job.

#SBATCH --gpus-per-task=2

Less important, but maybe more so for you, is to specify the amount of VRAM.

#SBATCH --mem-per-gpu=40G
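
Putting those pieces together, a job file might look roughly like this (partition name, CPU count and module versions are placeholders to adapt to your site):

#!/bin/bash
#SBATCH --job-name=pb_hc
#SBATCH --partition=gpu           # placeholder: your GPU queue
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --gpus-per-task=2
#SBATCH --mem-per-gpu=40G

module load CUDA/11.4.1
module load cuDNN/8.2.2.26-CUDA-11.4.1

nvidia-smi 1>&2                   # confirm the job actually sees the GPUs

pbrun haplotypecaller \
    --num-gpus 2 \
    --ref Homo_sapiens_assembly38.fasta \
    --in-bam name.cram \
    --gvcf \
    --out-variants name.g.vcf.gz \
    --tmp-dir TMP \
    --logfile name.hc.log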
ADD COMMENT
1
Entering edit mode

For posterity, here is nvidia-smi output for our cluster:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.06              Driver Version: 555.42.06      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-PCIE-40GB          On  |   00000000:21:00.0 Off |                    0 |
| N/A   26C    P0             37W /  250W |       1MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-PCIE-40GB          On  |   00000000:81:00.0 Off |                    0 |
| N/A   25C    P0             34W /  250W |       1MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
ADD REPLY
0
Entering edit mode

Parabricks is not supported on multi-instance GPUs (MIG), so having that setting disabled is perfect.

ADD REPLY
0
Entering edit mode

Thanks for the suggestion. I've got --gres=gpu:2 in my sbatch header (so I imagine it doesn't change much (?)).

I also suspect a bug in Parabricks (?), or that something strange happened when the docker image was converted to apptainer (?).
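
For what it's worth, the conversion itself is usually just one command; the NGC path/tag below is what I'd expect for 4.5.0-1, so check the NGC catalog for the exact name:

apptainer build parabricks-4.5.0-1.sif \
    docker://nvcr.io/nvidia/clara/clara-parabricks:4.5.0-1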

ADD REPLY
1
Entering edit mode
17 hours ago

I agree with Mensur Dlakic :) We'd best make an effort here.

On our SLURM GPU cluster, GPU jobs have to use srun inside the job script itself.

I.e.:

srun pbrun haplotypecaller [...]

I still don't really know why we have to use srun in this context: without srun, the job doesn't see the GPU that has been allocated to it in the SLURM header. For requests of many CPUs or large amounts of RAM it doesn't seem to matter.

There are some similar GPU job examples on the Pawsey wiki, e.g.: https://pawsey.atlassian.net/wiki/spaces/US/pages/613089393/Using+ProteinMPNN+on+AMD+GPUs+at+Pawsey

They even ask for the GPU specifically in srun:

srun -N 1 -n 1 -c 8 --gres=gpu:1 --gpus-per-task=1 pbrun [...]

Try srun!

ADD COMMENT
