Question

Joint Calling for Large Germline WGS Cohort

2


3 months ago

j.k3096 ▴ 20

Hello,

I am working with germline WGS data from a cohort of 2,700 patients. To study the germline variants in this cohort, I need to perform joint variant calling. I’ve started by creating a GenomicsDB (https://gatk.broadinstitute.org/hc/en-us/articles/360036883491-GenomicsDBImport) and plan to use GenotypeGVCFs afterward.

However, I am running into significant memory usage problems during the GenomicsDB creation step. As a workaround, I’ve been adding samples to the GenomicsDB in smaller batches (300–400 samples at a time).

If anyone here has worked with similarly large WGS cohorts or has experience in joint calling at this scale, I would greatly appreciate your recommendations and advice. I anticipate subsequent steps like GenotypeGVCFs may also be memory-intensive, so I am looking for ways to optimize resource usage.

One solution I’m considering is dividing the genome into smaller intervals, but I would be grateful for any alternative approaches or optimizations you might suggest.
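A minimal sketch of the interval idea, assuming one GenomicsDB workspace per interval so that each import and each GenotypeGVCFs run is an independent, smaller job. The helper names, the 10 Mb chunk size, the batch size, the heap size and all paths are illustrative assumptions. Only the GATK options shown (`--genomicsdb-workspace-path`, `--sample-name-map`, `--batch-size`, `-L`, `-R`, `-V gendb://...`, `-O`) are taken from the tools themselves. The script only builds and prints commands; it does not run GATK.

```python
import shlex

# GRCh38 lengths for two chromosomes, used here only as an example;
# in practice, read all contig lengths from the reference .fai or .dict file.
CONTIG_LENGTHS = {"chr21": 46_709_983, "chr22": 50_818_468}


def split_into_intervals(contig_lengths, chunk_size):
    """Yield 1-based, inclusive intervals such as 'chr21:1-10000000'."""
    for contig, length in contig_lengths.items():
        for start in range(1, length + 1, chunk_size):
            end = min(start + chunk_size - 1, length)
            yield f"{contig}:{start}-{end}"


def workspace_for(interval, workspace_root):
    """Hypothetical naming scheme: one workspace directory per interval."""
    return f"{workspace_root}/{interval.replace(':', '_')}"


def import_command(interval, sample_map, workspace_root, batch_size=50, heap_gb=8):
    """Build a GenomicsDBImport command restricted to a single interval.

    sample_map is a tab-separated file of sample name and gVCF path.
    --batch-size limits how many gVCF readers are open at the same time.
    """
    return [
        "gatk", "--java-options", f"-Xmx{heap_gb}g",
        "GenomicsDBImport",
        "--genomicsdb-workspace-path", workspace_for(interval, workspace_root),
        "--sample-name-map", sample_map,
        "--batch-size", str(batch_size),
        "-L", interval,
    ]


def genotype_command(interval, reference, workspace_root, out_dir, heap_gb=8):
    """Build the matching GenotypeGVCFs command for the same interval."""
    out_name = interval.replace(":", "_")
    return [
        "gatk", "--java-options", f"-Xmx{heap_gb}g",
        "GenotypeGVCFs",
        "-R", reference,
        "-V", f"gendb://{workspace_for(interval, workspace_root)}",
        "-L", interval,
        "-O", f"{out_dir}/{out_name}.vcf.gz",
    ]


# Print the commands for the first interval only, as a preview.
first = next(split_into_intervals(CONTIG_LENGTHS, 10_000_000))
print(shlex.join(import_command(first, "cohort.sample_map", "genomicsdb")))
print(shlex.join(genotype_command(first, "ref.fasta", "genomicsdb", "vcf")))
```

Each printed command could be submitted as a separate cluster job, and the per-interval VCFs concatenated afterward.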

Thank you for your time and help!

Best regards,

NGS RAM cohort Genomics WGS • 894 views

updated 3 months ago by Jeremy Leipzig 23k • written 3 months ago by j.k3096 ▴ 20

score 1 · Answer 1 · 2025-07-07


3 months ago

Pierre Lindenbaum 166k

Try GLnexus: https://github.com/dnanexus-rnd/GLnexus/wiki/Getting-Started


score 1 · Answer 2 · 2025-07-07


3 months ago

DBScan ▴ 530

Another option would be Hail's VDS combiner: https://hail.is/docs/0.2/vds/hail.vds.combiner.VariantDatasetCombiner.html#hail.vds.combiner.VariantDatasetCombiner
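A rough sketch of what running the combiner could look like, assuming Hail 0.2 is installed, the gVCFs are on GRCh38, and the paths are listed one per line in a text file. The function names `parse_path_list` and `combine_gvcfs` and every path are hypothetical. `hl.vds.new_combiner`, `run()` and `hl.vds.read_vds` are the Hail calls documented at the link above.

```python
def parse_path_list(text):
    """Return the non-empty, stripped lines of a gVCF path listing."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def combine_gvcfs(gvcf_paths, output_path, temp_path):
    """Combine gVCFs into a Hail VariantDataset (VDS) and return it."""
    # Imported lazily so the helper above can be used without Hail installed.
    import hail as hl

    combiner = hl.vds.new_combiner(
        output_path=output_path,            # where the final VDS is written
        temp_path=temp_path,                # scratch space for intermediate merges
        gvcf_paths=gvcf_paths,
        use_genome_default_intervals=True,  # WGS partitioning defaults
        reference_genome="GRCh38",          # assumption: cohort is on GRCh38
    )
    combiner.run()
    return hl.vds.read_vds(output_path)


# Example usage (hypothetical paths, not executed here):
# paths = parse_path_list(open("gvcf_paths.txt").read())
# vds = combine_gvcfs(paths, "cohort.vds", "/tmp/hail-combiner")
```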


score 0 · Answer 3 · 2025-07-07

Not to get too pedantic, but joint genotyping solves a different problem (removing artefactual variants) from producing a population VCF that can be analyzed.

If you just want the latter you can ingest individual VCFs (or gVCFs) into a TileDB-VCF dataset (the GenomicsDB you mention is an ancient predecessor of TileDB).

You can then perform basic chr/pos/sample queries in Python using the TileDB-VCF open source library:

https://github.com/TileDB-Inc/TileDB-VCF

... or switch to the commercial product if you need things like distributed queries, user-defined functions & task graphs, and access management:

https://www.tiledb.com/
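For the open-source route, a minimal sketch of ingestion and a chr/pos/sample query with the `tiledbvcf` Python package, assuming it is installed and the input VCFs are bgzipped and indexed. The dataset URI, sample names, file names and the helper functions are hypothetical; the attribute names are the commonly used TileDB-VCF ones, and other attributes may be needed depending on the analysis.

```python
def region(contig, start, end):
    """Format a 1-based, inclusive region string such as 'chr1:1000-2000'."""
    return f"{contig}:{start}-{end}"


def ingest(dataset_uri, vcf_uris):
    """Create a new TileDB-VCF dataset and ingest the given VCF/gVCF files."""
    import tiledbvcf  # imported lazily; requires the tiledbvcf package

    ds = tiledbvcf.Dataset(dataset_uri, mode="w")
    ds.create_dataset()
    ds.ingest_samples(sample_uris=vcf_uris)


def query(dataset_uri, regions, samples):
    """Return a pandas DataFrame of records for the given regions and samples."""
    import tiledbvcf

    ds = tiledbvcf.Dataset(dataset_uri, mode="r")
    return ds.read(
        attrs=["sample_name", "contig", "pos_start", "pos_end", "alleles", "fmt_GT"],
        regions=regions,
        samples=samples,
    )


# Example usage (hypothetical names, not executed here):
# ingest("cohort_tiledb", ["sample1.g.vcf.gz", "sample2.g.vcf.gz"])
# df = query("cohort_tiledb", [region("chr1", 1_000_000, 1_100_000)], ["sample1"])
```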

(disclaimer: I am the product manager for TileDB-VCF)