Hello,
I am working with germline WGS data from a cohort of 2,700 patients. To study the germline variants in this cohort, I need to perform joint variant calling. I’ve started by building a GenomicsDB workspace with GenomicsDBImport (https://gatk.broadinstitute.org/hc/en-us/articles/360036883491-GenomicsDBImport) and plan to run GenotypeGVCFs on it afterward.
However, I am running into significant memory usage problems during the GenomicsDB creation step. As a workaround, I’ve been adding samples to the GenomicsDB in smaller batches (300–400 samples at a time).
If anyone here has worked with similarly large WGS cohorts or has experience with joint calling at this scale, I would greatly appreciate your recommendations and advice. I expect that subsequent steps such as GenotypeGVCFs will also be memory-intensive, so I am looking for ways to reduce resource usage across the whole workflow.
One solution I’m considering is dividing the genome into smaller intervals and processing each interval separately, but I would be grateful for any alternative approaches or optimizations you might suggest.
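To make the interval idea concrete, here is a rough Python sketch of what I have in mind. It only prints one GenomicsDBImport command per interval and does not run anything. The 5 Mb chunk size, the -Xmx8g heap setting, the batch size of 50, and the names cohort.sample_map and genomicsdb/ are placeholders I made up for illustration, not values from my actual run, and the contig lengths should really be read from the reference .fai or .dict file rather than hard-coded.

```python
import shlex


def split_contig(name, length, chunk_size):
    """Yield 1-based, inclusive intervals (contig:start-end) that cover one contig."""
    start = 1
    while start <= length:
        end = min(start + chunk_size - 1, length)
        yield f"{name}:{start}-{end}"
        start = end + 1


def import_command(interval, sample_map, workspace_root, batch_size=50):
    """Build (but do not run) a GenomicsDBImport command for a single interval.

    Each interval gets its own workspace directory under workspace_root.
    The heap size and batch size are illustrative guesses, not tuned values.
    """
    workspace = f"{workspace_root}/{interval.replace(':', '_')}"
    args = [
        "gatk", "--java-options", "-Xmx8g", "GenomicsDBImport",
        "--sample-name-map", sample_map,
        "--genomicsdb-workspace-path", workspace,
        "-L", interval,
        "--batch-size", str(batch_size),
    ]
    return " ".join(shlex.quote(a) for a in args)


if __name__ == "__main__":
    # Illustrative contig lengths only; read the real ones from the reference index.
    contigs = {"chr21": 46_709_983, "chr22": 50_818_468}
    for name, length in contigs.items():
        for interval in split_contig(name, length, 5_000_000):
            print(import_command(interval, "cohort.sample_map", "genomicsdb"))
```

The printed commands could then be submitted as independent jobs, so each one only has to hold a single interval in memory. I have not confirmed that this is the recommended layout for a cohort of this size, which is part of what I am asking.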
Thank you for your time and help!
Best regards,