It is well known that GATK does not scale well when a draft assembly has many contigs. Mine (the water buffalo reference genome) has 366,989. I have successfully generated GVCFs for all 81 of my samples across the whole genome with HaplotypeCaller, without any errors, but the pipeline usually fails or freezes at the CombineGVCFs stage. I have thought of a strategy and would like to know whether there are any pitfalls or important points I need to consider.
Since I am only interested in my 'genes of interest', and genes are not present on all of my contigs, I select only those lines from my GVCF files that belong to gene-bearing contigs and discard the rest. This reduces my total number of contigs to 4,338. (I found those contigs with: grep -v '^#' GCF_000471725.1_UMD_CASPUR_WB_2.0_genomic.revised.gtf | cut -f 1 | sort -u > contigs_with_genes.txt — note that sort must run before deduplication, since uniq only collapses adjacent duplicates; sort -u does both in one step.)
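The per-sample subsetting step can be sketched in plain awk: keep every header line, plus any record whose CHROM field appears in the contig whitelist. The filenames below are demo stand-ins, not my real files; GATK SelectVariants with -L would be the more robust alternative, since it also prunes the ##contig header lines.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Demo inputs (stand-ins for the real contig list and a per-sample GVCF).
printf 'ctg1\nctg3\n' > contigs_with_genes.txt
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\nctg1\t100\nctg2\t200\nctg3\t300\n' > sample.g.vcf

# Keep header lines plus records whose CHROM is in the whitelist.
awk 'NR==FNR { keep[$1]; next }   # first file: load contig whitelist
     /^#/    { print; next }      # VCF header lines pass through
     $1 in keep                   # keep records on whitelisted contigs
' contigs_with_genes.txt sample.g.vcf > sample.subset.g.vcf

cat sample.subset.g.vcf
```

One caveat of the raw-awk approach: the subset GVCF still lists all 366,989 contigs in its ##contig header lines, which some downstream tools may choke on.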
I then split the 4,338 contigs into 23 sets (mirroring the human karyotype that GATK was designed around), split each GVCF into 23 smaller GVCFs along the same contig sets, and run CombineGVCFs on all 81 samples within each set.
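Splitting the contig list into 23 roughly equal batches is a one-liner with GNU split. The CombineGVCFs loop at the end is only a hedged sketch left as comments: the reference name, GVCF paths, and output names are assumptions, and GATK expects interval files to carry a .list or .intervals extension, hence the --additional-suffix.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Demo contig list standing in for the real 4,338-contig file.
seq -f 'ctg%g' 1 4338 > contigs_with_genes.txt

# GNU split: -n l/23 divides the lines into 23 roughly equal chunks
# without splitting any line across two files.
split -n l/23 --additional-suffix=.intervals contigs_with_genes.txt batch_

ls batch_*.intervals | wc -l   # prints 23

# One CombineGVCFs run per batch (sketch only; ref.fasta and the
# gvcfs/ directory are placeholder names):
# for b in batch_*.intervals; do
#   gatk CombineGVCFs -R ref.fasta -L "$b" \
#     $(printf -- '-V %s ' gvcfs/*.g.vcf.gz) \
#     -O combined."${b%.intervals}".g.vcf.gz
# done
```

If -L is given like this, physically splitting each GVCF beforehand may even be unnecessary, since GATK restricts itself to the listed intervals.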
I will end up with 23 combined GVCFs, run GenotypeGVCFs on each of them, and then concatenate the 23 resulting VCFs vertically to get my final VCF.
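The final "vertical merge" is a plain concatenation, because all 23 batch VCFs carry the same 81 sample columns; Picard GatherVcfs or bcftools concat are the standard tools for it. A minimal coreutils sketch (the two demo batch files below are assumptions) looks like this:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Demo batch VCFs standing in for the 23 per-batch GenotypeGVCFs outputs;
# they share an identical header because they come from the same cohort.
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\nctg1\t100\n' > batch1.vcf
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\nctg9\t250\n' > batch2.vcf

# Take the full header from the first file, then append only the
# data lines (those without a '#' prefix) from every batch in order.
grep '^#' batch1.vcf > final.vcf
for f in batch1.vcf batch2.vcf; do
  grep -v '^#' "$f" >> final.vcf
done

cat final.vcf
```

The dedicated tools are preferable in practice because they also validate that the headers and sample columns really match before concatenating.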
PS: I cannot run GenomicsDBImport (GATK's current recommendation in place of CombineGVCFs) because of its caveat that it operates on one contig at a time; even though I have access to a grid engine, my calculations suggest it would take years to finish with my number of contigs.