I'm running the GATK on 500 samples to call variants in a few megabases of hg18, and it is going surprisingly slowly. For instance, I have UnifiedGenotyper running on some 1 kb regions at the moment, and many have been running for over 12 hours without completing. This could be because parts of the regions I'm targeting for calling were capture-targeted, and the pileup of Illumina reads aligned to those regions can be very deep. So my next experiment is to try to mitigate the effect of these deeply covered regions by running GATK with a relatively low -dcov value, say around 50. If this could be expected to substantially affect accuracy, I would be grateful to learn about it.
Here are the options I'm running GATK with, in case I'm doing something silly:
-T UnifiedGenotyper -glm BOTH -L $region \
-R .../human_b36_both.chr.fasta -o $outpath -I <bamfile> -I <bamfile> ...
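For concreteness, the downsampled run I have in mind would just add -dcov 50 to the options above. Here is a rough sketch that only builds and prints the command instead of running it. The jar name GenomeAnalysisTK.jar, the example interval, and the file names are placeholders I made up for illustration, not my real paths:

```shell
#!/bin/sh
# Sketch only: assemble the planned UnifiedGenotyper command and print it.
# All values below are hypothetical placeholders, not the real inputs.
region="chr1:1000-2000"          # hypothetical 1 kb interval
ref="reference.fasta"            # stand-in for the real reference path
outpath="calls.vcf"              # hypothetical output file
bams="sample1.bam sample2.bam"   # stand-in for the 500 BAM files
dcov=50                          # per-site downsampling target to try

# Same options as above, plus -dcov.
cmd="java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper -glm BOTH -L $region -R $ref -o $outpath -dcov $dcov"

# Append one -I argument per BAM file.
for bam in $bams; do
    cmd="$cmd -I $bam"
done

echo "$cmd"
```

Printing the command first makes it easy to check that every BAM was picked up before committing to another long run.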
Also, I understand there's a Markov chain underlying UG's calls. I suspect slow convergence might be the main factor. Is there an option to tell UG to punt on a site once the Markov chain reaches a certain length?