What is the best way to find intervals for parallelization of joint variant calling?
In general I can think of 3 ways:
- Split on poly-N regions in the reference genome. e.g. via picard ScatterIntervalsByNs
- Split on regions without mapped reads
- Split every 1Mbp, book-ended or overlapping
With improvements in sequencing quality it is becoming more difficult to find areas that are obviously safe to split the analysis on.
- Less and smaller poly-N regions
- More regions of the genome can have reads mapped (if using e.g. (long) reads of variable length)
This leads to the risk of not much parallelization (i.e. max per chromosome) or splitting right in the middle of a variant.
What is the best way to parallelize joint variant calling?
(e.g. GATK GenomicDBImport and GenotypeGVCFs on many GVCFs)?
Without losing variants or getting double variants?