To parallelize multi-sample variant calling, I am looking for reference genome regions to split on.
With the T2T reference genomes, there are not that many polyN regions left to split on.
I am thinking about splitting the multi-sample variant calling on mapping quality 0 regions instead.
I would like to find reference genome regions longer than 500 bp that meet these criteria:
- covered by reads with mapping quality = 0
- not covered by any reads with mapping quality > 0
- showing this pattern in many or all WGS samples
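
To make the three criteria concrete, here is a minimal sketch in plain Python. It assumes that, for one contig, each sample has already been reduced to two sorted, non-overlapping lists of 0-based half-open intervals: positions covered by MAPQ = 0 reads and positions covered by MAPQ > 0 reads. How those intervals are extracted from the BAM/CRAM files is left open here. The function names and the `min_samples` parameter are hypothetical, not taken from any existing tool.

```python
def subtract(a, b):
    """Return the parts of intervals `a` not covered by intervals `b`.

    Both inputs are sorted, non-overlapping lists of (start, end) tuples,
    0-based and half-open as in BED.
    """
    out = []
    j = 0
    for start, end in a:
        cur = start
        # Skip intervals of b that end before this interval of a begins.
        while j < len(b) and b[j][1] <= cur:
            j += 1
        k = j
        while k < len(b) and b[k][0] < end:
            if b[k][0] > cur:
                out.append((cur, b[k][0]))
            cur = max(cur, b[k][1])
            k += 1
        if cur < end:
            out.append((cur, end))
    return out


def consensus(per_sample, min_samples=None, min_len=500):
    """Keep stretches present in at least `min_samples` samples (default: all)
    that are longer than `min_len` bp."""
    need = len(per_sample) if min_samples is None else min_samples
    events = []
    for intervals in per_sample:
        for start, end in intervals:
            events.append((start, 1))
            events.append((end, -1))
    # At equal positions, process starts before ends so that abutting
    # intervals from different samples do not split a region.
    events.sort(key=lambda ev: (ev[0], -ev[1]))
    out, depth, open_at = [], 0, None
    for pos, delta in events:
        before = depth
        depth += delta
        if before < need <= depth:
            open_at = pos
        elif depth < need <= before:
            if pos - open_at > min_len:
                out.append((open_at, pos))
            open_at = None
    return out


# Toy data for one contig: (MAPQ = 0 intervals, MAPQ > 0 intervals) per sample.
samples = [
    ([(1000, 2000)], [(0, 1000), (1900, 3000)]),
    ([(900, 1800)], [(0, 950)]),
    ([(1000, 1700), (5000, 5200)], []),
]
mq0_only = [subtract(mq0, mqpos) for mq0, mqpos in samples]
split_all = consensus(mq0_only)                 # pattern in all samples
split_two = consensus(mq0_only, min_samples=2)  # pattern in at least 2 samples
```

Requiring all samples gives the most conservative split regions; lowering `min_samples` corresponds to the "many" rather than "all" reading of the third criterion.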
I think it is only safe to split multi-sample variant calling on these regions when all 3 points above hold
(i.e. take the inverse of these regions as the callable regions to process in parallel).
The input would be 1 FASTA file and many BAM/CRAM files; the output would be a BED file.
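
As a rough illustration of the final step, the sketch below takes the inverse of a set of split regions and formats it as BED. It assumes contig lengths come from the first two columns of the FASTA index (`.fai`), and that the split regions are already known per contig. The function names and the toy values are hypothetical.

```python
def read_fai(lines):
    """Parse contig names and lengths from the lines of a FASTA index (.fai)."""
    lengths = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            lengths[fields[0]] = int(fields[1])
    return lengths


def callable_regions(contig_lengths, split_regions):
    """Return the complement of the split regions on every contig.

    contig_lengths: dict of contig name -> length
    split_regions:  dict of contig name -> list of (start, end), 0-based half-open
    """
    out = []
    for contig, length in contig_lengths.items():
        cur = 0
        for start, end in sorted(split_regions.get(contig, [])):
            if start > cur:
                out.append((contig, cur, start))
            cur = max(cur, end)
        if cur < length:
            out.append((contig, cur, length))
    return out


def to_bed(regions):
    """Format (contig, start, end) tuples as tab-separated BED lines."""
    return "".join(f"{c}\t{s}\t{e}\n" for c, s, e in regions)


# Toy example with made-up contig lengths and split regions.
fai_lines = ["chr1\t10000\t6\t60\t61\n", "chr2\t3000\t10180\t60\t61\n"]
lengths = read_fai(fai_lines)
splits = {"chr1": [(1000, 1700), (5000, 5600)]}
regions = callable_regions(lengths, splits)
bed_text = to_bed(regions)
```

Each resulting BED line could then be handed to one variant calling job, with contigs that have no split region processed whole.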