Find reference genome regions spanned by only mapping quality 0 reads in multiple WGS samples
3 months ago
William ★ 5.2k

To parallelize multi-sample variant calling, I am looking for reference genome regions to split on.

With the T2T reference genomes, there are not that many polyN regions left to split on.

I am thinking about using the mapping quality 0 regions to split the multi-sample variant calling on.

I would like to find reference genome regions > 500 bp that meet all of the following:

  1. covered by mapping quality = 0 reads
  2. not covered by mapping quality > 0 reads
  3. showing this pattern in many/all WGS samples

I think it is only safe to split multi-sample variant calling on these regions when all 3 points above hold

(i.e. take the inverse of these regions as the callable regions to process in parallel).

The input would be one FASTA file and many BAM/CRAM files; the output would be a BED file.
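To make the three criteria concrete, here is a minimal sketch of the interval logic only. It is not a finished tool. It assumes that for each sample and contig two interval lists have already been extracted from the BAM/CRAM by some other means: the positions covered by any read, and the positions covered by reads with mapping quality > 0. A position covered by any read but by no MAPQ > 0 read is covered only by MAPQ = 0 reads, so the first list works equally well as the MAPQ = 0 coverage. All function names, the `min_samples` and `min_len` parameters, and the toy coordinates are hypothetical. Intervals are 0-based and half-open, as in BED.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end), 0-based like BED


def merge(intervals: List[Interval]) -> List[Interval]:
    """Sort intervals and merge any that overlap or touch."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def subtract(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Return the parts of a that are not covered by b."""
    result: List[Interval] = []
    b = merge(b)
    for start, end in merge(a):
        cursor = start
        for b_start, b_end in b:
            if b_end <= cursor:
                continue  # this b interval lies entirely before the cursor
            if b_start >= end:
                break  # remaining b intervals lie after this a interval
            if b_start > cursor:
                result.append((cursor, b_start))
            cursor = max(cursor, b_end)
        if cursor < end:
            result.append((cursor, end))
    return result


def shared_regions(per_sample: List[List[Interval]],
                   min_samples: int, min_len: int) -> List[Interval]:
    """Regions longer than min_len present in at least min_samples samples."""
    events = []
    for regions in per_sample:
        for start, end in merge(regions):
            events.append((start, 1))
            events.append((end, -1))
    events.sort()  # at equal positions, closings (-1) sort before openings (+1)
    shared: List[Interval] = []
    depth, open_start = 0, None
    for pos, delta in events:
        depth += delta
        if depth >= min_samples and open_start is None:
            open_start = pos
        elif depth < min_samples and open_start is not None:
            shared.append((open_start, pos))
            open_start = None
    return [(s, e) for s, e in merge(shared) if e - s > min_len]


def complement(regions: List[Interval], contig_len: int) -> List[Interval]:
    """Callable regions: everything on the contig outside the given regions."""
    return subtract([(0, contig_len)], regions)


def to_bed(contig: str, regions: List[Interval]) -> str:
    """Format intervals as tab-separated BED lines."""
    return "\n".join(f"{contig}\t{s}\t{e}" for s, e in regions)


# Toy data for one 5000 bp contig: (covered by any read, covered by MAPQ > 0).
samples = [
    ([(0, 5000)], [(0, 1200), (2800, 5000)]),
    ([(0, 5000)], [(0, 1000), (2500, 5000)]),
    ([(0, 5000)], [(0, 1300), (2000, 2100), (2900, 5000)]),
]
# Criteria 1 and 2 per sample: covered, but not by any MAPQ > 0 read.
mq0_only = [subtract(covered, mq_pos) for covered, mq_pos in samples]
# Criterion 3 plus the length filter: all 3 samples agree, region > 500 bp.
split_regions = shared_regions(mq0_only, min_samples=3, min_len=500)
print(to_bed("chr1", split_regions))
print(to_bed("chr1", complement(split_regions, 5000)))
```

Lowering `min_samples` relaxes "all samples" to "many samples". In the toy data, `min_samples=2` widens the split region to 1200-2800, because the short stretch of MAPQ > 0 coverage in the third sample no longer breaks it.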

mapping-quality BAM FASTA • 205 views

