I am trying to determine the presence or absence of a deletion over a short (< 10 bp) region in 25 Drosophila melanogaster genomes for which I have whole-genome sequence data.
I ran the standard pipeline to map the reads to a reference genome and used GATK to genotype all sites, and I found that very few of my genomes had this (supposedly) common deletion. What we think is occurring is that in many cases the deletion is in the heterozygous state, so rather than genotyping NN..N, we have a identified bases at the loci of interest in the vcf.
My question is what would be the best way to identify the deletion either from the vcf or (going further up the pipeline) from the reads in the fastq files.
For the vcf, my initial thought was this: all being equal, the read depth at sites heterozygous for the deletion should be approximately 1/2 that for adjacent sites flanking the deletion. So as a heuristic, I could check read depth at the loci of interest and compare them to neighboring sites. However, this approach isn't particularly robust due to high variance in read depth among adjacent sites.
So I'm trying to figure out how to glean this information from the reads themselves, and would appreciate any suggestions.