The issues you face stem from the fact that much of the genome shares sequence similarity with other regions of the genome. Much of this is the result of gene duplication events, after which the duplicated copies diverge through mutation, some acquiring new functions and others losing function to become pseudogenes. As a rough idea, there are up to 50,000 identified pseudogenes (nobody knows the exact number), which can be divided into:
- processed pseudogenes: the pseudogene derives from the processed mRNA of the original gene, reverse-transcribed and re-inserted into the genome, so it lacks introns
- unprocessed pseudogenes: the pseudogene consists of the genomic sequence of the original gene
This translates into issues with exome-seq because the capture probes (baits) used for sequence pull-down are not designed with this sequence similarity in mind. Thus, when you align the data, you can set thresholds such as minimum read length and mapping quality (MAPQ) to be high, but you will then see very low coverage over regions of high sequence similarity. On the other hand, if you relax the thresholds, you run the risk of misalignment and of making false-positive or false-negative variant calls.
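To make the MAPQ trade-off concrete, here is a minimal Python sketch that splits SAM records by mapping quality. The function name `split_by_mapq`, the cutoff of 20, and the toy records are my own illustrative assumptions, not part of any particular pipeline; in practice you would apply the same idea with your aligner's output and a proper BAM toolkit.

```python
def split_by_mapq(sam_lines, min_mapq=20):
    """Split SAM alignment records into (kept, dropped) read names by MAPQ.

    MAPQ is the 5th tab-separated column of a SAM alignment line.
    The default cutoff of 20 is an arbitrary illustrative choice.
    """
    kept, dropped = [], []
    for line in sam_lines:
        # Skip blank lines and header lines (headers start with '@').
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        mapq = int(fields[4])
        if mapq >= min_mapq:
            kept.append(fields[0])
        else:
            dropped.append(fields[0])
    return kept, dropped


# Toy records with hypothetical read names and coordinates.
toy_sam = [
    "@HD\tVN:1.6",
    "read1\t0\tchr1\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
    "read2\t0\tchr1\t200\t0\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
    "read3\t0\tchr1\t300\t23\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII",
]

kept, dropped = split_by_mapq(toy_sam, min_mapq=20)
print("kept:", kept)
print("dropped:", dropped)
```

Reads that align equally well to a gene and its pseudogene typically receive a low MAPQ, so a strict cutoff removes them and coverage drops over exactly those regions, while a cutoff of 0 keeps them along with the risk of misplacement.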
What to do?
To validate findings, you need to design primers that uniquely target the region surrounding the variant being studied. If you cannot find a unique region in close proximity, you will have to consider:
- long-range PCR
- Sanger or Roche 454 sequencing (long reads...)
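As a rough illustration of what "uniquely target" means, the sketch below counts exact occurrences of a candidate primer on both strands of a reference sequence. The function names, the toy reference, and the primer sequences are all made up for the example, and exact matching is a simplifying assumption: a real check would run a mismatch-tolerant search against the whole genome, since primers can still anneal with a few mismatches.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T, N only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def count_sites(primer, reference):
    """Count exact matches of primer in reference on either strand.

    Overlapping matches are counted. A set is used for the two query
    strings so that a palindromic primer is not counted twice.
    """
    reference = reference.upper()
    total = 0
    for query in {primer.upper(), reverse_complement(primer)}:
        start = reference.find(query)
        while start != -1:
            total += 1
            start = reference.find(query, start + 1)
    return total


# Toy reference (hypothetical): the block GACCGTAAGC occurs twice, standing in
# for sequence shared by a gene and its pseudogene.
reference = "TTGACCGTAAGCTAGGCATCGATTACGGACCGTAAGCTTTGCA"

shared_primer = "GACCGTAAGC"     # lies in the duplicated block
unique_primer = "GGCATCGATTACG"  # lies between the two copies

for name, primer in [("shared", shared_primer), ("unique", unique_primer)]:
    print(name, primer, "sites:", count_sites(primer, reference))
```

A primer with more than one site would amplify both the gene and its look-alike, which is exactly the situation to avoid when validating a variant.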
If you want assistance in designing the best possible primers, please follow the standard operating procedure that I wrote back in 2012, with which colleagues and I have had success: Designing a single set of primers and probe for a genomic region of interest. You will have to skip step 5.6, as it requires Primer Express, but that step is only needed to develop the probe that is used alongside the primer pair in real-time PCR.