Your assumptions are correct, and it's important to realise that a large chunk of the human genome exhibits some level of homology (as a rough rule of thumb, a DNA sequence is often regarded as homologous if it exhibits at least 30% similarity to one or more other regions). Don't quote me, but I read somewhere that >50% of all genes have a processed or unprocessed pseudogene elsewhere in the genome. Thus, the 'expected off-target' regions provided by Illumina for your panel are most likely regions that exhibit high homology to your primary target regions of interest.
All of this poses great difficulties for alignment tools, which have to faithfully map each read to a position in the genome. If a read maps equally well to more than one location, its mapping quality will suffer; if it maps to just a single region, it will generally have a high mapping quality. Base errors in the reads do not help either, as they further reduce mapping quality and make the aligner's task more difficult.
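To make 'mapping quality' concrete: MAPQ is Phred-scaled, i.e. it is -10*log10 of the estimated probability that the reported mapping position is wrong. The snippet below is only a sketch of that conversion. The function names and the cap of 60 are my own assumptions for illustration; each aligner estimates the probability in its own way and uses its own maximum value.

```python
import math


def mapq_to_error_prob(mapq: float) -> float:
    """Convert a Phred-scaled MAPQ into the probability that the mapping is wrong."""
    return 10 ** (-mapq / 10)


def error_prob_to_mapq(p_wrong: float, cap: int = 60) -> int:
    """Convert a probability of mis-mapping into a Phred-scaled MAPQ.

    `cap` is an assumed upper limit; the real maximum is aligner-specific.
    """
    if p_wrong <= 0:
        return cap
    return min(cap, round(-10 * math.log10(p_wrong)))


if __name__ == "__main__":
    # Print the implied mis-mapping probability for a few MAPQ values.
    for q in (0, 3, 20, 40):
        print(f"MAPQ {q}: P(wrong position) = {mapq_to_error_prob(q):.4g}")
```

A read that fits two locations equally well has roughly a 50% chance of being placed wrongly, which corresponds to a MAPQ of about 3. That is why reads from highly homologous regions end up with such low values.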
This issue also partly explains the very uneven depth-of-coverage profile that you get with this type of sequencing, whereby one region may have >1000 reads mapped to it, whereas others may have just 20 (other reads that could have mapped to it were 'robbed' by homologous regions during PCR amplification and/or during in silico alignment).
From my experience of targeted sequencing using Illumina's kits, the proportion of off-target reads is generally 30-40% of all reads (i.e. 30-40% of reads in each sample will map to regions outside of the primary regions of interest). There is not much that you can do about this other than work with Illumina to try to reduce the problem.
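If you want to check this figure on your own samples, the calculation is simple: count the reads that do not overlap any interval in your target BED file and divide by the total. Below is a minimal pure-Python sketch. The function names are hypothetical, and it assumes that you have already extracted each read's alignment as a (chrom, start, end) tuple in 0-based, half-open coordinates, as in BED. For real data you would more likely use a dedicated tool, since this scans every interval per read.

```python
from collections import defaultdict


def parse_bed(lines):
    """Parse BED lines into {chrom: [(start, end), ...]} (0-based, half-open)."""
    regions = defaultdict(list)
    for line in lines:
        line = line.strip()
        # Skip blank lines and the optional header lines found in some BED files.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        regions[chrom].append((int(start), int(end)))
    return regions


def overlaps(regions, chrom, start, end):
    """True if [start, end) overlaps any interval for `chrom` by at least 1 bp."""
    return any(start < r_end and r_start < end
               for r_start, r_end in regions.get(chrom, []))


def off_target_fraction(reads, targets):
    """Fraction of reads that overlap no target interval."""
    reads = list(reads)
    if not reads:
        return 0.0
    off = sum(1 for chrom, start, end in reads
              if not overlaps(targets, chrom, start, end))
    return off / len(reads)


if __name__ == "__main__":
    # Made-up coordinates, purely for illustration.
    targets = parse_bed(["chr1\t100\t200"])
    reads = [("chr1", 150, 250), ("chr1", 300, 400), ("chr2", 100, 200)]
    print(off_target_fraction(reads, targets))
```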
Many regions of the genome are just not suited to massively parallel sequencing using short reads - the data from these regions cannot be trusted because they exhibit high homology to other regions in the genome. The way to tackle these is with long-range PCR or Sanger sequencing, where you can design primers far outside your region of interest in a region of unique sequence.
From an analysis perspective, I manage this issue as follows:
- Trim bases off the ends of reads that fall below a Phred-scaled quality score of 30
- Eliminate short reads (<50 or 70bp)
- Only include uniquely-mapped reads (Bowtie allows this) or filter out reads with MAPQ<40 or 50 (BWA)
- Use a BED file to filter out all reads or variants called in the
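The first three steps above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not a replacement for proper tools: the `Read` class and the function names are hypothetical, the qualities are assumed to be already decoded to integer Phred scores, and the thresholds are just the ones mentioned in the list. In a real pipeline the trimming and length filtering are done on the FASTQ before alignment and the MAPQ filtering afterwards; they are combined on one record here only for brevity.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Read:
    """Hypothetical minimal read record, for illustration only."""
    name: str
    seq: str
    quals: List[int]  # per-base Phred scores, already decoded to integers
    mapq: int         # mapping quality reported by the aligner


def trim_low_quality_ends(read: Read, min_qual: int = 30) -> Read:
    """Remove bases from both ends until a base with quality >= min_qual is met."""
    start, end = 0, len(read.quals)
    while start < end and read.quals[start] < min_qual:
        start += 1
    while end > start and read.quals[end - 1] < min_qual:
        end -= 1
    return Read(read.name, read.seq[start:end], read.quals[start:end], read.mapq)


def passes_filters(read: Read, min_len: int = 50, min_mapq: int = 40) -> bool:
    """Keep reads that are long enough and confidently mapped."""
    return len(read.seq) >= min_len and read.mapq >= min_mapq


def filter_reads(reads, min_qual: int = 30, min_len: int = 50, min_mapq: int = 40):
    """Trim each read, then drop those that fail the length or MAPQ threshold."""
    kept = []
    for read in reads:
        trimmed = trim_low_quality_ends(read, min_qual)
        if passes_filters(trimmed, min_len, min_mapq):
            kept.append(trimmed)
    return kept
```

Note that the maximum MAPQ differs between aligners, so a threshold of 40 or 50 only makes sense if your aligner can actually report values that high.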
Other people will of course have their own ideas, which are welcome.
I really appreciate your question, as it touches on a major issue in next-generation sequencing.