I'm trying to map genomic sequencing reads (Illumina HiSeq PE100) to a related reference genome. The coding-region divergence between the organism and the reference is about 1%, so I allowed 5 to 8 mismatches per 100 bp read, as well as small indels, hoping this would accommodate the higher divergence expected outside the exons. But in the coverage plot, coding regions still get most of the coverage. The bias is so severe that it looks like an mRNA-Seq experiment. Of course, there are regions outside the exons with relatively uniform coverage (so they should be true genomic reads), but they're much rarer than the coverage 'deserts' elsewhere. The overall coverage, estimated from k-mers, is about 5X, which could be one reason this is happening. Also, did I do anything wrong in the way I approached the mapping?
The problem is that you are mapping to a "related reference genome". Coding regions are much more conserved than intergenic regions or introns, so reads from exons map a lot better. I suspect only a relatively small fraction of your reads map.
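To put rough numbers on this, here is a back-of-the-envelope sketch. It assumes substitutions are independent and uniformly distributed along a 100 bp read (a simple binomial model that ignores indels and the aligner's actual scoring), and the non-coding divergence values are hypothetical, since only the 1% coding figure was given. The function name `prob_read_within_limit` is made up for illustration.

```python
from math import comb

def prob_read_within_limit(divergence, read_len=100, max_mismatches=8):
    """Probability that a read carries at most `max_mismatches` substitutions,
    assuming each base differs independently with probability `divergence`."""
    return sum(
        comb(read_len, k) * divergence**k * (1 - divergence) ** (read_len - k)
        for k in range(max_mismatches + 1)
    )

# 1% is the coding divergence from the question; the other values are
# hypothetical stand-ins for intronic/intergenic divergence.
for d in (0.01, 0.05, 0.10, 0.15):
    p = prob_read_within_limit(d)
    print(f"divergence {d:.0%}: P(read has <= 8 mismatches) = {p:.3f}")
```

Under this model almost every exonic read stays within the mismatch limit, while the mappable fraction falls off quickly once divergence reaches the read's mismatch budget, which would be consistent with the exon-heavy coverage you see.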
You will need some sort of de novo assembly, or to use your related reference genome as a scaffold (but it will not be a one-day task...).
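Before going down that road, it is worth checking what fraction of reads actually mapped. A minimal sketch, assuming your aligner wrote an uncompressed SAM file: it reads the FLAG column and counts primary records without the "unmapped" bit (0x4) set. The function name and the file name in the usage comment are hypothetical.

```python
def mapped_fraction(sam_lines):
    """Return the fraction of primary SAM records that are mapped.

    `sam_lines` is any iterable of SAM text lines (e.g. an open file)."""
    total = 0
    mapped = 0
    for line in sam_lines:
        # Skip header lines (they start with '@') and blank lines.
        if line.startswith("@") or not line.strip():
            continue
        flag = int(line.split("\t")[1])
        # Skip secondary (0x100) and supplementary (0x800) records so that
        # each read is counted once.
        if flag & 0x100 or flag & 0x800:
            continue
        total += 1
        # Bit 0x4 set means the read is unmapped.
        if not flag & 0x4:
            mapped += 1
    return mapped / total if total else 0.0

# Usage (hypothetical file name):
# with open("reads_vs_reference.sam") as handle:
#     print(f"{mapped_fraction(handle):.1%} of reads mapped")
```

If the mapped fraction turns out to be low, that supports the idea that most non-coding reads are too divergent to place on the related reference.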