Genomic Read Mapping Biased Towards Coding Regions?
12.1 years ago
Vitis ★ 2.5k

I'm trying to map genomic sequencing reads (Illumina HiSeq PE100) to a related reference genome. Coding-region divergence between the organism and the reference is about 1%, so I allowed 5 to 8 mismatches per 100 bp read, plus small indels, hoping this would accommodate the higher divergence expected outside the exons. But in the coverage plot, coding regions still get the most coverage. The bias is so severe that it looks like an mRNA-Seq experiment. There are regions with relatively uniform coverage outside the exons (so those should be true genomic reads), but they are much rarer than the coverage 'deserts' elsewhere. The overall coverage, estimated from k-mers, is about 5X, which may be one reason this is happening. Also, did I do anything wrong in the way I approached the mapping?
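
As a rough sanity check on whether 5X coverage alone could produce large 'deserts', here is a minimal sketch under the simplifying assumption that read starts are uniformly random, so per-base depth is approximately Poisson-distributed. The function name `fraction_covered` is made up for illustration; real libraries deviate from this model (GC bias, repeats, unmappable regions).

```python
import math

def fraction_covered(mean_depth: float, min_depth: int = 1) -> float:
    """Poisson estimate of the fraction of bases covered by >= min_depth reads.

    Assumes uniformly random read placement, which is an idealisation.
    """
    # Probability that a base is covered by fewer than min_depth reads.
    p_below = sum(
        math.exp(-mean_depth) * mean_depth ** k / math.factorial(k)
        for k in range(min_depth)
    )
    return 1.0 - p_below

if __name__ == "__main__":
    for depth in (1, 2, 5, 10):
        print(f"mean depth {depth}X: "
              f"fraction covered at least once = {fraction_covered(depth):.4f}")
```

If the printed fraction at 5X is far higher than what the coverage plot shows, that would suggest the deserts come from reads failing to map rather than from sampling depth alone.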

coverage mapping illumina hiseq comparative • 2.9k views

Maybe you could just say which organisms you are comparing and how distant they are.


vitis, is this whole genome shotgun data or some reduced representation library that you have sequenced?


These are whole genome shotgun sequences, so they shouldn't be biased in terms of genome composition.

12.1 years ago

The problem is that you are mapping to a "related reference genome". Clearly, coding regions are much more conserved than intergenic regions or introns, so reads from exons map a lot better. I suspect only a relatively small fraction of your reads map.

You will need some sort of de novo assembly, or to use your related reference genome as a scaffold (but it will not be a one-day task...)


Indeed, this sounds reasonable. Imo the mapping 'bias' is due to conservation. If you had both genome sequences and made a conservation plot, then I would bet that the mapping correlates with the conservation. In a sense this result is not really surprising.
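
The check suggested here could be sketched roughly as follows: average per-base read depth and per-base sequence identity over fixed windows, then compute a Pearson correlation between the two. Everything in this snippet is an assumption for illustration, including the helper names `window_means` and `pearson`, the window size, and the toy numbers, which are made up and not taken from any real data set.

```python
from math import sqrt

def window_means(values, window):
    """Mean of each consecutive, non-overlapping window (last one may be shorter)."""
    chunks = [values[i:i + window] for i in range(0, len(values), window)]
    return [sum(chunk) / len(chunk) for chunk in chunks]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Made-up toy data: per-base read depth and per-base identity to the reference.
depth = [6, 5, 7, 6, 0, 1, 0, 0, 5, 6, 4, 5, 1, 0, 0, 1]
identity = [0.99, 0.99, 1.00, 0.98, 0.85, 0.88, 0.86, 0.84,
            0.99, 0.97, 0.98, 0.99, 0.87, 0.90, 0.86, 0.85]

r = pearson(window_means(depth, 4), window_means(identity, 4))
print(f"correlation between windowed depth and identity: {r:.3f}")
```

In practice the depth track would come from the alignment and the identity track from a genome-to-genome alignment, which is only possible once both sequences exist.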


The idea was to capture sequences outside coding regions, because we already have coding sequences from mRNA-Seq. De novo assembly didn't work well because the overall coverage was relatively low. It makes sense that mapping correlates with conservation, but the point was that allowing more mismatches might relax that correlation, which didn't happen.


5-8% mismatches is very low. That could work for human vs. chimpanzee, but as soon as you go further it does not hold. I suspect that for some species, especially plants, 5-8% divergence could occur within the same species.


These two are within the same genus, but definitely further away than Human/Chimp. Looks like I underestimated the divergence in the non-coding regions.


I just got some Sanger sequencing results from (the ancient technology of) genome walking, and they are very interesting: genomic divergence is highly heterogeneous, ranging from no difference at all to 12%. I have to think of a good way to accommodate this in mapping.
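
To get a feel for how a fixed mismatch cap interacts with divergence anywhere between 0 and 12%, here is a rough sketch using a binomial model. It assumes substitutions are independent along the read, ignores indels and sequencing errors, and treats a read as mappable only if its mismatch count is at or below the cap. The function name `prob_within_cap` is hypothetical, and real aligners score reads differently, so treat the numbers as an order-of-magnitude guide only.

```python
from math import comb

def prob_within_cap(read_len: int, divergence: float, max_mismatches: int) -> float:
    """Binomial probability that a read carries <= max_mismatches substitutions.

    Assumes independent substitutions at rate `divergence`; no indels,
    no sequencing error.
    """
    return sum(
        comb(read_len, k) * divergence ** k * (1.0 - divergence) ** (read_len - k)
        for k in range(max_mismatches + 1)
    )

if __name__ == "__main__":
    read_len = 100
    for cap in (5, 8):
        for divergence in (0.00, 0.01, 0.03, 0.05, 0.08, 0.12):
            p = prob_within_cap(read_len, divergence, cap)
            print(f"cap={cap}, divergence={divergence:.0%}: "
                  f"P(read within cap) = {p:.3f}")
```

Under this model, regions at the high end of the observed divergence range would lose most of their reads even at the 8-mismatch cap, which is consistent with the patchy coverage described in the question.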
