Question: Genomic Read Mapping Biased Towards Coding Regions?
Asked 6.3 years ago by Vitis (New York):

I'm trying to map genomic sequencing reads (Illumina HiSeq PE100) to a related reference genome. The coding region divergence between the organism and the reference is about 1%, so I allowed 5 to 8 mismatches per 100 bp read, plus small indels, hoping this would accommodate the higher divergence expected outside the exons. But in the coverage plot, coding regions still get the most coverage. The bias is so severe that it looks like an mRNA-Seq experiment. There are regions with relatively uniform coverage outside the exons (so these should be true genomic reads), but they're much rarer than the coverage 'deserts' elsewhere. The overall coverage, estimated from k-mers, is about 5X, which could be one reason this is happening. Also, did I do anything wrong in how I approached the mapping?
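As a rough check on whether 5X coverage alone could produce the coverage 'deserts', here is a minimal sketch that assumes reads land uniformly at random, so per-base depth is approximately Poisson-distributed. That model and both function names are assumptions for illustration, not something taken from the original analysis.

```python
import math


def poisson_uncovered_fraction(mean_depth: float) -> float:
    """Expected fraction of bases with zero reads, assuming Poisson depth."""
    return math.exp(-mean_depth)


def poisson_at_most(k: int, mean_depth: float) -> float:
    """P(depth <= k) when depth is Poisson-distributed with the given mean."""
    return sum(
        math.exp(-mean_depth) * mean_depth**i / math.factorial(i)
        for i in range(k + 1)
    )


if __name__ == "__main__":
    mean_depth = 5.0  # the ~5X estimate quoted in the question
    print(f"fraction with depth 0:  {poisson_uncovered_fraction(mean_depth):.4f}")
    print(f"fraction with depth <=1: {poisson_at_most(1, mean_depth):.4f}")
```

Under this idealized model fewer than 1% of bases would be uncovered at 5X, so widespread deserts point to reads failing to map, not to low depth by itself.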

Maybe you could just say which organisms you are comparing and how distant they are.

written 6.3 years ago by Michael Dondrup

Vitis, is this whole-genome shotgun data, or did you sequence some reduced-representation library?

written 6.3 years ago by SES

These are whole-genome shotgun sequences, so they shouldn't be biased in terms of genome composition.

written 6.3 years ago by Vitis
Answer, written 6.3 years ago by Stefano Berri (Cambridge, UK):

The problem is that you are mapping to a "related reference genome". Coding regions are much more conserved than intergenic regions or introns, so reads from exons map a lot better. I suspect only a relatively small fraction of your reads map.

You will need some sort of de novo assembly, or to use your related reference genome as a scaffold (but it will not be a one-day task...).


Indeed, this sounds reasonable. IMO the mapping 'bias' is due to conservation. If you had both genome sequences and made a conservation plot, I would bet that the mapping coverage correlates with the conservation. In a sense this result is not really surprising.
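If per-window conservation and per-window read depth were both available, the suggested comparison could be sketched as a plain Pearson correlation. The `pearson` helper, the window layout, and the toy numbers below are all hypothetical and made up for illustration; they are not data from this thread.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Made-up example values: one entry per genomic window.
    window_identity = [0.99, 0.98, 0.90, 0.88, 0.97, 0.85]  # fraction identical
    window_depth = [6.1, 5.4, 0.8, 0.3, 4.9, 0.1]           # mean mapped depth
    print(f"r = {pearson(window_identity, window_depth):.2f}")
```

A strongly positive `r` on real windows would support the idea that coverage simply tracks conservation.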

written 6.3 years ago by Michael Dondrup

The idea was to capture sequences outside coding regions, because we already have coding sequences from mRNA-Seq. De novo assembly didn't work well because the overall coverage was relatively low. It makes sense that mapping correlates with conservation, but the point was that allowing more mismatches might relax that correlation, which didn't happen.

written 6.3 years ago by Vitis

5-8% mismatches is very low. That could work for human vs. chimpanzee, but as soon as you go further it does not hold. I suspect that for some species, especially plants, 5-8% divergence could occur within the same species.
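To put numbers on this, here is a minimal sketch of how quickly a fixed mismatch allowance stops working as divergence rises. It assumes substitutions are independent across sites (a binomial model) and ignores indels and sequencing errors; the function name `prob_within_mismatch_limit` is hypothetical.

```python
from math import comb


def prob_within_mismatch_limit(read_len: int, max_mismatches: int,
                               divergence: float) -> float:
    """P(a read carries at most max_mismatches substitutions), binomial model."""
    return sum(
        comb(read_len, k) * divergence**k * (1 - divergence) ** (read_len - k)
        for k in range(max_mismatches + 1)
    )


if __name__ == "__main__":
    # 100 bp reads with up to 8 mismatches allowed, as in the question.
    for d in (0.01, 0.05, 0.08, 0.12):
        p = prob_within_mismatch_limit(100, 8, d)
        print(f"divergence {d:.0%}: P(<= 8 mismatches) = {p:.3f}")
```

At 1% divergence essentially every read fits within the limit, while at 12% only a minority do, even before indels are considered. Real divergence is clustered rather than independent, so this is only a rough guide.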

written 6.3 years ago by Stefano Berri

These two are within the same genus, but definitely further away than Human/Chimp. Looks like I underestimated the divergence in the non-coding regions.

written 6.3 years ago by Vitis

I just got some Sanger sequencing results from (the ancient technology of) genome walking, and they are very interesting: genomic divergence is highly heterogeneous, ranging from no difference at all to 12%. I have to think of a good way to accommodate this in mapping.

written 6.3 years ago by Vitis