Question: Mapping target-enriched reads to a transcriptome reference
6.3 years ago, matt.christmas85 wrote:


Hi all,

I have carried out transcriptome sequencing on a plant species and, from the resulting assembled contigs, designed capture probes for 970 gene regions. I then carried out hybrid-capture target enrichment on whole genomic DNA, followed by Illumina 100 bp paired-end sequencing, for 95 samples. My aim is to call SNP variants within these 970 gene regions across all my samples in order to look at neutral as well as adaptive processes. Before I get to that stage, though, there are a few things I am unsure about:

1) When I map the reads for an individual back to the 970 contig sequences the probes were designed on, only 15-30% of the reads map, even with low mapping stringencies such as 50% overlap and 80% similarity. Could this be a result of the probes pulling down a lot of sequence outside of what I was targeting, such as introns, promoter regions, etc.?

2) I do not have a reference genome for this species, so my plan was to map the reads back to the transcriptome I assembled and call variants based on that. However, as the transcriptome sequences don't contain any introns, am I going to have issues reliably mapping the captured sequences (which may contain parts of introns, promoter regions, etc.) to this transcriptome reference? And could this also be why I seem to be getting a large number of broken pairs in the mappings?
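
To make the concern in points 1) and 2) concrete, here is a minimal simulation sketch. It is not taken from the original analysis: the gene layout, the exon coordinates, and the function name `classify_reads` are all hypothetical. It samples 100 bp genomic reads uniformly from a toy gene and counts how many lie entirely inside an exon (and so could align end-to-end to an intron-free transcript), how many straddle an exon boundary, and how many contain no exonic sequence at all.

```python
import random


def classify_reads(exons, gene_length, read_len=100, n_reads=10000, seed=0):
    """Classify simulated genomic reads by how they overlap exons.

    exons: list of non-overlapping, non-adjacent (start, end) half-open
           intervals on the genomic sequence (hypothetical coordinates).
    """
    rng = random.Random(seed)
    counts = {"exonic": 0, "boundary": 0, "non_exonic": 0}
    for _ in range(n_reads):
        start = rng.randrange(0, gene_length - read_len + 1)
        end = start + read_len
        # Number of read bases that fall inside any exon.
        overlap = sum(max(0, min(end, e) - max(start, s)) for s, e in exons)
        if overlap == read_len:
            counts["exonic"] += 1        # alignable end-to-end to the transcript
        elif overlap == 0:
            counts["non_exonic"] += 1    # intron / flanking sequence only
        else:
            counts["boundary"] += 1      # part exon, part intron or flank
    return counts


# Toy gene: three exons separated by introns, with short flanking regions.
toy_exons = [(200, 500), (900, 1150), (1600, 2000)]
counts = classify_reads(toy_exons, gene_length=2200)
for label, n in counts.items():
    print(f"{label}: {n / sum(counts.values()):.1%}")
```

In a real capture experiment the reads are not uniform across the locus (coverage is concentrated around the baits), so the proportions from this toy model should not be read as predictions. It only shows that, with short exons and 100 bp reads, a sizeable share of on-target genomic reads can be partly or wholly absent from a transcriptome reference, and that a read pair with one mate in an intron would appear as a broken pair.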

Any help/advice with this would be greatly appreciated!

Thanks, Matt

6.3 years ago, Sean Davis (National Institutes of Health, Bethesda, MD) wrote:

Capture efficiencies of hybrid capture vary, but a capture efficiency of 50% or so would not be unusual. That, combined with the fact that you are mapping DNA back to RNA, could very easily lead to the mapping issues you are seeing. While it is disheartening to see so much of your data falling through the cracks, I suspect that the data that is mapping is reasonably usable (with the caveat that there may be a significant false positive rate for SNPs at exon boundaries).
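
One possible way to act on the exon-boundary caveat is sketched below. It is a heuristic of my own choosing, not something proposed in the answer above. It assumes the reads were aligned with a local aligner that soft-clips read ends, so that genomic reads running off an exon into an intron leave a pile-up of clip breakpoints at the same position in the transcript contig. SNPs within a small window of such a pile-up are flagged for closer inspection. The function names, the window size, and the support threshold are hypothetical, and for simplicity the sketch handles a single contig. It parses only the standard SAM `POS` and `CIGAR` fields.

```python
import re
from collections import Counter

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def clip_breakpoints(pos, cigar):
    """Return 1-based reference positions where a read is clipped.

    pos is the leftmost aligned reference position (SAM POS field).
    A leading clip is reported at the first aligned base, a trailing
    clip at the last aligned base.
    """
    breakpoints = []
    ref = pos
    seen_aligned = False
    for length, op in ((int(n), o) for n, o in CIGAR_RE.findall(cigar)):
        if op in "SH":
            breakpoints.append(ref - 1 if seen_aligned else ref)
        elif op in "MDN=X":
            ref += length          # these operations consume reference bases
            seen_aligned = True
        # I and P consume no reference bases
    return breakpoints


def flag_snps_near_clips(snp_positions, alignments, window=10, min_support=2):
    """Flag SNPs lying within `window` bp of a recurrent clip breakpoint.

    alignments: iterable of (pos, cigar) tuples for one contig.
    """
    support = Counter()
    for pos, cigar in alignments:
        for bp in set(clip_breakpoints(pos, cigar)):
            support[bp] += 1
    flagged = []
    for snp in snp_positions:
        nearby = sum(n for bp, n in support.items() if abs(bp - snp) <= window)
        if nearby >= min_support:
            flagged.append(snp)
    return flagged


# Hypothetical alignments on one contig: three reads are clipped at position 100.
alignments = [(100, "30S70M"), (100, "25S75M"), (31, "70M30S"), (50, "100M")]
print(flag_snps_near_clips([95, 105, 140], alignments))
```

Flagged sites are candidates for manual review or exclusion rather than confirmed artefacts; an aligner run in end-to-end mode produces no soft clips, in which case a cluster of mismatches near read ends would be the signal to look for instead.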


Thanks Sean, that's as I suspected. With ~4 million reads per individual, 20% mapping is still a lot of data, so, as you say, it should still be reasonably usable.

Reply written 6.3 years ago by matt.christmas85
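
The arithmetic behind that reply can be sketched as follows. The read count, mapping rate, and read length come from the thread; the total target size is not stated anywhere, so the 1 kb mean contig length used here is purely an assumption for illustration, as is the function name `mapped_yield`.

```python
def mapped_yield(total_reads, mapped_fraction, read_len, total_target_bp):
    """Return (mapped reads, mapped bases, naive mean depth over the targets)."""
    mapped_reads = total_reads * mapped_fraction
    mapped_bases = mapped_reads * read_len
    mean_depth = mapped_bases / total_target_bp
    return mapped_reads, mapped_bases, mean_depth


# ~4 million reads, 20% mapping, 100 bp reads (from the thread);
# 970 targets at an ASSUMED mean length of 1,000 bp each.
reads, bases, depth = mapped_yield(4_000_000, 0.20, 100, 970 * 1000)
print(f"mapped reads: {reads:,.0f}")
print(f"mapped bases: {bases:,.0f}")
print(f"naive mean depth: {depth:.0f}x")
```

This is an upper bound on useful depth: overlapping mates, PCR duplicates, and uneven capture across targets would all reduce the effective coverage at any given site.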