Question: Why Are There Many RNA-Seq Hits To Intronic Regions?
35
lh3 (29k), United States, wrote 5.1 years ago:

I am looking at RNA-seq data, with which I have little experience. I notice that for many genes, there are reliable alignments (i.e. with high mapping quality) to introns. I understand that some of them are due to unannotated transcripts, but in many regions this does not seem to be the major cause. The intronic read hits do not seem to be purely caused by alignment artifacts either, because the pattern is tissue-specific (though this is not compelling evidence). Another possible explanation is that this observation is due to noisy transcripts (Pickrell et al., 2010), but this seems to be a big effect: for some long genes, there are far more intronic hits than exonic hits.

I guess those who study RNA-seq data must have noticed the intronic hits for years. What is the cause of the large number of intronic read hits? Is it caused by alignment/library prep artifacts or noisy transcription? Are there papers addressing this? Thanks.

EDIT: my conclusion. I was looking at ERR030882 from Illumina BodyMap (brain). The sample was processed with oligo-dT. I am using the GENCODE exon annotations, including all the pseudogenes, lincRNAs and known processed transcripts, totalling ~112 Mbp. The initial analysis reveals that ~80% of bases map to exons. Nonetheless, if I only look at read pairs with insert size larger than 311 bp (~10% of the original data), 98.2% of these spliced read pairs are mapped to known exons, suggesting that the vast majority of the intronic and intergenic read pairs are unspliced. It is possible that some unspliced pairs come from unknown single-exon transcripts with an intact polyA tail, but contamination seems to be the leading cause overall.
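The check described in the edit above (select read pairs whose span on the genome exceeds a threshold, then ask how many have both mates in annotated exons) can be sketched roughly as follows. This is a minimal illustration, not the analysis actually run: the tuple layout, the function names and the single-chromosome simplification are all assumptions, and only the 311 bp threshold comes from the post. A real analysis would read alignments from a BAM file and exons from the GENCODE annotation.

```python
from bisect import bisect_right

def merge_intervals(intervals):
    """Merge overlapping half-open (start, end) intervals into a sorted union."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(iv) for iv in merged]

def exonic_fraction_of_long_pairs(pairs, exons, min_span=311):
    """Fraction of long-span pairs whose two mates both overlap an exon.

    pairs: hypothetical (start1, end1, start2, end2) tuples on one chromosome,
           half-open coordinates, mate 1 being the leftmost mate.
    exons: (start, end) half-open exon intervals on the same chromosome.
    Returns None when no pair exceeds min_span.
    """
    union = merge_intervals(exons)
    starts = [s for s, _ in union]

    def hits_exon(start, end):
        # Last merged exon that starts before the read ends.
        i = bisect_right(starts, end - 1) - 1
        return i >= 0 and union[i][1] > start

    long_pairs = [p for p in pairs if p[3] - p[0] > min_span]
    if not long_pairs:
        return None
    both_exonic = sum(
        1 for s1, e1, s2, e2 in long_pairs
        if hits_exon(s1, e1) and hits_exon(s2, e2)
    )
    return both_exonic / len(long_pairs)

# Toy data: two exons and three read pairs (one too short to count).
toy_exons = [(100, 200), (1000, 1100)]
toy_pairs = [
    (150, 200, 1000, 1050),  # long span, both mates exonic
    (400, 450, 900, 950),    # long span, both mates intronic
    (120, 170, 180, 230),    # short span, ignored
]
toy_fraction = exonic_fraction_of_long_pairs(toy_pairs, toy_exons)
```

A high fraction among long-span pairs, as reported above, would suggest that the non-exonic pairs are mostly unspliced.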

rna-seq • 20k views
Modified 13 months ago by Michael Dondrup (41k) • written 5.1 years ago by lh3 (29k)

Hi, could you accept an answer here, please.

Reply written 13 months ago by Michael Dondrup (41k)
20
Larry_Parnell (15k), Boston, MA USA, wrote 5.1 years ago:

Two simple reasons:

1) Genomic DNA has contaminated the RNA-Seq sample, likely at the mRNA isolation step. This would look like sequence data from both strands of the intron.

2) There is unspliced mRNA in the sample. This would give data for the strand that encodes the gene, but within that intron.

There could be other explanations, but these are the two principal ones that come to mind.
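The strand argument in the two points above suggests a crude way to separate the two sources when the library is strand-specific. The sketch below is only an illustration under stated assumptions: the library preserves strand, genomic DNA contributes equally to both strands, and there is no genuine antisense transcription in the intron. The function name and inputs are hypothetical.

```python
def split_intronic_reads_by_strand(sense, antisense):
    """Split intronic read counts into DNA-like and pre-mRNA-like parts.

    sense:     intronic reads on the strand that encodes the gene
    antisense: intronic reads on the opposite strand
    Assumes DNA contamination is strand-symmetric, so it contributes as many
    sense reads as antisense reads; the sense excess is attributed to
    unspliced RNA.
    """
    if sense < 0 or antisense < 0:
        raise ValueError("read counts must be non-negative")
    total = sense + antisense
    dna_like = min(2 * antisense, total)
    pre_mrna_like = total - dna_like
    return dna_like, pre_mrna_like

# Example: 1000 sense and 100 antisense intronic reads.
dna_like, pre_mrna_like = split_intronic_reads_by_strand(1000, 100)
```

With an unstranded protocol this split is not possible, and antisense ncRNAs would inflate the DNA-like estimate.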

Written 5.1 years ago by Larry_Parnell (15k)
5

Agreed. These are the most likely explanations. The original post mentions a "large amount of intronic read hits" but does not quantify it. In polyA-selected RNA-seq libraries from cell line samples (i.e., unlimited high-quality template) we tend to see ~2% of all reads aligning to intronic regions and another ~1.5% aligning to intergenic regions. However, depending on sample quality, library construction method, etc., you can sometimes see numbers as high as 30%-40% and occasionally even worse. My impression is that this is mostly driven by contamination by genomic DNA and unprocessed RNA. However, a minority are also due to unannotated exons and genuine splicing events.

Reply written 5.1 years ago by Obi Griffith (15k)

3.5% is quite impressive given that 20% of bases from the data I am playing with (ERR030882, prepared with oligo-dT) are not mapped to the union of all GENCODE exons, including pseudogenes and lincRNAs. Have your data been published, or do you know of any published data of high quality? I may play with your high-quality data first to see what I should expect. Thanks.

Reply modified 5.1 years ago • written 5.1 years ago by lh3 (29k)

When I said that 3.5% of total reads mapped to intron/intergenic regions, I did not mean to imply that the remainder mapped to known/annotated exons/transcripts. An additional ~9% were considered either duplicate, low-complexity, low-quality or repeat sequences and were filtered out. A whopping ~25% did not map at all. Only ~60% of total reads mapped to known transcripts, with ~2-3% of reads mapping to novel (previously unannotated) junctions or boundaries. If we only consider reads that pass quality filters and were mapped somewhere, then 90.7% went to known transcripts (exons or junctions), 3.2% to introns, 2.4% to intergenic regions, and 3.7% to novel junctions or boundaries. All these numbers are from a set of 67 breast cancer cell lines, sequenced on Illumina GAII, and processed with Alexa-seq. Unfortunately not published yet.
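The step from shares of all reads to shares of mapped, filter-passing reads is a simple renormalization. The sketch below redoes it with the rounded figures quoted in this thread; taking 2.5% as the midpoint of "~2-3%" is an assumption, so the output only approximates the exact percentages given above.

```python
def renormalize(shares):
    """Rescale a dict of shares so that the values sum to 100%."""
    total = sum(shares.values())
    if total <= 0:
        raise ValueError("shares must sum to a positive number")
    return {name: 100.0 * value / total for name, value in shares.items()}

# Approximate shares of ALL reads (rounded values from the thread; 2.5 is an
# assumed midpoint for the "~2-3%" novel junction/boundary class).
mapped_and_kept = {
    "known transcripts": 60.0,
    "novel junctions/boundaries": 2.5,
    "intronic": 2.0,
    "intergenic": 1.5,
}

of_mapped = renormalize(mapped_and_kept)
for name, pct in of_mapped.items():
    print(f"{name}: {pct:.1f}% of mapped reads")
```

With these rounded inputs the result is roughly 91% known, 3.8% novel, 3.0% intronic and 2.3% intergenic, within about half a percentage point of the figures quoted above.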

Reply written 5.1 years ago by Obi Griffith (15k)

I agree. The first point is a common QC step for RNA-seq analysis as well.

Reply written 5.1 years ago by Bioinfosm610

Thanks! Yes, these are common QC steps, as is looking for (but hopefully not finding much) rRNA/rDNA in the data.

Reply written 5.1 years ago by Larry_Parnell (15k)

Presumably, we would expect even read depth across all non-exonic regions if there is DNA contamination, and a stable intronic:exonic read depth ratio for each gene if there is pre-mRNA contamination. We should be able to get a rough estimate of both DNA and pre-mRNA contamination rates. Are there tools to do this?
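The reasoning above translates into a back-of-the-envelope estimate. The sketch below is not an existing tool; it assumes a simple additive model in which DNA contamination adds the same depth everywhere (measured in intergenic regions), exon depth is background plus mature mRNA plus pre-mRNA, and intron depth is background plus pre-mRNA. The function name and inputs are hypothetical, and the uniform-background assumption may not hold for every library prep.

```python
def estimate_contamination(exon_depth, intron_depth, intergenic_depth):
    """Estimate DNA background and the unspliced fraction for one gene.

    All inputs are mean per-base read depths.
    Returns (dna_background, unspliced_fraction); the fraction is None when
    the exon depth does not rise above the background.
    """
    dna_background = intergenic_depth
    exon_signal = exon_depth - dna_background
    # Intron depth below background is treated as zero pre-mRNA signal.
    intron_signal = max(intron_depth - dna_background, 0.0)
    if exon_signal <= 0:
        return dna_background, None
    return dna_background, intron_signal / exon_signal

# Example: exon depth 100x, intron depth 12x, intergenic depth 2x.
background, unspliced = estimate_contamination(100.0, 12.0, 2.0)
```

In this example the background is 2x and about 10% of the RNA-derived exon signal would be attributed to unspliced transcripts. If the per-gene ratio is roughly stable across genes, that would support the pre-mRNA explanation.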

Reply written 5.1 years ago by lh3 (29k)

Not necessarily even read-depth across the genome in this case. If there is a restriction enzyme step involved in the mRNA -> cDNA -> cloning steps (library prep), there could be some genomic regions lacking that enzyme recognition site. Even if they contaminated the library prep, they wouldn't be cut and wouldn't be cloned/sequenced.

Reply written 5.1 years ago by Larry_Parnell (15k)

I see. Thanks for the explanation.

Reply written 5.1 years ago by lh3 (29k)

There is another possibility, which I don't see mentioned here, which is that the sequences may arise from stable lariat RNA. Stable lariats have been observed in polyA minus fractions of RNA, and could presumably be observed in incompletely selected polyA+ RNA.

Reply written 4.5 years ago by Wjeck480

Is there a good method to remove the genome contamination and pre-mRNA?

Reply modified 3.9 years ago • written 3.9 years ago by camelbbs580
16
Michael Dondrup (41k), Bergen, Norway, wrote 5.1 years ago:

This has been described in the literature, and it seems to be widely accepted as a matter of fact that non-exonic, or intronic, transcripts are prevalent. A hypothesis to explain the prevalence is that they harbour functional non-coding RNA. See e.g. Kapranov et al. (2011):

By RNA mass in a human cell, transcripts emanating from intronic sequences approximately equal that of exonic sequences but this large amount of intronic sequence cannot be explained just by the fact that introns are longer and, thus, accumulate more reads. The density of reads from individual introns can be quite abundant and similar to, or higher than, that of exonic regions. This is exemplified by the known ncRNA KCNQ1OT1 embedded within the protein-coding KCNQ1 locus and transcribed from the opposite strand, indicating it is not simply a splicing artifact (Figure 3). Additional examples in loci not currently known to harbour ncRNAs are shown on Figure 4b.

Also, the tissue-specific transcription of ncRNA seems to be in line with what is described in the literature.

Written 5.1 years ago by Michael Dondrup (41k)
1

I thought about this (actually this was the first thing I thought of; I would love to see it explain the observation for the gene I am interested in). Nonetheless, for my data, there are so many intronic reads that I can hardly believe they are dominated by something biological.

Reply modified 5.1 years ago • written 5.1 years ago by lh3 (29k)
13
Mikael Huss (4.4k), Stockholm, wrote 5.1 years ago:

There is at least this paper by Ameur et al., which addresses this issue. They show that part of the intronic alignments reflect nascent transcription and co-transcriptional splicing.

Edit: It will make a difference whether you use poly-A selection or ribosomal RNA depletion, as discussed in the paper.

Modified 5.1 years ago • written 5.1 years ago by Mikael Huss (4.4k)

I'm studying splicing using RNA-seq... so I HAVE TO read this paper... THANKS!

Reply written 5.1 years ago by Geparada (1.2k)

Thanks a lot. This is just the paper I was looking for! My data were produced using oligo-dT. The fraction of exonic hits is quite similar to the one described in the paper (~80% in chimp adult brain). I still need to check whether DNA/pre-mRNA contamination or nascent RNAs are the leading cause of the remaining 20%.

Reply modified 5.1 years ago • written 5.1 years ago by lh3 (29k)

Hi Heng, did you find a way to check whether DNA/pre-mRNA contamination or nascent RNAs are the leading cause of the intergenic mapped reads? I am facing the same problem of explaining the non-exonic mappers. Thanks.

Reply written 2.3 years ago by Xianjun190
8
adam.ugc (80) wrote 5.1 years ago:

I'm the first author of the NSMB paper mentioned by Mikael Huss above (Ameur et al. 2011) and I tried to start a seqanswers thread on this topic a while ago, but it never got going (see http://seqanswers.com/forums/showthread.php?t=15296). Basically, our results suggest that most of the intronic reads in total RNA from human brain come from nascent RNAs, i.e. genes that are being transcribed but where the polymerase has not yet reached the end of the gene. This also explains why longer introns have higher RNA-seq coverage compared to shorter introns. I would be happy if people are interested in discussing this topic further in this forum or at seqanswers. Personally, I think these intronic reads are really exciting and that they can be of great importance for the analysis and interpretation of RNA-seq data.
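One way to see why nascent transcription would favour long introns is a toy model, sketched below. It is not taken from the paper and rests on strong assumptions: polymerases are spread uniformly along the intron, elongation is constant, and an intron is spliced out and degraded the instant the polymerase passes its 3' end. Under these assumptions a position x inside an intron of length L is present in every nascent transcript whose polymerase currently sits between x and L.

```python
def nascent_depth_profile(intron_len, pol_per_bp=1.0):
    """Depth contributed by nascent transcripts at each intron position.

    pol_per_bp is a hypothetical uniform polymerase density. Position x is
    covered by all polymerases located downstream of x within the intron,
    i.e. pol_per_bp * (intron_len - x) transcripts.
    """
    return [pol_per_bp * (intron_len - x) for x in range(intron_len)]

def mean_depth(profile):
    """Average depth over a profile."""
    return sum(profile) / len(profile)

short_intron = nascent_depth_profile(1_000)
long_intron = nascent_depth_profile(10_000)

# Mean depth grows with intron length (about L / 2 in this model), and the
# depth within an intron declines from its 5' end to its 3' end.
print(mean_depth(short_intron), mean_depth(long_intron))
```

In this model the mean depth is proportional to intron length, so a ten-fold longer intron has roughly ten-fold higher average coverage from nascent RNA alone.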

Written 5.1 years ago by adam.ugc (80)

Your observation that "longer introns have higher RNA-seq coverage compared to shorter introns" implies that mRNA splicing occurs on a per intron basis. I don't doubt that, but just would like to confirm that is true, compared to a full-length pre-mRNA undergoing all splicing once transcript synthesis is complete. Plucking out introns one at a time brings up the possibility of different forms of regulation (intron1 done differently than another intron) as well as differential compartmentalization and post-processing (eg, microRNAs). I agree - exciting. Introns may not always be waste, but input.

Reply written 5.1 years ago by Larry_Parnell (15k)
1

We have validated by PCR, for about 10 genes in brain and liver, that introns are spliced soon after they have been transcribed. Also, based on our global analyses of intronic RNA-seq coverage it seems like co-transcriptional splicing is a very common event, at least in our samples. So maybe it could be that co-transcriptional splicing is the rule and post-transcriptional splicing is the exception...

Reply written 5.1 years ago by adam.ugc (80)

Thank you for your example.

Reply written 5.1 years ago by Larry_Parnell (15k)
5
John St. John (970), San Francisco, CA, Cancer Therapeutics Innovation Group, wrote 5.1 years ago:

Check out this thread on SeqAnswers, it might provide some insight: http://seqanswers.com/forums/showthread.php?t=5519

What organism are you mapping to? Are you pretty confident in the gene model? From the seqanswers forum it sounds like you aren't the only person experiencing this issue.

Written 5.1 years ago by John St. John (970)

The link is very useful. Thanks. I am looking at human data and pooling all GENCODE exons. I believe the annotation should be relatively complete. I am sure most people who have looked into RNA-seq data will have had my question to some extent at some stage.

Reply written 5.1 years ago by lh3 (29k)
3
Wen.Huang (1.1k) wrote 5.1 years ago:

I was part of the SeqAnswers discussion John St. John mentioned, and I still believe that unspliced pre-mRNA is a substantial, if not the leading, factor. Many of the ncRNAs are expressed at very low levels; it is true that they show up here and there in the genome, but they won't account for a large fraction. Imagine that introns are about 20 times as long as exons in general. Then even a 5% pre-mRNA contamination can give you as many intronic reads as exonic reads! Of course, even if pre-mRNAs do exist, they are usually either not completely polyadenylated, partially spliced or partially degraded, so depending on the library prep protocol they don't show up in the final sequenced library that often. But exons and introns differ so much in length that even a very small carry-over would have a big effect.
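The arithmetic behind the 5% figure can be written out explicitly. The sketch below assumes that read counts are proportional to sequence length, that every transcript (spliced or not) contributes its exonic sequence, and that only unspliced transcripts contribute intronic sequence. The 20:1 intron-to-exon length ratio is the round number used above; the function name is hypothetical.

```python
def intronic_to_exonic_read_ratio(pre_mrna_fraction, intron_to_exon_length=20.0):
    """Expected ratio of intronic to exonic reads for one gene.

    pre_mrna_fraction: fraction of the gene's transcripts that are unspliced.
    intron_to_exon_length: total intron length divided by total exon length.
    """
    if not 0.0 <= pre_mrna_fraction <= 1.0:
        raise ValueError("pre_mrna_fraction must be between 0 and 1")
    exonic = 1.0  # every transcript carries the exons (in units of exon length)
    intronic = pre_mrna_fraction * intron_to_exon_length
    return intronic / exonic

# 5% unspliced transcripts with introns 20x as long as exons.
ratio = intronic_to_exonic_read_ratio(0.05)
```

Here the ratio comes out at 1, i.e. as many intronic as exonic reads; at 1% carry-over it would still be 0.2.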

Written 5.1 years ago by Wen.Huang (1.1k)
2
Eric Fournier (1.4k), Quebec, Canada, wrote 5.1 years ago:

I do not know of papers addressing this issue directly, but intronic reads could be the result of incomplete splicing (the intron was not removed from the transcript), or they could simply come from introns that were spliced out and then captured and sequenced before the cell could degrade them.

Written 5.1 years ago by Eric Fournier (1.4k)
1
pd3 (300) wrote 3.9 years ago:

This MIT news article gives a hint about one possible mechanism. Apparently DNA transcription initially starts in both directions, of which one is aborted at some point. A link to the paper: doi:10.1038/nature12349

Written 3.9 years ago by pd3 (300)