Why Are There Many RNA-Seq Hits To Intronic Regions?
8
56
12.0 years ago
lh3 33k

I am looking at RNA-seq data, with which I have little experience. I notice that for many genes, there are reliable alignments (i.e. with high mapping quality) to introns. I understand that some of them are due to unannotated transcripts, but in many regions, this does not seem to be the major cause. The intronic read hits do not seem to be caused purely by alignment artifacts, either, because the pattern is tissue-specific (though this is not compelling evidence). Another possible explanation is that this observation is due to noisy transcripts (Pickrell et al., 2010), but this seems to be a big effect: for some long genes, there are far more intronic hits than exonic hits.

I guess those who study RNA-seq data must have noticed the intronic hits for years. What is the cause of the large amount of intronic read hits? Is it caused by alignment/library prep artifacts or noisy transcription? Are there papers addressing this? Thanks.

EDIT: my conclusion. I was looking at ERR030882 from Illumina BodyMap (brain). The sample was processed with oligo-dT. I am using the gencode exon annotations, including all the pseudogenes, lincRNA and known processed transcripts, totalling ~112Mbp. The initial analysis reveals ~80% of bases mapped to exons. Nonetheless, if I only look at read pairs with insert size larger than 311bp (~10% of the original data), 98.2% of these spliced read pairs are mapped to known exons, suggesting that the vast majority of the intronic and intergenic read pairs are unspliced. It is possible that some unspliced pairs come from unknown single-exon transcripts with an intact polyA tail, but contamination seems to be the leading cause overall.
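For readers who want to reproduce a number like the "~80% of bases mapped to exons" above, here is a minimal sketch of one way such a fraction could be computed. It is not the code used for the analysis in this post. It assumes a single chromosome, half-open `(start, end)` coordinates, and that aligned blocks have already been extracted from the alignments; the function names are hypothetical.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Extend the previous interval instead of starting a new one.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def overlap_bases(block, merged_exons):
    """Number of bases of one aligned block that fall inside the merged exons."""
    b_start, b_end = block
    total = 0
    for e_start, e_end in merged_exons:
        total += max(0, min(b_end, e_end) - max(b_start, e_start))
    return total


def exonic_base_fraction(aligned_blocks, exons):
    """Fraction of aligned bases that overlap the union of all exons."""
    merged = merge_intervals(exons)
    aligned = sum(end - start for start, end in aligned_blocks)
    exonic = sum(overlap_bases(block, merged) for block in aligned_blocks)
    return exonic / aligned if aligned else 0.0


# Toy data, assumed for illustration only.
exons = [(100, 200), (150, 250), (400, 500)]
blocks = [(180, 230), (240, 290), (300, 350)]
print(exonic_base_fraction(blocks, exons))
```

The linear scan over exons is fine for a toy example; for a whole-genome annotation an interval tree or a sorted sweep would be the more realistic choice.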

rna-seq • 44k views
ADD COMMENT
0

Hi, could you accept an answer here, please.

ADD REPLY
30
12.0 years ago

Two simple reasons:

1) Genomic DNA has contaminated the RNA-Seq sample, likely at the mRNA isolation step. This would look like sequence data from both strands of the intron.

2) There is unspliced mRNA in the sample. This would give data for the strand that encodes the gene, but within that intron.

There could be other explanations, but these are the two principal ones that come to mind.
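The strand argument in the two points above can be turned into a quick check. The following is only a sketch: it assumes a strand-specific library, that each intronic read has already been assigned the strand of the molecule it came from, and an arbitrary tolerance of 0.1. The function names and the wording of the returned labels are hypothetical.

```python
def antisense_fraction(read_strands, gene_strand):
    """Fraction of intronic reads on the strand opposite to the gene."""
    if gene_strand not in ("+", "-"):
        raise ValueError("gene_strand must be '+' or '-'")
    if not read_strands:
        return None
    antisense = sum(1 for strand in read_strands if strand != gene_strand)
    return antisense / len(read_strands)


def likely_source(read_strands, gene_strand, tolerance=0.1):
    """Crude label for the intronic reads of one gene (tolerance is an assumption)."""
    frac = antisense_fraction(read_strands, gene_strand)
    if frac is None:
        return "no intronic reads"
    if abs(frac - 0.5) <= tolerance:
        # Roughly equal signal on both strands, as expected for genomic DNA.
        return "both strands: consistent with genomic DNA"
    if frac <= tolerance:
        # Almost everything on the gene's own strand, as expected for RNA.
        return "sense strand: consistent with unspliced RNA"
    return "ambiguous"


print(likely_source(["+"] * 50 + ["-"] * 50, "+"))
print(likely_source(["+"] * 98 + ["-"] * 2, "+"))
```

With an unstranded library this check is not possible, because reads from either source appear on both strands.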

ADD COMMENT
6

Agreed. These are the most likely explanations. The original post mentions a "large amount of intronic read hits" but does not quantify it. In polyA-selected RNA-seq libraries from cell line samples (i.e., unlimited high-quality template) we tend to see ~2% of all reads aligning to intronic regions and another ~1.5% aligning to intergenic regions. However, depending on sample quality, library construction method, etc., you can sometimes see numbers as high as 30%-40% and occasionally even worse. My impression is that this is mostly driven by contamination by genomic DNA and unprocessed RNA. However, a minority are also due to unannotated exons and genuine splicing events.

ADD REPLY
0

3.5% is quite impressive given that 20% of bases from the data I am playing with (ERR030882, prepared with oligo-dT) are not mapped to the union of all gencode exons, including pseudogenes and lincRNA. Have your data been published, or do you know of any published data that are of high quality? I may play with your high-quality data first to see what I should expect. Thanks.

ADD REPLY
0

When I said that 3.5% of total reads mapped to intron/intergenic regions, I did not mean to imply that the remainder mapped to known/annotated exons/transcripts. An additional ~9% were considered either duplicate, low complexity, low quality or repeat sequences and were filtered out. A whopping ~25% did not map at all. Only ~60% of total reads mapped to known transcripts, with ~2-3% of reads mapping to novel (previously unannotated) junctions or boundaries. If we only consider reads that pass quality filters and were mapped somewhere, then 90.7% went to known transcripts (exons or junctions), 3.2% to introns, 2.4% to intergenic regions, and 3.7% to novel junctions or boundaries. All these numbers are from a set of 67 breast cancer cell lines, sequenced on Illumina GAII, and processed with Alexa-seq. Unfortunately not published yet.

ADD REPLY
1

Presumably, we would expect even read depth across all non-exonic regions if there is DNA contamination, and a stable intronic:exonic read depth ratio for each gene if there is pre-mRNA contamination. We should be able to get a rough estimate of both DNA and pre-mRNA contamination rates. Are there tools to do this?
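No tool is named in this thread, but the model described above is simple enough to write down. The sketch below assumes that genomic DNA adds a uniform depth everywhere (measured as the intergenic depth), that pre-mRNA adds a further uniform depth across the introns and exons of a gene, and that mature mRNA adds depth on exons only. It ignores nascent transcription, unannotated transcripts and mappability, so treat the output as a rough guide at best. The function name and dictionary keys are hypothetical.

```python
def estimate_contamination(exon_depth, intron_depth, intergenic_depth):
    """Rough per-gene estimate under a three-component additive depth model.

    Assumed model (not a validated method):
      intergenic depth = g          (genomic DNA only)
      intron depth     = g + p      (genomic DNA + pre-mRNA)
      exon depth       = g + p + m  (genomic DNA + pre-mRNA + mature mRNA)
    """
    g = intergenic_depth
    p = max(0.0, intron_depth - g)           # depth attributed to pre-mRNA
    m = max(0.0, exon_depth - intron_depth)  # depth attributed to mature mRNA
    rna = p + m
    return {
        "gdna_depth": g,
        "premrna_fraction": p / rna if rna > 0 else None,
    }


# Toy depths, assumed for illustration only.
print(estimate_contamination(exon_depth=100.0, intron_depth=7.0, intergenic_depth=2.0))
```

The pre-mRNA fraction here is the share of a gene's transcript copies that still carry introns, which is the quantity that should be stable across genes if pre-mRNA carry-over is the cause.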

ADD REPLY
1

Not necessarily even read-depth across the genome in this case. If there is a restriction enzyme step involved in the mRNA -> cDNA -> cloning steps (library prep), there could be some genomic regions lacking that enzyme recognition site. Even if they contaminated the library prep, they wouldn't be cut and wouldn't be cloned/sequenced.

ADD REPLY
0

I see. Thanks for the explanation.

ADD REPLY
0

I agree. The first point is a common QC step for RNA-seq analysis as well.

ADD REPLY
0

Thanks! Yes, these are common QC steps, plus looking for (but hopefully not finding much) rRNA/rDNA in the data.

ADD REPLY
0

There is another possibility, which I don't see mentioned here, which is that the sequences may arise from stable lariat RNA. Stable lariats have been observed in polyA minus fractions of RNA, and could presumably be observed in incompletely selected polyA+ RNA.

ADD REPLY
0

Is there a good method to remove the genome contamination and pre-mRNA?

ADD REPLY
16
12.0 years ago
Michael 54k

This has been described in the literature, and it seems to be widely accepted as a matter of fact that non-exonic, or intronic, transcripts are prevalent. A hypothesis to explain the prevalence is that they harbour functional non-coding RNA. See, e.g., Kapranov et al. (2011):

By RNA mass in a human cell, transcripts emanating from intronic sequences approximately equal that of exonic sequences but this large amount of intronic sequence cannot be explained just by the fact that introns are longer and, thus, accumulate more reads. The density of reads from individual introns can be quite abundant and similar to, or higher than, that of exonic regions. This is exemplified by the known ncRNA KCNQ1OT1 embedded within the protein-coding KCNQ1 locus and transcribed from the opposite strand, indicating it is not simply a splicing artifact (Figure 3). Additional examples in loci not currently known to harbour ncRNAs are shown on Figure 4b.

Also, the tissue-specific transcription of ncRNA seems to be in line with what is described in the literature.

ADD COMMENT
1

I thought about this (actually this is the first thing I thought of; I would love to see this explain the observation for the gene I am interested in). Nonetheless, for my data, there are so many intronic reads that I can hardly believe it is dominated by something biological.

ADD REPLY
14
12.0 years ago

There is at least this paper by Ameur et al., which addresses this issue. They show that part of the intronic alignments reflect nascent transcription and co-transcriptional splicing.

Edit: It will make a difference whether you use poly-A selection or ribosomal RNA depletion, as discussed in the paper.

ADD COMMENT
0

I'm studying splicing using RNA-seq... so I HAVE TO read this paper... THANKS!

ADD REPLY
0

Thanks a lot. This is just the paper I was looking for! My data were produced using oligo-dT. The fraction of exonic hits is quite similar to the one described in the paper (~80% in chimp adult brain). I still need to check whether DNA/pre-mRNA contamination or nascent RNAs are the leading cause of the remaining 20%.

ADD REPLY
0

Hi Heng, did you find a way to check whether DNA/pre-mRNA contamination or nascent RNAs are the leading cause of the intergenic mapped reads? I am facing the same problem in explaining the non-exonic mapped reads. Thanks.

ADD REPLY
9
12.0 years ago
adam.ugc ▴ 90

I'm the first author of the NSMB paper mentioned by Mikael Huss above (Ameur et al 2011) and I tried to start a seqanswers thread on this topic a while ago, but it never got going (see http://seqanswers.com/forums/showthread.php?t=15296). Basically, our results suggest that most of the intronic reads in total RNA from human brain come from nascent RNAs, i.e. genes that are being transcribed but where the polymerase has not yet reached the end of the gene. This also explains why longer introns have higher RNA-seq coverage compared to shorter introns. I would be happy if people are interested in discussing this topic further in this forum or at seqanswers. Personally, I think these intronic reads are really exciting and that they can be of great importance for the analysis and interpretation of RNA-seq data.
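To see why nascent transcription would favour long introns, consider a toy model (an illustration only, not the analysis from the paper). Assume polymerases are spread evenly along the gene and that an intron is removed as soon as the polymerase reaches its 3' end. A base at offset x inside an intron of length L is then covered only by transcripts whose polymerase sits between x and L, so coverage falls linearly from the 5' end to zero at the 3' end, and the mean coverage is proportional to L/2. The function name and the `polymerases_per_kb` parameter are hypothetical.

```python
def mean_nascent_intron_coverage(intron_length, polymerases_per_kb):
    """Mean per-base intron coverage from nascent transcripts in a toy model.

    Assumptions: uniform polymerase density along the gene, and the intron
    is spliced out the moment the polymerase passes its 3' end.
    """
    density = polymerases_per_kb / 1000.0  # polymerases per base
    # Coverage at offset x is density * (L - x); its average over
    # x in [0, L] is density * L / 2.
    return density * intron_length / 2.0


# Two introns at the same polymerase density (toy values).
for length in (1_000, 10_000):
    print(length, mean_nascent_intron_coverage(length, polymerases_per_kb=1.0))
```

Under these assumptions a ten times longer intron has ten times the mean coverage, even though the transcription rate is identical.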

ADD COMMENT
0

Your observation that "longer introns have higher RNA-seq coverage compared to shorter introns" implies that mRNA splicing occurs on a per intron basis. I don't doubt that, but just would like to confirm that is true, compared to a full-length pre-mRNA undergoing all splicing once transcript synthesis is complete. Plucking out introns one at a time brings up the possibility of different forms of regulation (intron1 done differently than another intron) as well as differential compartmentalization and post-processing (eg, microRNAs). I agree - exciting. Introns may not always be waste, but input.

ADD REPLY
1

We have validated by PCR, for about 10 genes in brain and liver, that introns are spliced soon after they have been transcribed. Also, based on our global analyses of intronic RNA-seq coverage it seems like co-transcriptional splicing is a very common event, at least in our samples. So maybe it could be that co-transcriptional splicing is the rule and post-transcriptional splicing is the exception...

ADD REPLY
0

Thank you for your example.

ADD REPLY
6
12.0 years ago
John St. John ★ 1.2k

Check out this thread on SeqAnswers, it might provide some insight: http://seqanswers.com/forums/showthread.php?t=5519

What organism are you mapping to? Are you pretty confident in the gene model? From the seqanswers forum it sounds like you aren't the only person experiencing this issue.

ADD COMMENT
0

The link is very useful. Thanks. I am looking at human data and pooling all gencode exons. I believe the annotation should be relatively complete. I am sure most who have looked into RNA-seq data will have had my question to some extent at some stage.

ADD REPLY
3
12.0 years ago
Wen.Huang ★ 1.2k

I was part of the SeqAnswers discussion John St. John mentioned, and I still believe that unspliced pre-mRNA is a substantial, if not the leading, factor. Many of the ncRNAs are expressed at very low levels; it is true that they show up here and there in the genome, but they won't account for a large fraction. Imagine that introns are about 20 times as long as exons in general. Then even a 5% pre-mRNA contamination can give you as many intronic reads as exonic reads! Of course, even if pre-mRNAs do exist, they are usually either incompletely polyadenylated, partially spliced, or partially degraded, so depending on the library prep protocol they don't show up in the final sequenced library that often. But exons and introns have such a large difference in length that even a very small carry-over would have a big effect.
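The arithmetic behind the 5% figure can be spelled out. This is a back-of-the-envelope sketch that assumes reads are sampled uniformly along each molecule in proportion to its length, and that every transcript copy, spliced or not, contributes its exons; the function name is hypothetical.

```python
def intronic_to_exonic_ratio(premrna_fraction, intron_to_exon_length):
    """Expected ratio of intronic to exonic reads for one gene.

    premrna_fraction: share of transcript copies that are still unspliced.
    intron_to_exon_length: total intron length divided by total exon length.
    """
    # Exon length is the unit; every copy contributes its exons.
    exonic = 1.0
    # Only the unspliced copies contribute intronic sequence.
    intronic = premrna_fraction * intron_to_exon_length
    return intronic / exonic


# 5% unspliced copies, introns 20 times as long as exons.
print(intronic_to_exonic_ratio(0.05, 20))
```

With these inputs the ratio comes out at about 1, i.e. as many intronic reads as exonic reads, which matches the claim above.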

ADD COMMENT
2
12.0 years ago
Eric Fournier ★ 1.4k

I do not know of papers addressing this issue directly, but intronic reads could be the result of incomplete splicing (the intron was not removed from the transcript) or could simply come from introns that were spliced out and then captured and sequenced before the cell could degrade them.

ADD COMMENT
1
10.8 years ago
pd3 ▴ 350

This MIT news article gives a hint about one possible mechanism. Apparently DNA transcription initially starts in both directions, of which one is aborted at some point. A link to the paper: doi:10.1038/nature12349

ADD COMMENT
