Question: Reliability of gene expression with low exonic counts in bulk RNA-seq data
3 months ago by
United States
pm201290 wrote:


I am working with a bulk RNA-seq dataset with a very high % of intronic counts (40-60% of mapped reads), which I believe come from the pre-mRNA fraction and maybe, to a lesser extent, from intron retention in some transcripts. I am a bit concerned because the exonic counts for some samples are only about 1 million mapped reads. The tools I am using to analyze the data only consider exonic reads when calculating gene expression, so I am wondering how accurate my gene expression estimates would be. Although the depth at which I sequenced the samples was good for my purpose (~15-20 million reads/sample), the low % of reads mapped to exonic regions makes me a bit concerned. As I understand it, the genes with high expression would still be OK in this case, correct? It is only the ones with low expression levels that are problematic. Any insights would be highly appreciated.

modified 3 months ago by i.sudbery9.8k • written 3 months ago by pm201290

May I ask how the library prep and alignment were performed?

written 3 months ago by newbio17310

The sample prep method is a novel method currently being developed. I used STAR for alignment.

written 3 months ago by pm201290

When you refer to a high % of intronic counts etc., are you getting the numbers from STAR? If that's the case and you're seeing a low % of uniquely mapped reads and a high % of multi-mapping reads, then I would check for the presence of rRNA. Examining the library prep protocol and the samples themselves may provide more insight on this.

In terms of accuracy in measuring gene expression, I'm not sure you will be able to determine accuracy even with more reads, considering that RNA-Seq captures expression levels at a specific time point and these could change depending on various factors (any biological/technical replicates?). I would say genes with more coverage should be okay for downstream analyses, assuming the uniquely mapped reads aren't spread too thin.

modified 3 months ago • written 3 months ago by newbio17310

Hi, the QC counts were generated using Qualimap. One of the goals of our study is to understand how our novel method compares to a standard RNA-seq sample prep method. The standard method, applied to the same samples, gives far fewer intronic reads. Yes, I understand the limitations of RNA-seq. However, I wanted to get some insight into whether sequencing the library more deeply (so that we get more exonic mapped reads) would help in this case.

modified 3 months ago • written 3 months ago by pm201290
3 months ago by
Carlo Yague5.2k
Carlo Yague5.2k wrote:

very high % of intronic counts (40-60% of mapped reads), which I believe come from pre-mRNA fraction

This is very unlikely with bulk RNA, unless you are not doing RNA-seq per se but some kind of nascent RNA-seq (NET-seq, for instance). I think this is more likely to come from genomic DNA contamination. Another possibility would be that the library preparation was not strand-specific. I would advise looking at the mapped reads in a genome browser such as IGV, as it will tell you where the intronic reads come from. In the case of DNA contamination, there will also be a high level of extragenic signal.
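One rough way to put a number on the "extragenic signal" argument, alongside looking in IGV, is to compare read density (reads per kb) in intergenic versus intronic territory. The sketch below is only an illustration: the function names, the read counts and the region sizes are all hypothetical placeholders, and no particular cutoff is implied. The idea is that a uniform background such as gDNA should give similar densities in both, whereas transcript-derived intronic reads should give a ratio well below 1.

```python
def reads_per_kb(read_count, region_bp):
    """Read density for a region class, in reads per kilobase."""
    if region_bp <= 0:
        raise ValueError("region_bp must be positive")
    return read_count / (region_bp / 1000.0)


def intergenic_to_intronic_ratio(intronic_reads, intronic_bp,
                                 intergenic_reads, intergenic_bp):
    """Ratio of intergenic to intronic read density.

    Near 1: reads look spread evenly over the genome (gDNA-like background).
    Well below 1: intronic reads are enriched, as expected from pre-mRNA.
    """
    intronic_density = reads_per_kb(intronic_reads, intronic_bp)
    if intronic_density == 0:
        raise ValueError("no intronic reads; ratio is undefined")
    return reads_per_kb(intergenic_reads, intergenic_bp) / intronic_density


# Hypothetical numbers for illustration only (not taken from the thread).
ratio = intergenic_to_intronic_ratio(
    intronic_reads=8_000_000, intronic_bp=1_000_000_000,
    intergenic_reads=500_000, intergenic_bp=1_500_000_000,
)
print(f"intergenic/intronic density ratio: {ratio:.3f}")
```

The region sizes would have to come from the same annotation used for counting, and with an unstranded library antisense transcription can still blur the picture.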

modified 3 months ago • written 3 months ago by Carlo Yague5.2k

Yes, genomic DNA contamination was our initial suspicion. However, the number of intergenic mapped reads was much smaller, so we ruled this out. We do not have a strand-specific library, so yes, we can't tell for certain whether it's true intronic signal or, say, signal from an overlapping lncRNA. I can definitely survey a few candidates in IGV and see where the reads are coming from.

written 3 months ago by pm201290
3 months ago by
Sheffield, UK
i.sudbery9.8k wrote:

40-60% is not unusual if this is total RNAseq rather than poly-A RNAseq; in fact, it's fairly average. Remember that introns make up 92% of transcribed sequence, so even a small amount of contamination in terms of numbers of transcripts means a large amount of contamination in terms of reads. Even in polyA-RNAseq, seeing 30-40% is not unusual.

However, it doesn't seem to me that intronic reads are really your problem. If you have 15 million reads and 60% map to introns, that should leave you with 6 million exonic reads, not 1 million.
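The arithmetic behind this can be written out as a small sketch. The function names are made up for illustration, and it assumes every sequenced read is mapped, which will not hold exactly in practice.

```python
def expected_exonic_reads(total_reads, intronic_fraction, intergenic_fraction=0.0):
    """Reads left over for exons once intronic and intergenic reads are removed."""
    non_exonic = intronic_fraction + intergenic_fraction
    if not 0.0 <= non_exonic <= 1.0:
        raise ValueError("fractions must sum to a value between 0 and 1")
    return total_reads * (1.0 - non_exonic)


def implied_non_exonic_fraction(total_reads, exonic_reads):
    """Fraction of reads that must be non-exonic to end up with `exonic_reads`."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return 1.0 - exonic_reads / total_reads


# 15 million reads with 60% intronic leaves about 6 million exonic reads.
print(f"{expected_exonic_reads(15_000_000, 0.60):,.0f}")

# Ending up with only 1 million exonic reads out of 15 million means
# roughly 93% of reads went somewhere other than exons.
print(f"{implied_non_exonic_fraction(15_000_000, 1_000_000):.0%}")
```

So a sample with about 1 million exonic reads is losing far more than 60% of its reads, which points to something besides introns (unmapped, multi-mapping or intergenic reads, for example).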

In terms of whether it will be okay: yes, it will be better for more highly expressed genes. But at 1 million reads, I think "high" means "very high".
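To give a feel for why "high" has to mean "very high" at this depth, here is a minimal sketch that treats read sampling as Poisson, so the coefficient of variation of a gene's count is 1/sqrt(expected count). This is an assumption that covers sampling noise only; biological variability and library effects add to it. The CPM values and function names are hypothetical.

```python
import math


def expected_count(cpm, exonic_reads):
    """Expected reads for a gene at a given counts-per-million expression level."""
    return cpm * exonic_reads / 1_000_000


def poisson_cv(mean_count):
    """Coefficient of variation under pure Poisson sampling noise."""
    if mean_count <= 0:
        raise ValueError("mean_count must be positive")
    return 1.0 / math.sqrt(mean_count)


# Compare the two depths mentioned in the thread for a few expression levels.
for exonic_reads in (1_000_000, 6_000_000):
    for cpm in (1, 10, 100):
        n = expected_count(cpm, exonic_reads)
        print(f"{exonic_reads:>9,} exonic reads, {cpm:>3} CPM: "
              f"~{n:.0f} reads, sampling CV ~{poisson_cv(n):.0%}")
```

Under these assumptions a gene at 10 CPM gets about 10 reads at 1 million exonic reads (sampling CV around 32%), versus about 60 reads at 6 million (around 13%).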

written 3 months ago by i.sudbery9.8k

The library we have is polyA. For one dataset, I still have 5-6 million exonic counts. So I think it should be ok.

The problem with ~1 million reads is from a different dataset, where I actually noticed high (and variable) intergenic reads. The % of intergenic reads seems quite high even when I include different non-coding RNA biotypes. In other datasets, I would get inflated intergenic reads when I used mRNAs only; however, the mapped read count dropped dramatically when I used coding + non-coding RNAs in my annotation.

So I am now wondering if gDNA contamination is an issue here as well. In your experience, what is a tolerable amount of gDNA contamination (ideally we would like none, but with this method there appears to be some level of contamination that we cannot totally get rid of)? Even though I only take exonic reads for gene expression, my results could still be affected. However, would this render my data totally unusable? Are there steps I could take to rescue the data?

written 3 months ago by pm201290