Question

Reliability of gene expression with low exonic counts in bulk RNA-seq data

0

3.7 years ago

pm2012 ▴ 140

Hi,

I am working with a bulk RNA-seq dataset with a very high % of intronic counts (40-60% of mapped reads), which I believe come from the pre-mRNA fraction and maybe, to a lesser extent, from intron retention in some transcripts. I am a bit concerned because the exonic counts for some samples are only about 1 million mapped reads. The tools I am using to analyze the data only consider exonic reads when calculating gene expression, so I am wondering how accurate my gene expression estimates would be. Although the sequencing depth was good for my purpose (~15-20 million reads/sample), the low % of reads mapped to exonic regions worries me. As I understand it, genes with high expression would still be OK in this case, correct? It is only the ones with low expression levels that are problematic. Any insights would be highly appreciated.

next-gen RNA-Seq gene_expression • 1.8k views

ADD COMMENT • link updated 3.7 years ago by i.sudbery 19k • written 3.7 years ago by pm2012 ▴ 140

1

May I ask how the library prep and alignment were performed?

ADD REPLY • link 3.7 years ago by newbio17 ▴ 360

0

The sample prep method is a novel method currently being developed. I used STAR for alignment.

ADD REPLY • link 3.7 years ago by pm2012 ▴ 140

0

When you refer to a high % of intronic counts etc., are you getting the numbers from STAR's Log.final.out? If that's the case and you're seeing a low % of uniquely mapped reads and a high % of multi-mapping reads, then I would check for the presence of rRNA. Examining the library prep protocol and the samples themselves may provide more insight on this.
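To pull those two percentages out of the log, a small parser is enough. The sketch below assumes the usual `name | value` layout of Log.final.out and the field names `Uniquely mapped reads %` and `% of reads mapped to multiple loci`. The excerpt and its numbers are made up for illustration, and `parse_star_log` and `percent` are hypothetical helper names.

```python
def parse_star_log(text):
    """Return {metric name: value string} from the text of a STAR Log.final.out."""
    stats = {}
    for line in text.splitlines():
        if "|" not in line:
            continue  # section headers such as "UNIQUE READS:" carry no value
        key, value = line.split("|", 1)
        stats[key.strip()] = value.strip()
    return stats


def percent(stats, key):
    """Convert a value such as '62.50%' to the float 62.5."""
    return float(stats[key].rstrip("%"))


# Made-up excerpt in the Log.final.out layout (assumption, not real data)
example = """\
                          Number of input reads |\t15000000
                                   UNIQUE READS:
                        Uniquely mapped reads % |\t62.50%
                             MULTI-MAPPING READS:
             % of reads mapped to multiple loci |\t30.25%
"""

stats = parse_star_log(example)
unique_pct = percent(stats, "Uniquely mapped reads %")
multi_pct = percent(stats, "% of reads mapped to multiple loci")
print(f"unique: {unique_pct}%  multi-mapped: {multi_pct}%")
```

For a real run, read the file with `open("Log.final.out").read()` and pass the text to `parse_star_log`.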

In terms of accuracy in measuring gene expression, I'm not sure you will be able to determine accuracy even with more reads, considering that RNA-Seq captures expression levels at a specific time point, and these can change depending on various factors (do you have any biological/technical replicates?). I would say genes with more coverage should be okay for downstream analyses, assuming the uniquely mapped reads aren't spread too thin.

ADD REPLY • link 3.7 years ago by newbio17 ▴ 360

0

Hi, the QC counts were generated using Qualimap (http://qualimap.bioinfo.cipf.es/). One of the goals of our study is to understand how our novel method compares to a standard RNA-seq sample prep method. The standard method applied to the same samples gives a much lower fraction of intronic reads. Yes, I understand the limitations of RNA-seq. However, I wanted to get some insight into whether sequencing the library more deeply (so that we get more exonic reads) would help in this case.

ADD REPLY • link 3.7 years ago by pm2012 ▴ 140

score 2 · Answer 1 · 2020-08-19

2

3.7 years ago

Carlo Yague 8.6k

very high % of intronic counts (40-60% of mapped reads), which I believe come from pre-mRNA fraction

This is very unlikely with bulk RNA, unless you are not doing RNA-seq per se but some kind of nascent RNA-seq (NET-seq for instance). I think this is more likely to come from genomic DNA contamination. Another possibility is that the library preparation was not strand-specific. I would advise looking at the mapped reads in a genome browser such as IGV, as it will show you where the intronic reads come from. In the case of DNA contamination, there will also be a high level of extragenic signal.

ADD COMMENT • link 3.7 years ago by Carlo Yague 8.6k

0

Yes, genomic DNA contamination was our initial suspicion. However, the number of intergenic mapped reads was much smaller, so we ruled this out. We do not have a strand-specific library, so yes, we can't tell for certain whether it's true intronic signal or comes from, say, an overlapping lncRNA. I can definitely survey a few candidates in IGV and see where the reads are coming from.

ADD REPLY • link 3.7 years ago by pm2012 ▴ 140

score 1 · Answer 2 · 2020-08-19

1

3.7 years ago

i.sudbery 19k

40-60% is not unusual if this is total RNA-seq rather than poly-A RNA-seq; in fact, it's fairly average. Remember that introns make up 92% of transcribed sequence, so even a small amount of contamination in terms of numbers of transcripts means a large amount of contamination in terms of reads. Even in poly-A RNA-seq, seeing 30-40% is not unusual.
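As a rough illustration of that point, assume reads are sampled in proportion to transcript length, that introns are 92% of the unspliced length, and that a fraction `p` of molecules is unspliced. This is a toy model (it ignores poly-A selection, coverage bias and intron-length differences between genes), and `intronic_read_fraction` is a hypothetical name.

```python
def intronic_read_fraction(unspliced_fraction, intron_share=0.92):
    """Fraction of reads expected to fall in introns under a toy model.

    Every molecule contributes its exonic sequence (1 - intron_share of the
    unspliced length); only unspliced molecules also contribute introns.
    """
    if not 0.0 <= unspliced_fraction <= 1.0:
        raise ValueError("unspliced_fraction must be between 0 and 1")
    exonic = 1.0 - intron_share                   # from all molecules
    intronic = unspliced_fraction * intron_share  # from unspliced molecules only
    return intronic / (exonic + intronic)


for p in (0.01, 0.05, 0.10):
    print(f"{p:.0%} unspliced molecules -> "
          f"{intronic_read_fraction(p):.0%} intronic reads")
```

Under these assumptions, a few percent of unspliced molecules is already enough to push the intronic read fraction into the tens of percent.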

However, it doesn't seem to me that intronic reads are really your problem. If you have 15 million reads and 60% map to introns, that should leave you with 6 million exonic reads, not 1 million.
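The arithmetic can be written out as below. `exonic_reads`, `implied_non_exonic_fraction` and `other_fraction` are hypothetical names; `other_fraction` stands for anything else that is not exonic (intergenic, unmapped and so on) and is an assumption added here.

```python
def exonic_reads(total_reads, intronic_fraction, other_fraction=0.0):
    """Reads left for exons once intronic and other non-exonic reads are removed."""
    return round(total_reads * (1.0 - intronic_fraction - other_fraction))


def implied_non_exonic_fraction(total_reads, observed_exonic_reads):
    """Fraction of reads that must be non-exonic to explain the observed exonic count."""
    return 1.0 - observed_exonic_reads / total_reads


# 15 million reads with 60% intronic leaves about 6 million exonic reads
print(exonic_reads(15_000_000, 0.60))

# Seeing only 1 million exonic reads out of 15 million implies that roughly
# 93% of reads are non-exonic, far more than the reported 40-60% intronic
print(round(implied_non_exonic_fraction(15_000_000, 1_000_000), 3))
```

The gap between the two numbers is what points to a second source of lost reads besides introns.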

In terms of whether it will be okay.... Yes, it will be better for more highly expressed genes. But at 1 million reads, I think "high" means "very high".
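To put a number on "very high", one can look at counting noise alone. The sketch below assumes read counts are Poisson distributed, so the coefficient of variation of a count with mean `n` is `1 / sqrt(n)`. Real data has extra biological and technical variance on top of this, so these values are a lower bound on the noise. The function names and the example expression levels (in counts per million exonic reads) are illustrative assumptions.

```python
import math


def expected_count(cpm, exonic_reads):
    """Expected read count for a gene at `cpm` counts per million exonic reads."""
    return cpm * exonic_reads / 1_000_000


def poisson_cv(mean_count):
    """Coefficient of variation (sd / mean) of a Poisson count with this mean."""
    return 1.0 / math.sqrt(mean_count)


for depth in (1_000_000, 6_000_000):
    for cpm in (1, 10, 100):
        n = expected_count(cpm, depth)
        print(f"{depth:>9,} exonic reads, {cpm:>3} CPM: "
              f"~{n:.0f} reads, CV >= {poisson_cv(n):.0%}")
```

At 1 million exonic reads, a gene needs to sit around 100 CPM before sampling noise alone drops to about 10%.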

ADD COMMENT • link 3.7 years ago by i.sudbery 19k

0

The library we have is polyA. For one dataset, I still have 5-6 million exonic counts. So I think it should be ok.

The problem with ~1 million reads is from a different dataset, where I actually noticed a high (and variable) number of intergenic reads. The % of intergenic reads seems quite high even when I include different non-coding RNA biotypes. In other datasets, I would get inflated intragenic reads when I used mRNAs only. However, the mapped read count dropped dramatically when I used coding + non-coding RNAs in my annotation.

So I am now wondering if gDNA contamination is an issue here as well. In your experience, what is a tolerable amount of gDNA contamination? Ideally we would like none, but with this method there appears to be some level of contamination that we cannot totally get rid of. Since I only take exonic reads for gene expression, my results could still be affected. However, would this render my data totally unusable? Are there steps I could take to rescue the data?

ADD REPLY • link 3.7 years ago by pm2012 ▴ 140