Question: Varying Sample Size vs Read Depth, Read Length, and Single vs Paired end to optimize DE analysis of RNA-Seq
Vincent Laufer wrote (3.4 years ago, United States):

I am ready to order mRNA sequencing on case and control rats for an experiment measuring gene transcription levels. I'd like to identify differentially expressed (DE) transcripts and correlate mRNA expression levels with micro-RNA expression levels obtained in a previous experiment. I need to reliably detect expression differences of ~1.5-fold in brain tissue (amygdala, etc.). I also need to be able to detect splice variants, which I think means I need longer reads and possibly paired-end reads.

Like many researchers, I am limited by cost. In my preliminary studies, I have identified three key variables that I think will increase the cost efficacy (and therefore statistical power) of my study:

1) I can increase sample size at the cost of read depth.

2) I can increase read length to get better mapping, but it will cost more.

3) I can opt for paired-end (PE) reads, which will cost more than single-end (SE) reads but will map better and might help more with detecting splice variants.

Reading through the literature, my conclusion for 1) is that it makes much more sense to sequence a higher number of cases and controls at lower depth.

However, for 2) and 3), I cannot assess the degree of the trade-off between single- vs. paired-end reads, or the read length that gives me the best combination of (mean mapping efficiency) x (cost efficiency per read) = highest total mapped reads per unit price. A complicating factor I don't understand well is how good the rat reference genome is, and how that will influence my mapping efficiency with respect to read length and SE vs. PE.

Summary - if my primary interest is differential expression analysis, but I also want to learn about splice variants, what is the best combination of read depth, read length, and single vs. paired ends? Ultimately, my goal is to optimize statistical power to detect DE genes - what is the optimal experimental design?

I recognize there may be no "right" answer to this, but answers from your experience would be helpful to me. Thank you very much in advance.
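The "mapped reads per unit price" criterion above can be made concrete with a few lines of arithmetic. A minimal sketch - all prices and mapping rates below are made-up placeholders, so substitute quotes from your sequencing core and mapping rates from a pilot run:

```python
# Toy comparison of sequencing configurations by mapped reads per dollar.
# The per-sample read counts, mapping rates, and costs are HYPOTHETICAL
# placeholders, not real Illumina pricing.

configs = {
    # name:       (reads per sample, mapping rate, cost per sample in $)
    "50 bp SE":   (25e6, 0.85, 200),
    "100 bp SE":  (25e6, 0.88, 260),
    "50 bp PE":   (25e6, 0.90, 300),
    "100 bp PE":  (25e6, 0.93, 380),
}

def mapped_reads_per_dollar(reads, map_rate, cost):
    """Mapped reads obtained per dollar spent on one sample."""
    return reads * map_rate / cost

for name, (reads, rate, cost) in configs.items():
    print(f"{name:9s}: {mapped_reads_per_dollar(reads, rate, cost):,.0f} mapped reads/$")
```

With real quotes plugged in, this ranks configurations by the questioner's own objective function, before weighing the isoform-detection benefits that raw mapped-reads-per-dollar ignores.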

harold.smith.tarheel wrote (3.4 years ago, United States):

You're correct that there's no "right" answer (except "more data, better detection"). And the optimal data for differential expression and for isoform detection are different - somewhat contradictory, in fact (more reads vs. longer/paired-end reads).

Re: variable 1, you're correct that more replicates will provide more statistical power. Beyond some number of replicates, though, the marginal improvement in detection will be outweighed by the cost of library construction. This reference provides a cost-benefit analysis that may be useful.
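To get a feel for how power scales with replicate number, here is a rough normal-approximation calculation on the log2 scale. This is only a back-of-envelope sketch, not a substitute for dedicated tools (e.g. the RNASeqPower Bioconductor package): the biological SD (sigma = 0.5 log2 units) and the single-test threshold (no multiple-testing correction) are assumptions.

```python
# Approximate per-gene power for a two-group comparison on the log2 scale,
# via a normal approximation to the two-sample test. sigma is the ASSUMED
# biological SD of log2 expression between replicates; delta is the log2
# fold change to detect (1.5-fold, as in the question).
import math

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def power(n_per_group, delta=math.log2(1.5), sigma=0.5, z_alpha=1.96):
    """Approximate power of a two-sample test with n replicates per group."""
    ncp = delta / (sigma * math.sqrt(2.0 / n_per_group))
    return norm_cdf(ncp - z_alpha)

for n in (3, 5, 8, 12):
    print(f"n = {n:2d} replicates/group: power ~ {power(n):.2f}")
```

Even this crude model shows the steep gain from 3 to ~8 replicates for a 1.5-fold change, which is why the replicates-over-depth advice keeps coming up.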

Re: variable 2, increased read length beyond 50bp provides only a modest improvement (a few %) in alignment and offers little benefit for DE. However, it does increase the likelihood of spanning a splice junction, which is necessary for isoform analysis.
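The junction-spanning point can be quantified with a toy model: for a read placed uniformly along a transcript, the chance it covers a given junction with some minimum anchor on each side grows roughly linearly with read length. The transcript length and minimum overhang below are illustrative assumptions:

```python
# Probability that a uniformly placed read of length read_len spans a given
# splice junction with at least `overhang` bases on each side. The transcript
# length (2 kb) and 8 bp minimum overhang are ASSUMED illustrative values.

def junction_span_prob(read_len, transcript_len=2000, overhang=8):
    """Fraction of valid start positions that place >= overhang bases
    on both sides of a single internal junction."""
    spanning_starts = read_len - 2 * overhang + 1
    total_starts = transcript_len - read_len + 1
    return max(spanning_starts, 0) / total_starts

for read_len in (50, 75, 100, 150):
    p = junction_span_prob(read_len)
    print(f"{read_len:3d} bp reads: P(span junction) ~ {p:.3f}")
```

Under these assumptions, 100 bp reads yield roughly 2.5x as many junction-spanning reads as 50 bp reads - a small absolute gain per read, but it compounds for isoform analysis where junction coverage is the limiting quantity.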

Re: variable 3, paired-end reads only modestly improve mapping, but greatly improve isoform detection.

However, the most important variable is the state of the rat genome/transcriptome. If it's fairly complete, then mapping for DE and isoforms is appropriate. If it is relatively incomplete, then you would need to construct transcript models from your data. That would virtually necessitate the use of paired-end 100bp (or longer) reads. Therefore, I would strongly encourage you to evaluate the existing rat data (e.g., the Bodymap study from SEQC) before beginning your experiment.


If cost is the main driver here, I would avoid doing the splice variant analysis. Splice variant calling with Illumina data is very hit-and-miss and depends heavily on the reference already containing the splice isoforms. You will also need significantly more reads for splice variant analysis, because each splicing event - not just each gene - needs enough reads to be counted, and there are an order of magnitude more splice junctions in rat than there are genes. Hence more data is required.

75bp or 100bp PE is the current sweet spot for RNA-seq DGE. Note that it must be stranded. We usually do 5 or 6 replicates per condition, which gives very robust data for analysis.

Also, I'd recommend against ribodepletion protocols, as you'll have lots of pre-mRNAs in the sample which may muddy the gene counts.

Reply by Chris Cole (3.4 years ago)

Thanks for sharing your experience with splice variant analysis. I have zero experience with splice variant work and actually have no intention of doing those analyses myself. Right now the discussion revolves around whether we want to spend a little extra money on longer PE sequencing, on the chance that someone in our lab could use this data set in the future to ask those types of questions. I'm honestly still undecided on this front; the difference between 50PE and 100PE for my experiment is ~$1000. We may end up going for the 100PE just so we do not regret not having done it.


I also agree with you on not using ribodepletion protocols. My samples are of high quality, so we are going to do poly(A) selection.

Reply by Vincent Laufer (3.4 years ago)

Thank you for sending the Liu paper; it is pushing me further towards prioritizing sample size over read depth. I've already shared it with several of my colleagues, as I think it will help them better design their own RNA-seq experiments.


The Rat BodyMap is another very interesting paper. It will help me a lot with my own research and plans for data analysis. I believe the rat genome is sufficiently well annotated for the type of DE analyses I plan on doing. For example, in the BodyMap paper they report uniquely mapping 88.5% of their reads using 50bp SE.

Reply by Vincent Laufer (3.4 years ago)
Carlo Yague wrote (3.4 years ago):

1) You are right: more replicates will generally increase the power of your differential tests more than increased read depth will. The exception is lowly expressed genes, which require higher depth to be robustly detected.

2) Don't increase read length too much (no more than 100 bp). Longer reads tend to be of lower quality (at least with current Illumina technology), and you'll profit more from shorter paired-end reads than from longer SE reads.

3) I would go 100% for paired-end! As you said, mapping is easier and it'll help a lot with splice variant detection.

4) Another parameter is what you are sequencing. Is it (A) total RNA, (B) polyA-selected RNA, or (C) rRNA-depleted RNA? Depending on what you choose, it will impact the depth required to robustly measure expression.

A) Total RNA is probably a bad idea, since ~99% of your reads will map to rRNA or tRNAs...

B) PolyA selection is great, but you can lose information on non-polyA transcripts (some ncRNAs, etc.). But if you are only interested in mRNA, then it's fine.

C) rRNA depletion is cool because you only exclude rRNA from your RNA sample. However, it is sometimes not 100% efficient, and you'll still sequence rRNA (up to 40% of total reads in my experience). So you'll probably need a bit more depth than with poly-A-selected samples in that case.
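The depth penalty from incomplete rRNA depletion is easy to put in numbers. The carry-over fractions below are assumptions: ~40% is the worst case quoted above for ribodepletion, and ~2% is an assumed typical value for a clean poly(A) library; the 30 M total is likewise just an example.

```python
# Back-of-envelope: usable (non-rRNA) reads per sample after library prep.
# The rRNA carry-over fractions and the 30 M total are ASSUMED examples.

def usable_reads(total_reads, rrna_fraction):
    """Reads left for transcript quantification after discarding rRNA."""
    return total_reads * (1.0 - rrna_fraction)

total = 30e6  # sequenced reads per sample (example)
for protocol, frac in [("poly(A) selection", 0.02), ("ribodepletion", 0.40)]:
    print(f"{protocol:17s}: {usable_reads(total, frac) / 1e6:.1f} M usable reads")
```

In the worst case, that is the difference between ~29 M and ~18 M usable reads for the same lane cost, which is the extra depth Carlo is referring to.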


Thanks for the response. Consensus seems to be forming around shorter PE sequencing, so I appreciate everyone educating me on the pros and cons of these different options.


We plan on sequencing poly(A)-selected RNA. I've already got a seq data set of small ncRNAs, so I am not concerned about losing that data; I am really most interested in the mRNA at this stage. It's been my experience (well, the experience of the sequencing company we're using) that poly(A) selection is very efficient at reducing rRNA reads. I am also not worried about degraded RNA in my samples, so I think poly(A) selection is the way to go (it's also significantly cheaper!). Thanks again for your input, it is appreciated.

Reply by Vincent Laufer (3.4 years ago)
Powered by Biostar version 2.3.0