I often get this question from collaborators and PIs trying to plan their experiments and budgets: how much coverage is sufficient for an RNA-seq experiment?
One problem with this question is that a single meaningful coverage value is hard to define for RNA-seq. Samples can differ in their total amount of transcription, the number of transcribed genes/transcripts, transcriptome complexity (more or less alternative expression), and the distribution of expression levels across those transcripts, not to mention common confounding factors like 3' end bias. All of these factors effectively alter the denominator for any overall coverage calculation. More useful metrics, in my opinion, are things like the total number of reads (and the percentage of those that map to the transcriptome) and the total number of transcripts detected with at least X% of their junctions covered at Yx or more. We usually target at least 10k transcripts with at least 50% of their junctions at 10x or 20x coverage or better. That is approximately what we currently get from a single HiSeq lane of 200-300M reads.
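To make the junction-based metric concrete, here is a minimal Python sketch of how it could be computed. It assumes you already have per-junction read counts for each transcript; the function name `transcripts_passing`, the input layout, and the toy numbers are all made up for illustration and are not taken from any particular pipeline.

```python
def transcripts_passing(junction_cov, min_cov=10, min_frac=0.5):
    """Return IDs of transcripts where at least `min_frac` of the
    junctions are covered by `min_cov` or more reads.

    junction_cov: dict mapping transcript ID -> list of read counts,
    one per exon-exon junction (hypothetical input format).
    """
    passing = []
    for tid, covs in junction_cov.items():
        if not covs:
            # Single-exon transcripts have no junctions to assess.
            continue
        well_covered = sum(1 for c in covs if c >= min_cov)
        if well_covered / len(covs) >= min_frac:
            passing.append(tid)
    return passing


# Toy data, invented for illustration only.
junction_cov = {
    "tx_A": [25, 14, 3, 40],  # 3 of 4 junctions at >= 10x
    "tx_B": [2, 0, 11],       # 1 of 3 junctions at >= 10x
    "tx_C": [10, 9],          # 1 of 2 junctions at >= 10x
    "tx_D": [],               # no junctions, skipped
}

print(transcripts_passing(junction_cov))  # ['tx_A', 'tx_C']
```

The number to compare against the 10k target would then be the length of the returned list, computed at whichever coverage threshold (10x or 20x) you care about.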
But how much coverage is sufficient? This is even harder to answer, as it really depends on what you are hoping to accomplish. If you only need gene expression levels equivalent to, say, an Affymetrix gene expression array, then a single HiSeq lane is probably more than sufficient. The same is true if you only want to validate variants in medium to highly expressed genes. But I would argue that if that's all you want, then don't waste time/money on RNA-seq. What we hope to get from RNA-seq are the above two items plus the ability to confirm variants in lowly expressed genes, get good estimates of expressed variant allele frequencies (VAFs), identify lowly or rarely expressed tumor-specific isoforms, show significant differences between alternative splicing patterns, etc. For all these purposes, the one HiSeq lane described above is just enough to get us started, in my opinion. At present I think it is a good compromise between cost and benefit. But as sequencing prices go down, we will want to increase depth, not decrease it.
We recently found a known TERT promoter mutation in some hepatocellular carcinoma (HCC) tumors we were studying. The mutation is predicted to increase binding of a transcription factor and has been shown to drive subtle but significant 2- to 4-fold increases in transcription. When we look at expression levels for this gene in RNA-seq data, we just barely detect it. In fact, the FPKM levels would normally be considered to be in the noise range. A typical filter of FPKM>1 in at least 20% of samples would eliminate this gene before we even tested for a significant difference between normal/tumor or mutant/wildtype. This is a very important cancer gene, with a known mutation causing functional up-regulation that is almost undetectable at current depth levels if we don't already know to look for it! So I argue that more depth is still needed (cost permitting). Would love to hear other people's thoughts on this.
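For anyone who wants to see how the FPKM>1 in at least 20% of samples filter plays out, here is a minimal Python sketch. The FPKM values are invented for illustration (they are not our TERT data), and the function name `passes_fpkm_filter` is hypothetical. The point is only that a gene can show a roughly 3-fold difference between groups and still be thrown out before any test is run.

```python
def passes_fpkm_filter(fpkms, min_fpkm=1.0, min_frac=0.2):
    """True if more than `min_fpkm` is reached in at least `min_frac`
    of the samples (the pre-filter described in the text)."""
    expressed = sum(1 for v in fpkms if v > min_fpkm)
    return expressed / len(fpkms) >= min_frac


def mean(values):
    return sum(values) / len(values)


# Invented FPKM values for one lowly expressed gene.
wildtype = [0.10, 0.15, 0.08, 0.12, 0.20]
mutant = [0.30, 0.45, 0.28, 0.40, 0.60]

# Mutant samples are about 3-fold higher on average...
fold_change = mean(mutant) / mean(wildtype)
print(f"fold change: {fold_change:.1f}")

# ...but no sample exceeds FPKM 1, so the gene is filtered out.
print(passes_fpkm_filter(wildtype + mutant))  # False
```

Lowering `min_fpkm` would rescue a gene like this, but at current depth the counts behind such low FPKM values are small, so the estimates are noisy, which is why I think the real fix is more depth.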