Question: How to calculate standard deviation between replicates in mRNA-seq data?
6.2 years ago by
vibes100200330 wrote:

Dear all,

I am working with RNA-seq data for the first time. I have 3'-end mRNA-seq data from 8 different liver conditions, each with three replicates. I want to calculate the standard deviation between replicates in order to choose candidate genes for qRT-PCR. I have already done differential expression analysis using DESeq. Please suggest what I need to do.

Thanks in advance.

6.2 years ago by
Czech Republic, Brno, CEITEC
mikhail.shugay3.4k wrote:


Standard deviation is a metric that depends greatly on sample prep and processing. I think it would be unwise to apply it as a criterion to select a set of genes for post-hoc validation. Rather, you should use a P-value cutoff to select DE genes (which DESeq does), and then select your candidates based on a) the fold-change provided by DESeq and b) biological considerations, such as a pathway of interest.
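A minimal sketch of that selection step, assuming a results table with `id`, `log2FoldChange` and `padj` columns like the one DESeq's `nbinomTest()` returns. The values in the toy table are made up, and both cutoffs are illustrative assumptions rather than recommendations.

```r
# Toy table shaped like a DESeq results data frame (values are invented).
res <- data.frame(
  id             = paste0("gene", 1:5),
  log2FoldChange = c(3.1, -2.4, 0.3, 1.8, -0.9),
  padj           = c(0.001, 0.02, 0.6, 0.04, NA),
  stringsAsFactors = FALSE
)

padj_cutoff <- 0.05  # assumed adjusted P-value threshold
lfc_cutoff  <- 1     # assumed threshold: at least 2-fold in either direction

# Keep genes that pass both cutoffs; genes with a missing padj are dropped.
keep <- !is.na(res$padj) &
  res$padj < padj_cutoff &
  abs(res$log2FoldChange) >= lfc_cutoff
candidates <- res[keep, ]

# Rank the survivors by absolute fold-change, strongest first.
candidates <- candidates[order(-abs(candidates$log2FoldChange)), ]
print(candidates)
```

The ranked list is only a starting point; the biological filter (pathway of interest and so on) still has to be applied by hand.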

Also you have not specified which kind of data you want to calculate SD for. You provide a raw counts table to DESeq and you obtain normalized counts from it, so standard R utils like apply(matrix, 1, sd) should give you a SD vector.
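For illustration, here is a small sketch of the `apply(matrix, 1, sd)` idea on a toy matrix. The matrix, the sample names and the `condition` factor are invented; with real data the normalized counts would come from DESeq (for example via `counts(cds, normalized = TRUE)`, where `cds` is a hypothetical name for your count data set).

```r
# Toy normalized-count matrix: 4 genes x 6 samples (2 conditions x 3 replicates).
norm_counts <- matrix(
  c( 10,  12,  11,  50,  55,  45,
    100,  90, 110, 100,  95, 105,
      5,   0,  10,   5,   5,   5,
    200, 210, 190, 400, 380, 420),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4),
                  c("A_1", "A_2", "A_3", "B_1", "B_2", "B_3"))
)
condition <- factor(c("A", "A", "A", "B", "B", "B"))

# SD across all samples: one value per gene.
sd_all <- apply(norm_counts, 1, sd)

# SD between replicates within each condition: genes x conditions matrix.
sd_by_condition <- sapply(levels(condition), function(lv) {
  apply(norm_counts[, condition == lv, drop = FALSE], 1, sd)
})

# Coefficient of variation (SD / mean), since raw SD scales with expression level.
cv_by_condition <- sapply(levels(condition), function(lv) {
  sub <- norm_counts[, condition == lv, drop = FALSE]
  apply(sub, 1, sd) / rowMeans(sub)
})

print(sd_by_condition)
print(cv_by_condition)
```

Computing the SD per condition rather than across all samples keeps the replicate-to-replicate spread separate from the differences between conditions.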


Thanks for the quick reply. My data is from stratified liver cancer patients. My PI has asked for the SD between replicates so that we can choose the strongest candidates, as we are also integrating this data with methylation and miRNA outputs. Any further input would be highly appreciated.

Reply written 6.2 years ago by vibes1002003 (30)

-We want to know SD between replicates so as to choose strongest candidates as told by my PI

Your PI has got some explaining to do.  Choosing the "best" candidates is a little fishy if he tries to imply that the "best" candidates are representative of the accuracy of an experiment.  Here is part of an old thing I wrote about this.  Maybe it will be helpful.

Designing Validation Experiments

Once differentially expressed genes are identified, it is useful to perform some sort of validation of a handful of genes to ensure that the experimental findings are correct. 

One way to perform technical validation is to design the original experiment so that each biological replicate is divided into two technical replicates.  This protocol ensures that all sequencing runs replicate.  Doing this will also increase the statistical power of the experiment somewhat because it will reduce the effects of technical variance within the experiment.  However, this validation may not be stringent enough for all reviewers because it measures genes using the same technology as the original experiment.

A second way to perform technical validation is to run quantitative RT-PCR analyses on the RNA extracted from the same samples that were measured in the original experiment.  This provides an external measurement, but only speaks to the technical accuracy of the experiment. 

To perform biological validation, it is necessary to perform validation on biological samples that are independent of those used in the original experiment.   While this may not always be practical, it provides much better support for the experimental findings than technical validation alone.

However, the expectation that a randomly drawn set of differentially expressed genes will validate in quantitative RT-PCR analyses to a statistically significant degree is usually unrealistic.  Imagine, for example, that you have performed an experiment using three biological replicates each of a test and control condition. From this, you generate a list of genes differentially expressed at p<0.01.  You choose a gene for validation.  If the gene was truly differentially expressed by 2X, you may have only had a 50% chance of identifying it as differentially expressed at p<0.01.  Therefore, assuming that the quantitative RT-PCR validation experiment also has 3 replicates, there may only be about a 50% chance that the gene will be detected as differentially expressed at p<0.01 in the validation samples. This can occur even with perfect measurements because there will still be a high degree of biological variance among samples.
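The argument above can be explored with a rough simulation. This is only a sketch under stated assumptions: expression is modelled as normally distributed on the log2 scale, a two-sample t-test stands in for the real differential expression test, and the biological SD of 0.3 log2 units is an arbitrary choice. The function name `estimate_power` and all its parameters are hypothetical.

```r
set.seed(1)

# Fraction of simulated experiments in which a true fold-change is called
# significant at the given alpha, using n_rep replicates per group.
estimate_power <- function(n_rep, log2_fc = 1, sd_log2 = 0.3,
                           alpha = 0.01, n_sim = 2000) {
  hits <- replicate(n_sim, {
    control <- rnorm(n_rep, mean = 0,       sd = sd_log2)
    test    <- rnorm(n_rep, mean = log2_fc, sd = sd_log2)
    t.test(test, control)$p.value < alpha
  })
  mean(hits)
}

power_3  <- estimate_power(3)   # three replicates, true 2X change (log2 FC = 1)
power_10 <- estimate_power(10)  # same effect with ten replicates

# If the validation is an independent experiment of the same size, the chance
# that both the original and the validation reach significance is the product.
both_3 <- power_3^2

cat("power with 3 replicates: ", power_3,  "\n")
cat("power with 10 replicates:", power_10, "\n")
cat("original and validation both significant (3 reps):", both_3, "\n")
```

Changing `sd_log2` or `n_rep` shows how quickly the chance of a "successful" validation moves with biological variance and replicate number, which is the point of the paragraph above.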

To account for this, people sometimes select genes for validation non-randomly, choosing only genes that have very strong differential expression.  While these genes will likely validate, this approach to validation may not be very informative.  In our paper on yeast expression we instead assessed validation by checking whether gene expression levels fell within a certain confidence interval.

Views on the amount of validation that is required for RNA Seq experiments differ. While it is certainly true that some runs fail, the technical reproducibility of RNA Seq experiments has led some to suggest that even RNA Seq technical validation is not necessary.  The way in which RNA Seq experiments fail seems to be different from the way experiments fail using microarrays.  In microarray experiments, because the probes for genes were physically spotted across the chip, it is possible for only a handful of genes to fail to measure properly (e.g. if there was a bit of contamination on the chip).  The limited number of RNA-Seq runs that we have seen fail did so spectacularly, affecting the whole run.   This scenario seems to support the idea that RNA Seq may require less validation than microarrays, but if the work is for publication your paper may not be given to three reviewers who agree that no validation is necessary.


No one ever does biological replicates.  Well, we did something like that here, in supplemental figure S6, but it was kind of devised after the fact when we were young and naive.  But that should give you an idea of what real validation data looks like, i.e. it does not, and is not expected to, validate perfectly.  If it does, the validation is probably fishy (but not all reviewers understand that).

Reply written 6.2 years ago by Michele Busby (2.1k)

Thank you very much for such a descriptive reply. Regards Rahul

Reply written 6.2 years ago by vibes1002003 (30)

Hmm it appears that your PI wants some kind of ANOVA test for your genes.. Maybe this could be of some help for you, although it is quite a complex paper:

and corresponding R package:

Reply written 6.2 years ago by mikhail.shugay (3.4k)

I am the author of the PLoS ONE paper mentioned. It is rather complex; there is a friendlier description of the tool at:

I think you will find the tool fairly simple to use.

Reply written 6.2 years ago by ggloor (0)