Question: Confidence Interval in R
vinayjrao170 wrote:

Hi,

I am working with RNA-seq data and studying whether the expression of a set of genes is subtype specific. I have plotted the expression of these genes across the different subtypes, and to check whether the differences are statistically significant, I want to perform a t-test wherever I see a difference between two molecular subtypes.

I performed the t-test with the `t.test()` function in R and I have two questions about the output:

First, sometimes it gives me an exact p-value, for example 3.921e-14, but at other times it just says p-value < 2.2e-16. Is that the smallest p-value that can be displayed, or can I get the exact p-value?

Secondly, the confidence level is 0.95 by default, which I changed to 0.99, 0.999 and so on. Yet the p-value never changes. To confirm this, I also tried confidence levels of 0.5 and 0.1.

Any help or advice on the two points would be greatly appreciated.

Thanks.

modified 2.3 years ago by Jean-Karim Heriche22k • written 2.3 years ago by vinayjrao170

Look up "edgeR" or "DESeq2" for differential gene expression testing on RNA-seq data. A t-test is not appropriate in this situation because of the way the data are distributed (you also probably need to normalize for sequencing depth between samples).

You are correct that the p-value does not change: the p-value of a test does not depend on the confidence level you choose for the interval.
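A quick way to convince yourself of this is the sketch below. It uses simulated values, not real expression data, and the names `subtype_a` and `subtype_b` are made up for illustration. It runs `t.test()` at two confidence levels and compares the results.

```r
set.seed(1)

# Hypothetical expression values for two subtypes (simulated, not real data)
subtype_a <- rnorm(20, mean = 5, sd = 1)
subtype_b <- rnorm(20, mean = 6, sd = 1)

res95 <- t.test(subtype_a, subtype_b, conf.level = 0.95)
res99 <- t.test(subtype_a, subtype_b, conf.level = 0.99)

# The p-value is the same for both calls
identical(res95$p.value, res99$p.value)

# Only the interval changes: the 99% interval is wider than the 95% one
width95 <- diff(res95$conf.int)
width99 <- diff(res99$conf.int)
width99 > width95

# For the default two-sided test, the interval at level 1 - alpha excludes 0
# exactly when the p-value is below alpha
excludes_zero <- function(ci) ci[1] > 0 || ci[2] < 0
excludes_zero(res95$conf.int) == (res95$p.value < 0.05)
excludes_zero(res99$conf.int) == (res99$p.value < 0.01)
```

So raising the confidence level does not alter the p-value. It only widens the interval, which corresponds to comparing the same p-value against a stricter threshold.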

Thanks for the advice on edgeR and DESeq2. I will certainly look into them, but why is a t-test not appropriate in this case?


In short, every test works under certain assumptions, and breaking those assumptions breaks the test. RNA-seq data breaks the t-test's assumption that the data are drawn from a normal distribution. In practice, you would fail to detect true differences and might "detect" false ones too.

Imagine a data set with a clump of points in one corner and an outlier in the other. A t-test, assuming the data are normally distributed, will estimate the mean to lie somewhere between the outlier and the clump, completely misrepresenting the true distribution (which is likely centered around the clump). The t-test is really comparing this estimated distribution to another estimated distribution, so if the estimate is faulty, the result of the test will be too.
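As a rough illustration of the clump-plus-outlier scenario, here is a small sketch with made-up numbers (the vectors `group1` and `group2` are hypothetical, not real expression values). A single extreme value drags the mean away from the clump and inflates the variance, so the t-test no longer sees a difference that is obvious by eye.

```r
# Made-up values: two tight clumps that clearly differ, plus one outlier
group1 <- c(10.1, 9.8, 10.3, 10.0, 9.9)
group2 <- c(12.0, 12.2, 11.9, 12.1, 60)   # last value is the outlier

# The mean of group2 is pulled far away from the clump; the median is not
mean(group2)
median(group2)

# Welch t-test (the t.test() default) with and without the outlier
p_with    <- t.test(group1, group2)$p.value
p_without <- t.test(group1, group2[-5])$p.value

p_with      # large: the inflated variance hides the difference
p_without   # very small: the clumps are clearly separated
```

This is only a toy case of one way the normality assumption can fail. Count data from RNA-seq has further properties (discreteness, variance that grows with the mean) that the dedicated packages model explicitly.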

Furthermore, both edgeR and DESeq2 have good methods for normalizing the sequencing depth between samples (FPKM, i.e. dividing by the total number of reads and the gene length, is not good enough).
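To give a feel for why total-count scaling can mislead, here is a simplified base-R sketch of the median-of-ratios idea that DESeq2's size factors are based on. It is not the package's implementation, the count matrix is invented, and real data would also need genes with zero counts excluded before taking logs.

```r
# Hypothetical raw counts: rows are genes, columns are samples.
# sample2 is sequenced twice as deep as sample1, sample3 half as deep,
# but gene4 is strongly up-regulated in sample3.
counts <- matrix(
  c(100, 200,  50,
    200, 400, 100,
    300, 600, 150,
     10,  20, 500),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), paste0("sample", 1:3))
)

# Total-count scaling relative to sample1: gene4 inflates sample3's total
total_factors <- colSums(counts) / colSums(counts)[1]

# Median of ratios: compare each sample to the per-gene geometric mean
# and take the median ratio, which is robust to a few extreme genes
log_geo_means <- rowMeans(log(counts))
size_factors <- apply(counts, 2, function(sample_counts) {
  exp(median(log(sample_counts) - log_geo_means))
})

total_factors
size_factors
```

In this toy matrix the total-count factor suggests that sample3 was sequenced more deeply than sample1, while the median-based factor recovers the half-depth relationship seen in the three unchanged genes.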

Jean-Karim Heriche22k wrote:

On the first point, 2.2e-16 is the machine epsilon for double-precision numbers on most computers, and R prints any smaller p-value as "< 2.2e-16". In your R terminal, try:

```r
.Machine$double.eps
```
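If you need the number itself, the object returned by `t.test()` stores the p-value without the display cutoff. A minimal sketch with simulated data (the vectors `x` and `y` are made up for illustration):

```r
set.seed(42)

# Simulated groups with a very large difference, so the p-value is tiny
x <- rnorm(50, mean = 0)
y <- rnorm(50, mean = 3)

res <- t.test(x, y)

# The printed summary reports "p-value < 2.2e-16"
print(res)

# The stored value is an ordinary double and can be much smaller
res$p.value
res$p.value < .Machine$double.eps

# format.pval() applies the cutoff when printing; a smaller eps shows more
format.pval(res$p.value, eps = 1e-300)
```

Keep in mind that p-values this small mostly tell you "effectively zero"; the exact digits depend heavily on how well the test's assumptions hold in the far tails.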

On the second point, I think you may be confusing p-values and confidence intervals. The p-value is a probability that measures the evidence against the null hypothesis: it is the probability of getting the observed value of the test statistic, or a more extreme one, if the null hypothesis is true. The confidence interval is a range of values that is expected to contain the true value of the parameter of interest with some level of confidence. A confidence level of 0.95 means that if you repeated the experiment many times, about 95% of the intervals computed this way would contain the true value.
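The repeated-sampling reading of the confidence level can be checked with a small simulation. This is only a sketch: the true mean, sample size, and number of repetitions below are arbitrary choices made for illustration.

```r
set.seed(123)

true_mean <- 10     # known here only because the data are simulated
n_sims    <- 2000   # number of repeated "experiments"

# For each simulated sample, compute a 95% interval and record
# whether it contains the true mean
covered <- replicate(n_sims, {
  x  <- rnorm(15, mean = true_mean, sd = 2)
  ci <- t.test(x, conf.level = 0.95)$conf.int
  ci[1] <= true_mean && true_mean <= ci[2]
})

# Fraction of intervals that contain the true mean; should be near 0.95
coverage <- mean(covered)
coverage
```

Changing `conf.level` to 0.99 would make each interval wider and push the coverage towards 0.99, while the p-value of any single test stays the same.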

Thanks for the summary. Just to check that I understand correctly: are you saying that my result may become non-significant at a more stringent confidence level, even though the p-value itself does not change?
