Question

Why does my t-test fail?

2.6 years ago

C_sinensis ▴ 30

Hello, I am trying to perform a t-test in Python to estimate whether the expression of a gene is statistically different between two clusters of cells in an scRNA-seq experiment.

When I do the following:

from scipy import stats as st
st.ttest_ind(counts[gene_1], counts[gene_2])

I get a p-value for some of my comparisons. In others, however, I get the following:

Ttest_indResult(statistic=nan, pvalue=nan)

A nan instead of a p-value.

Please correct me if I am wrong, but I think a t-test requires the data to be normally distributed. I have log-transformed my gene counts, which should help with that, but I still get a very zero-inflated distribution. Below are histograms of the expression of a gene in two clusters of cells for which a t-test returns a nan value:

[Figure: p-value = nan]

And here is a histogram of the expression of another gene in two clusters of cells for which a t-test returns a valid p-value:

[Figure: valid p-value]

It looks like the reason is the low expression level of some genes in the cells that express them, but I don't understand why this leads to a nan p-value.

I have a number of questions: Am I right to use a t-test in this case? Is there another test I should be using instead? What should I do with these nan p-values, or how should I avoid them? Any help understanding what is going on would be very useful.

Thanks so much, and my apologies in advance for my poor knowledge of statistics.

t-test • 5.3k views

updated 2.6 years ago by ATpoint 81k • written 2.6 years ago by C_sinensis ▴ 30

I would suggest that you always try to use wrappers to run statistical tests. I am pretty sure that in Python something like Scanpy has differential testing routines that you can apply. The same goes for Seurat and Bioconductor. It is also not really meaningful to test a single gene; rather, test all genes and then see whether your gene survives the FDR correction. After all, (sc)RNA-seq is an assay to probe all genes and not just a single one, so the measurements are not independent and it is probably not appropriate to just t-test single genes. Check Scanpy if you're in Python and do not implement testing yourself; there are too many caveats.

2.6 years ago by ATpoint 81k

Ram · Answer 1 · 2021-09-02

A t-test requires normally distributed values if you want the assumptions to be upheld and the results to be reliable. It's often used on non-normal data, since most data is not perfectly normal, and you can decide if your data is close enough. In this case, looking at your histograms, I'd say NO, this is not going to give reliable estimates. The idea of the t-test is to evaluate if the means are different, and the means of your datasets are probably not well defined right now: the values are mostly zero. Still, the software doesn't know this and will give you results regardless of their validity. The nan means something else failed.

What are counts[gene_1] and counts[gene_2] anyway? If they are all zero, then there is no way to calculate the variability and no way to say if the means are different, so it can't calculate a p-value. But if you were to look and see it's a bunch of zeros, you'd know right away, with no statistical test, that the means are in fact the same. Or the gene was not covered in your experiment and there's no evidence one way or the other. That's why it can't give you a p-value.
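To make the all-zero case concrete, here is a minimal sketch of a pooled-variance two-sample t statistic written with the standard library only. It is not SciPy's implementation; the function name `pooled_t_statistic` and the toy data are made up for illustration. When both groups are constant and equal, the mean difference and the standard error are both zero, so the statistic is 0/0, which floating-point array libraries typically report as nan.

```python
import math
from statistics import mean, variance


def pooled_t_statistic(a, b):
    """Student's two-sample t statistic with pooled variance (illustrative only)."""
    n_a, n_b = len(a), len(b)
    diff = mean(a) - mean(b)
    # Pooled sample variance across both groups.
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    if std_err == 0:
        # Plain Python would raise ZeroDivisionError here. We mimic IEEE
        # floating-point division instead: 0/0 is nan, x/0 is +/-inf.
        return math.nan if diff == 0 else math.copysign(math.inf, diff)
    return diff / std_err


# Hypothetical expression values: the gene is not detected in either cluster.
all_zero_a = [0.0] * 5
all_zero_b = [0.0] * 8
print(pooled_t_statistic(all_zero_a, all_zero_b))  # nan

# As soon as there is some variability, the statistic is a finite number.
print(pooled_t_statistic([0.0, 0.0, 1.2, 0.0], [0.0, 2.3, 1.9, 0.0]))
```

A quick way to check whether this is what happens in your data is to look at the two inputs directly before testing, for example whether every value in both groups is identical.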

Maybe you shouldn't be running a bunch of t-tests; what are you really trying to accomplish here? Differential gene expression is a solved problem, with many tools available that have more advanced hypothesis testing and multiple-testing correction built in.
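For a sense of what the multiple-testing correction in those tools does, here is a minimal sketch of the Benjamini-Hochberg FDR adjustment in plain Python. The function name and the example p-values are hypothetical, and it assumes the input contains no nan values; dedicated differential expression packages handle this for you.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, so each adjusted value is the
    # minimum of p * m / rank over all ranks at or above the current one.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values, one per gene.
raw_p = [0.01, 0.04, 0.03, 0.005]
adj_p = benjamini_hochberg(raw_p)
for raw, adj in zip(raw_p, adj_p):
    print(f"raw={raw:.3f}  adjusted={adj:.3f}")
```

A gene is then called significant at a chosen FDR level (say 0.05) if its adjusted p-value is below that level, rather than its raw p-value.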

score 1 · Answer 2 · 2021-09-02

The two-sample t-test is not appropriate here. The t-test assumes the underlying data for each sample is approximately normally distributed, but your scRNA-seq data does not look approximately normal. scRNA-seq has an abundance of zero expression counts, which can make the standard error for the t-test exactly zero (for example, when every value in both groups is zero), or you may have NaN values in one of your datasets; in either case you get NaN for the test statistic and the p-value. Read more about the t-test here: https://www.reneshbedre.com/blog/ttest.html

In this case, you should filter the scRNA-seq data to keep only the genes with nonzero expression in at least a certain number of cells, and then run the test again. Read more about scRNA-seq filtering here: https://www.nature.com/articles/nmeth.4263
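As a rough sketch of that filtering step, assuming the expression values are held in a dictionary that maps each gene name to a list of per-cell values (the layout, the gene names, and the `min_cells` threshold are all assumptions for illustration, not a recommendation from the linked paper):

```python
def filter_genes(expression, min_cells=3):
    """Keep genes with nonzero expression in at least `min_cells` cells."""
    kept = {}
    for gene, values in expression.items():
        # nan compares as False here, so missing values count as "not expressed".
        n_expressing = sum(1 for v in values if v > 0)
        if n_expressing >= min_cells:
            kept[gene] = values
    return kept


# Hypothetical log-transformed expression values for six cells.
expression = {
    "gene_1": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],  # never detected
    "gene_2": [0.0, 1.1, 0.0, 2.4, 0.7, 0.0],  # detected in three cells
    "gene_3": [0.0, 0.0, 0.0, 0.9, 0.0, 0.0],  # detected in one cell
}
filtered = filter_genes(expression, min_cells=3)
print(sorted(filtered))  # ['gene_2']
```

Genes that never pass the threshold are simply not tested, which avoids the comparisons that would otherwise come back as nan.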