Hello, I am trying to perform a t-test in Python to test whether the expression of a gene is statistically different between two clusters of cells in an scRNA-seq experiment.
When I do the following:
from scipy import stats as st
st.ttest_ind(counts[gene_1], counts[gene_2])
I get a p-value for some of my comparisons. In others, however, I get a nan instead of a p-value.
Please correct me if I am wrong, but I think a t-test requires the data to be normally distributed. I have log-transformed my gene counts, which should help with that, but I still get a very zero-inflated distribution. Find below histograms of the expression of a gene in two clusters of cells for which a t-test returns a nan value:
And here is a histogram of the expression of another gene in two clusters of cells for which a t-test returns a valid p-value:
It looks like the reason is the low expression level in the cells that express the gene at all, but I don't understand why this leads to a nan p-value.
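In case it helps to narrow this down, here is a minimal sketch of the checks I could run on the two groups before the test. It is not my actual pipeline: describe_group and the toy arrays are hypothetical, and I am assuming the inputs are plain array-like values. It only looks for things that I know can make the result undefined, such as nan or infinite entries (a plain np.log of a zero count gives -inf) or a group with no variance.

```python
import numpy as np
from scipy import stats as st


def describe_group(values):
    """Summarise properties of one group that can make a t-test undefined."""
    x = np.asarray(values, dtype=float)
    finite = x[np.isfinite(x)]
    return {
        "n": int(x.size),
        "n_nan": int(np.isnan(x).sum()),
        "n_inf": int(np.isinf(x).sum()),
        "n_zero": int((x == 0).sum()),
        # Sample variance of the finite values; nan if fewer than two remain.
        "variance": float(np.var(finite, ddof=1)) if finite.size > 1 else float("nan"),
    }


# Hypothetical toy data, not taken from my experiment.
group_a = np.array([0.0, 0.0, 0.0, 1.2, np.nan])
group_b = np.array([0.0, 0.3, 0.0, 0.0, 0.9])

summary_a = describe_group(group_a)
summary_b = describe_group(group_b)
print(summary_a)
print(summary_b)

# With the default nan_policy="propagate", one nan in the input gives a nan p-value.
result = st.ttest_ind(group_a, group_b)

# nan_policy="omit" drops nan entries before computing the statistic.
result_omit = st.ttest_ind(group_a, group_b, nan_policy="omit")
print(result.pvalue, result_omit.pvalue)
```

If the summaries show nan or infinite entries, or a variance of zero in both groups, that would at least tell me where the nan comes from, even if it does not tell me whether the t-test is the right choice.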
I have a number of questions: Am I right to use a t-test in this case? Is there another test I should be using instead? What should I do with these nan p-values, or how can I avoid them? Any help understanding what is going on would be very useful.
Thanks so much, and my apologies in advance for my poor knowledge of statistics.