Hello,

I have a question about which statistical test to perform for my data. I have count data from tissue that I stained for three markers:

- Marker for my cell type of interest (i.e. type II pneumocyte marker)
- Broad marker of the cell type my cell type of interest belongs to (i.e. epithelial marker)
- Broad marker that marks all cells (i.e. nuclei marker)

I have performed these counts for 3 biological replicates per timepoint (i.e. Timepoint 1 replicate 1, Timepoint 2 replicate 1, etc.). The timepoints/replicates are unpaired. For each sample, I counted cells in multiple regions of interest and then summed the counts for that sample. The summed data look something like this:

```
data <- data.frame(Sample = c("Timepoint1_Rep1", "Timepoint1_Rep2", "Timepoint1_Rep3", "Timepoint2_Rep1", "Timepoint2_Rep2", "Timepoint2_Rep3"),
Pneumocytes = c(84, 46, 96, 149, 670, 555),
Epithelial = c(292, 110, 248, 351, 1099, 997),
AllCells = c(1214, 799, 2576, 2074, 3253, 3847))
```

I want to see if the number of Pneumocyte cells is significantly different in timepoint 1 vs. timepoint 2, specifically as a fraction of epithelial cells. I'm not sure if the total number of cells will affect the data as well (I'm assuming it does). What test should I perform for this?

Technically, you should run two tests.

First, run a multiple regression analysis to determine whether your number of pneumocytes depends on the number of epithelial cells and on the number of all cells. Pay attention to the p-values associated with the coefficients for both cell numbers. If either or both p-values are significant (usually less than 0.05), it suggests that the corresponding variable (number of epithelial cells or number of all cells) is significantly related to the number of pneumocytes. If you want to add the categorical variable `timepoint` to the model, you would need to fit a General Linear Model (GLM) instead, but I think that would overcomplicate things.

If your multiple regression analysis shows that your number of pneumocytes is dependent, you need to normalize. Otherwise, skip the normalization. To determine whether there is a difference between the two time points, I'd favour a non-parametric test, specifically a Mann-Whitney test, because in contrast to the t-test, it doesn't assume a particular distribution of the values.
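The two-step workflow described above could be sketched in R roughly as follows. This is only an illustration under assumptions: the `Timepoint` and `Fraction` columns are names introduced here, the model formula is one plausible reading of "multiple regression on both cell numbers", and "normalize" is taken to mean dividing by the epithelial count.

```r
# Counts copied from the question.
data <- data.frame(
  Sample = c("Timepoint1_Rep1", "Timepoint1_Rep2", "Timepoint1_Rep3",
             "Timepoint2_Rep1", "Timepoint2_Rep2", "Timepoint2_Rep3"),
  Pneumocytes = c(84, 46, 96, 149, 670, 555),
  Epithelial  = c(292, 110, 248, 351, 1099, 997),
  AllCells    = c(1214, 799, 2576, 2074, 3253, 3847)
)

# Assumption: the timepoint is encoded in the sample name before "_Rep".
data$Timepoint <- factor(sub("_Rep[0-9]+$", "", data$Sample))

# Step 1: multiple regression of the pneumocyte count on both totals.
# The coefficient table holds the estimates and their p-values.
fit <- lm(Pneumocytes ~ Epithelial + AllCells, data = data)
print(summary(fit)$coefficients)

# Step 2 (assumed normalization): pneumocytes as a fraction of epithelial cells.
data$Fraction <- data$Pneumocytes / data$Epithelial

# Step 3: Mann-Whitney test (called wilcox.test in R) between the two timepoints.
mw <- wilcox.test(Fraction ~ Timepoint, data = data)
print(mw)
```

With only six samples, the regression has three residual degrees of freedom, so its p-values are best read as a rough guide rather than a firm decision rule.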

Thanks for the response! I performed the multiple regression analysis with:

and got the following:

which I interpret to mean that I should normalize to the number of Epithelial cells, correct? I normalized the counts and then performed the Mann-Whitney test with:

which gave me:

Does this look like an appropriate workflow?

Another thing to add - I also quantified other epithelial cell types and am interested in doing the same timepoint comparison for those cell types. I've done the multiple regression analysis for those other cell types, and for some the effect of the total Epithelial count isn't significant. In that case, would I not normalize to the total Epithelial population, even if I'm interested in representing the data as a proportion on a bar graph?

(My statistics knowledge is admittedly a bit rusty, so please double-check everything I am saying here.)

The p-value is the probability of obtaining a test statistic that extreme or even more extreme under the assumption that the null hypothesis is true. Thus, the significance threshold corresponds to the probability of committing a type I error (registering a false positive). The thresholds 0.05 (5%), 0.01 (1%) and 0.001 (0.1%) are commonly chosen, but what you (and the reviewers of your manuscript) deem acceptable under the specific circumstances of the test is arbitrary.

Hence, whether the epithelial count is significant for a given cell type is not a decisive factor per se, but it is of course a strong indicator. Nonetheless, you are allowed and also supposed to use any other bits of information, both in terms of biology and statistics, to guide your analysis. If the cells you are quantifying are considered to be a subset of epithelial cells based on cell markers and biology (e.g. they differentiate from epithelial stem cells), then you have strong arguments to normalize even if your test statistic does not meet your chosen significance threshold.

Also look at the other values to interpret your test result. Firstly, you have an adjusted R-squared value of 0.9809. That is quite high and means your model explains nearly all of the variation observed in the pneumocyte counts. The coefficient of 0.67 for epithelial cells means that for every additional epithelial cell you record, you record on average 0.67 more pneumocytes, with the total cell count held constant. Hence, on a slide that has 100 more epithelial cells than another, you also expect to count on average 67 more pneumocytes. But (depending on how you calculate your confidence interval from the standard error) anything in the range of 50 to 80 more pneumocytes would also not raise an eyebrow.
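To make the reading of the coefficient concrete, here is a minimal sketch that pulls the slope, a confidence interval, and the adjusted R-squared out of a fitted model. It assumes the model was `Pneumocytes ~ Epithelial + AllCells` fitted to the six samples from the question; the asker's actual call is not shown in the thread, and the variable names `slope` and `ci` are introduced here.

```r
# Counts copied from the question.
data <- data.frame(
  Pneumocytes = c(84, 46, 96, 149, 670, 555),
  Epithelial  = c(292, 110, 248, 351, 1099, 997),
  AllCells    = c(1214, 799, 2576, 2074, 3253, 3847)
)

# Assumed model: pneumocyte count explained by both totals.
fit <- lm(Pneumocytes ~ Epithelial + AllCells, data = data)

# Slope for epithelial cells, with the total cell count held constant.
slope <- coef(fit)["Epithelial"]

# t-based 95% confidence interval for that slope (one row, two columns).
ci <- confint(fit, "Epithelial", level = 0.95)

# Expected extra pneumocytes per 100 additional epithelial cells,
# followed by the lower and upper bound of that expectation.
print(100 * slope)
print(100 * ci)

# Share of variance explained, adjusted for the number of predictors.
print(summary(fit)$adj.r.squared)
```

A t-based interval with so few residual degrees of freedom tends to be wider than a simple "estimate plus or minus one standard error" range, which is one reason the plausible range depends on how the interval is calculated.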

If you have more cell types of interest, you can normalize them differently if you treat and also display them separately (especially if there are biological differences). If you plan to include them in the same plot or panel, you should also find a common normalization. In that case, you can run a Multivariate Analysis of Variance (MANOVA) to test which of the independent variables that you believe might affect the cell counts could be factors or covariates. Unfortunately, this test assumes multivariate normality, and I doubt that this holds for lung tissue in general (but if your sections are very comparable, it might).

This is also the reason why I recommended a non-parametric test, because it does not make any assumptions about the distribution from which your realizations (cell numbers counted) are drawn. Those tests are very conservative, so they are not the favourite choice of authors who engage in p-hacking just to publish an "X causes Y: Statistically significant!" paper. But you may choose a different test if you have good reasons for it and are aware of the assumptions you are making.
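How conservative a rank test is with this design can be checked by brute force. The sketch below enumerates every way of assigning ranks to 3 versus 3 samples without ties and computes the smallest two-sided exact p-value a Mann-Whitney test can return; the names `rank_sets`, `u`, `p_min` and `p_check` are introduced here for illustration.

```r
# Every way to give 3 of the 6 ranks to the first group (20 combinations).
rank_sets <- combn(6, 3)

# Mann-Whitney U of the first group: rank sum minus n1 * (n1 + 1) / 2.
u <- colSums(rank_sets) - 3 * (3 + 1) / 2

# Under the null hypothesis all combinations are equally likely, so the
# two-sided p-value of the most extreme result is the share of combinations
# that are at least as extreme (smallest or largest possible U).
p_min <- mean(u == min(u) | u == max(u))
print(p_min)  # 2 / 20 = 0.1

# Cross-check with R's exact test on two perfectly separated groups.
p_check <- wilcox.test(c(1, 2, 3), c(4, 5, 6))$p.value
print(p_check)
```

So with three replicates per timepoint, even perfectly separated groups cannot get below p = 0.1 in this test, whatever the size of the difference.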

Counting more than 3 slides per time point, for example, is a good idea because it lets you understand the underlying distributions better and test them for normality. It is not my cup of tea, but there are really advanced AI models for image segmentation (U-Net and successors), so if you have 50 slides and just can't manually count so many cells, you could use those on photos and get excellent data.

One idea would be to divide Pneumocytes by Epithelial since that's what you're interested in, and then divide again by AllCells to normalize. Then you can do a t-test to compare time point 1 to time point 2.
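A minimal sketch of what this suggestion might look like in R, using the counts from the question. The vector names are introduced here, and the choice of Welch's unequal-variance t-test is simply R's default for `t.test`, not something the suggestion specifies. The plain epithelial fraction is tested alongside for comparison.

```r
# Counts copied from the question, three replicates per timepoint.
timepoint   <- factor(rep(c("Timepoint1", "Timepoint2"), each = 3))
pneumocytes <- c(84, 46, 96, 149, 670, 555)
epithelial  <- c(292, 110, 248, 351, 1099, 997)
all_cells   <- c(1214, 799, 2576, 2074, 3253, 3847)

# Pneumocytes as a fraction of epithelial cells.
fraction <- pneumocytes / epithelial

# The suggested second step: divide again by the total cell count.
fraction_per_cell <- fraction / all_cells

# Welch two-sample t-tests (var.equal = FALSE is the default in R).
tt <- t.test(fraction_per_cell ~ timepoint)
tt_fraction <- t.test(fraction ~ timepoint)

print(tt)
print(tt_fraction)
```

Note that the t-test assumes approximately normal values within each group, which is hard to verify with three replicates.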

What sort of data is this? Something like RNA-seq?

This is data from stained tissue sections, in which we manually counted the cell types based on the markers described above.