Question

Effect of sample size on log fold change in gene expression analysis

3.9 years ago

Gene_MMP8 ▴ 240

I am doing gene expression analysis using a set of 228 patients (52 positive and 176 negative). I found a list of 17 DEGs that were significant (|log fold change| > 1 and p-value < 0.1). I later found out that I can't include 30 patients from my negative class because they don't meet certain criteria. So, I removed them and redid the analysis. Now I get fewer DEGs than before (12 in number), and all 12 overlap with the initially found 17 DEGs. The remaining five genes that were not selected as DEGs had a log fold change very close to 1.
I don't understand a couple of things:

What is the relation between sample size and log fold change? Why does it change when I remove samples?
Is there any statistical test I can do to show that the 5 genes that I am not getting as DEGs in the second analysis are just due to a statistical anomaly and nothing else?

RNA-Seq R • 1.3k views

updated 3.9 years ago by dsull ★ 5.8k • written 3.9 years ago by Gene_MMP8 ▴ 240

score 3 · Answer 1 · 2020-05-16

There is no systematic relation: log fold change is an estimate of effect size, not a measure of uncertainty. It is still computed from your samples, though, so if you remove samples, of course your log fold changes are going to change. Naively, think of a gene's fold change as the average expression of that gene in one group divided by its average expression in the other group. If you start removing samples from one group, the average expression of that gene in that group is going to change, and hence your fold change will also change. For genes whose log fold change was already very close to 1, a small shift is enough to push them below your cutoff.
There is no such thing as a "statistical anomaly" (in the way that you use the term). Statistics is used to describe and quantify uncertainty, which is why we have p-values. If I see a p-value of 0.00000001, I'm going to be more certain that a gene is "differentially expressed" than if I see a p-value of 0.44. Could a gene with a p-value of 0.44 still be differentially expressed? In theory, yes, but we don't have sufficient evidence for it as is.
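To make the first point concrete, here is a minimal sketch in Python using only the standard library. The expression values are made up for illustration, and the fold change is the naive "ratio of group means" described above, not what limma or any other package actually computes. It shows a gene that sits just above the |log fold change| > 1 cutoff with all negatives included and just below it once a few negatives are dropped.

```python
import math
from statistics import mean

def log2_fold_change(group_a, group_b):
    """Naive log2 fold change: log2(mean of group A / mean of group B)."""
    return math.log2(mean(group_a) / mean(group_b))

# Hypothetical expression values for a single gene (arbitrary units).
positives = [220.0, 180.0, 200.0, 240.0, 160.0]
negatives_kept = [90.0, 110.0, 100.0, 95.0, 110.0]
negatives_excluded = [60.0, 70.0, 50.0]  # samples later found not to meet criteria

# First analysis: every negative sample is included.
lfc_all = log2_fold_change(positives, negatives_kept + negatives_excluded)

# Second analysis: the excluded negatives are removed.
lfc_subset = log2_fold_change(positives, negatives_kept)

print(f"log2FC, all negatives:      {lfc_all:.3f}")
print(f"log2FC, after exclusion:    {lfc_subset:.3f}")
print(f"passes |log2FC| > 1 before: {abs(lfc_all) > 1}")
print(f"passes |log2FC| > 1 after:  {abs(lfc_subset) > 1}")
```

Nothing about the gene changed between the two runs; only the set of samples behind the group mean did. A hard cutoff will always flip some genes that sit near it.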

Also, I should mention the following:

How are you doing gene expression analysis? Is this RNA-seq/microarray data? If so, I'm hoping you're using established packages such as limma for your analyses. How are you getting your p-values?
A raw p-value < 0.1 cutoff isn't sufficient. You need to use false discovery rate (FDR) control and "adjust" your p-values. Read up on the multiple comparisons problem. It's possible that all your so-called differentially expressed genes are false positives.
You need to be more rigorous with how you're analyzing data. Whether this is qPCR, microarray, or RNA-seq, have you done any quality control analysis of your data? Have you created a PCA plot or heatmap to even see if your "positive" patients cluster separately from your "negative" patients?
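As a rough illustration of the FDR point above, here is a small Python sketch of the Benjamini-Hochberg adjustment, written by hand with the standard library. The raw p-values are invented for the example; in a real analysis you would take the adjusted values reported by your package, or use R's `p.adjust(p, method = "BH")`, rather than rolling your own.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the original order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, scaling by m / rank and
    # taking a running minimum so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw p-values for six genes.
raw = [0.001, 0.008, 0.039, 0.041, 0.09, 0.6]
adj = benjamini_hochberg(raw)

for p, q in zip(raw, adj):
    print(f"raw p = {p:<6} adjusted p = {q:.4f}")

print("genes with raw p < 0.1:      ", sum(p < 0.1 for p in raw))
print("genes with adjusted p < 0.05:", sum(q < 0.05 for q in adj))
```

With only six tests the correction is mild. Across thousands of genes the factor m / rank is far larger, which is why a list built on raw p-values can shrink a lot, or vanish, after adjustment.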