Hi everyone,
I have the counts for 18160 genes that I'm studying. I would like to know whether gene expression differs between two conditions: bad responders to a treatment (BR) and good responders to the same treatment (GR). The sample sizes are BR=11 and GR=20, which I would consider imbalanced.
For the gene expression analysis I'm following the edgeR pipeline, which I have used before. Once the analysis was done, I checked the distribution of the raw p-values (no correction applied), and I saw what you can see in the image.
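For anyone who wants to run the same kind of check on their own results, here is a minimal sketch in plain Python. It assumes the raw p-values have already been exported from the analysis as a list of floats. The function names (`pvalue_histogram`, `estimate_pi0`) and the threshold `lam` are made up for illustration and are not part of edgeR. The second function is a simple Storey-type estimate of the fraction of genes that look null, based on how many p-values fall above `lam`.

```python
from typing import List


def pvalue_histogram(pvalues: List[float], n_bins: int = 20) -> List[int]:
    """Count p-values in n_bins equal-width bins covering [0, 1]."""
    counts = [0] * n_bins
    for p in pvalues:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"p-value out of range: {p}")
        # p == 1.0 would index one past the end, so clamp it into the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        counts[idx] += 1
    return counts


def estimate_pi0(pvalues: List[float], lam: float = 0.5) -> float:
    """Storey-type estimate of the null fraction: #{p > lam} / (m * (1 - lam)).

    The result is capped at 1.0 because the raw ratio can exceed 1.
    """
    m = len(pvalues)
    if m == 0:
        raise ValueError("need at least one p-value")
    above = sum(1 for p in pvalues if p > lam)
    return min(1.0, above / (m * (1.0 - lam)))


# Toy p-values (illustrative only, not from my data).
toy = [0.001, 0.004, 0.01, 0.02, 0.03, 0.2, 0.45, 0.6, 0.75, 0.9]
print(pvalue_histogram(toy, n_bins=10))
print(estimate_pi0(toy, lam=0.5))
```

A tall first bin on top of a roughly flat remainder, together with a small `estimate_pi0`, would be one way to put a number on "skewed towards 0".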
The p-values were highly skewed towards 0. This is the first time I'm seeing this, and I wasn't sure about the reason behind this weird distribution. After some research, I found these options as possible reasons:
- Some test criteria were not met. I doubt this, except for the imbalanced sample sizes.
- There are some outliers affecting the comparisons. I ran PCA and other analyses to check for outliers, and I found one extremely weird sample. Even after removing it, the weird p-value distribution didn't change.
- There is some batch effect which I'm not accounting for. Honestly, this might be the only explanation. On one hand, I can't find any batch effect in my samples that causes any kind of "unknown group distinction" on the PCA. On the other hand, a batch or covariate effect might be affecting individual samples in isolation, so they would not group into "clusters".
I'm almost sure it is not a problem with library sizes, since they look fine. As a side note, principal components 1 and 2 of the PCA, which I used for the exploration, had extreme values (between -100 and 100). As far as I know, these values seem to be way too high. I also ran a t-test on the log2CPM values, just to check the distribution of the p-values, and it follows exactly the same distribution.
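In case it helps, the t-test sanity check could be sketched roughly like this with NumPy and SciPy. This is an assumption-laden sketch, not the exact code I ran: the data are simulated, the helper names (`log2_cpm`, `welch_pvalues`) are made up, the log2CPM here is a simplified version rather than edgeR's own calculation, and I use Welch's unequal-variance t-test as one reasonable choice.

```python
import numpy as np
from scipy import stats


def log2_cpm(counts: np.ndarray, prior_count: float = 0.5) -> np.ndarray:
    """Simplified log2 counts-per-million for a genes x samples matrix.

    Not identical to edgeR's log-CPM, which scales the prior count
    by library size; this version adds the same prior to every count.
    """
    counts = np.asarray(counts, dtype=float)
    lib_sizes = counts.sum(axis=0)  # total counts per sample (column sums)
    return np.log2((counts + prior_count) / lib_sizes * 1e6)


def welch_pvalues(logcpm: np.ndarray, is_br: np.ndarray) -> np.ndarray:
    """Per-gene two-sided Welch t-test p-values, BR samples vs GR samples."""
    is_br = np.asarray(is_br, dtype=bool)
    result = stats.ttest_ind(
        logcpm[:, is_br], logcpm[:, ~is_br], axis=1, equal_var=False
    )
    return result.pvalue


# Simulated counts with no real group difference (illustrative only).
rng = np.random.default_rng(0)
sim_counts = rng.negative_binomial(5, 0.05, size=(200, 31))
sim_is_br = np.array([True] * 11 + [False] * 20)  # 11 BR, 20 GR

sim_pvals = welch_pvalues(log2_cpm(sim_counts), sim_is_br)
sim_hist, _ = np.histogram(sim_pvals, bins=20, range=(0.0, 1.0))
print(sim_hist)
```

Because the simulated groups are drawn from the same distribution, the histogram here should look roughly flat, which gives a baseline to compare against the skewed one from the real data.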
Any idea where the problem might be? Could it be due to a batch effect that I am not taking into account? Thanks!!
If you're confident in the experimental design and analysis, then this would indicate two groups with few DEGs between them? For example, I have an inducible shRNA setup with a control (shCtrl) and a gene-targeted construct (shGene). The inducible system is a little leaky, so I see some differences between shCtrl and shGene even when uninduced, but only very few are significant.