Weird p-value distribution on edgeR results
11 days ago
Guillermo ▴ 30

Hi everyone,

I have counts for 18160 genes that I'm studying. I would like to know whether there are differences in gene expression between two conditions: bad responders to a treatment (BR) and good responders to the same treatment (GR). The sample sizes are BR = 11 and GR = 20, which I would consider imbalanced.

For the gene expression analysis I'm following the edgeR pipeline, which I have used before. Once the analysis was done, I checked the distribution of the raw p-values (no correction applied), and I saw what you can see in the image below.

Distribution of edgeR p-values

The p-values were highly skewed towards 0. This is the first time I'm seeing this, and I wasn't sure about the reason behind this weird distribution. After some research, I found these options as possible reasons:

  • Some test criteria were not met. I doubt this, except for the imbalanced sample sizes.
  • There are some outliers affecting the comparisons. First, I ran a PCA and other analyses to check for outliers, and I found one extremely odd sample. Even after its removal, the weird p-value distribution didn't change.
  • There is some batch effect I'm not accounting for. Honestly, this might be the only explanation. On one hand, I can't find any batch effect in my samples causing any kind of "unknown group distinction" in the PCA. On the other hand, the batch (or covariate) effect might be affecting the samples in an isolated way, so they are not being grouped into clusters.

I'm almost sure it is not a problem with library sizes, since they look fine. As a side note, the PCA's principal components 1 and 2, which I used for the exploration, had extreme values (between -100 and 100); as far as I know, these values seem way too high. Also, I performed a t-test comparison on the log2 CPM values, just to check the distribution of the p-values, and it follows exactly the same distribution.

Any idea where the problem might be? Could it be due to a batch effect that I am not taking into account? Thanks!!

edgeR pvalue-distribution gene-expression

If you're confident in the experimental design and analysis, then this could simply indicate two groups with few true DEGs. For example, I have an inducible shRNA, Ctrl and Gene-targeted. The inducible aspect is a little leaky, so I see some differences between shCtrl and shGene even when uninduced, but only very few are significant.

10 days ago
Gordon Smyth ★ 8.2k

You say that PCA principal components were between -100 and 100. I don't know how you have constructed the PCA plot but, if you have used edgeR's plotMDS() function, then such a scale for the x and y dimensions is impossible for reasonably behaved RNA-seq data. This suggests there is a serious issue of some sort with your data.

It is very unlikely that the issue has anything to do with imbalanced sample sizes, or library sizes, or lack of true DE. On the other hand, it is possible to get a p-value histogram like you show if there is a massive batch effect and the batch effect is balanced between conditions but very heterogeneous within conditions. Extreme outlier samples can also produce such a histogram, although they have to be pretty extreme. Problems with batch effects or outliers should be very evident from plotMDS.

Anyway, without any information other than a p-value distribution, we're just guessing. To get more help from me you'd need to provide code showing what sort of counts you're inputting to edgeR, which edgeR pipeline you're using and, most importantly, diagnostic plots from plotMDS, plotBCV and plotQLDisp.
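[For anyone following along: once the edgeR objects exist, these diagnostics are one-liners. The object names `dge` and `fit` below are placeholders, not from the thread.]

```r
library(edgeR)

# dge: a DGEList with dispersions estimated; fit: the output of glmQLFit()
plotMDS(dge)     # pairwise sample distances; batch effects and outliers show up here
plotBCV(dge)     # biological coefficient of variation against abundance
plotQLDisp(fit)  # quasi-likelihood dispersions after glmQLFit
```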

When I analyse noisy human RNA-seq data, I generally use limma-voomLmFit with sample.weights=TRUE and robust=TRUE to allow for outlier samples and outlier genes. We are working to incorporate sample weights into edgeR as well, but have not made it public yet. In the meantime, I would suggest limma for this sort of data. You still have to diagnose the data quality issue though, limma won't be just a plug-in solution.
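[A minimal sketch of the voomLmFit route described above, assuming a filtered, TMM-normalised DGEList `dge` and a model matrix `design` with the two group columns first; the object names and the contrast are illustrative assumptions, not code from the answer.]

```r
library(edgeR)   # voomLmFit() lives in recent versions of edgeR; limma supplies the fit/eBayes machinery

# Downweight outlier samples via sample.weights; robust eBayes handles outlier genes
fit <- voomLmFit(dge, design = design, sample.weights = TRUE)
fit <- contrasts.fit(fit, contrasts = c(1, -1, rep(0, ncol(design) - 2)))
fit <- eBayes(fit, robust = TRUE)
topTable(fit, number = 10)
```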


Hi Gordon! Thanks for answering.

Originally I only performed PCA on the log2 CPM values. I'm afraid I forgot to use the MDS plot to check the distribution of the samples based on their pairwise distances. Looking at the results, it seems that some samples group together, probably due to a batch effect whose source I don't know. I guess this could be the reason behind the weird p-value distribution, but I'm not sure.

MDS plot (top 1000 genes): BR = bad responders, GR = good responders, SLE = treatment with SLE, BYP = treatment with BYP [image]

BCV plot: [image]

Quasi-likelihood dispersion plot: [image]

The (simplified) pipeline I've followed has been:

library(edgeR)

# Group means for Type, plus surgery and age (Edad) as covariates
mod_design <- model.matrix(~ 0 + metadata[,'Type'] + metadata[,'Surgery'] + metadata[,'Edad'])
edge_list <- DGEList(counts=raw_counts, group=metadata[,'Type'])
keep <- filterByExpr(edge_list)
edge_list_keep <- edge_list[keep, , keep.lib.sizes=FALSE]
edge_list_tmm <- normLibSizes(edge_list_keep, method="TMM")
# Fixed here: the original called estimateGLMCommonDisp() on edge_ed before it
# existed; dispersion estimation has to start from the TMM-normalised object
edge_ed <- estimateGLMCommonDisp(edge_list_tmm, mod_design)
edge_ed <- estimateGLMTrendedDisp(edge_ed, mod_design, method='bin.spline')
edge_ed <- estimateGLMTagwiseDisp(edge_ed, mod_design, trend=TRUE)
mod_fit <- glmQLFit(edge_ed, mod_design, prior.count=0.125)
# Test the first group column against the second
mod_out <- glmQLFTest(mod_fit, contrast=c(1, -1, rep(0, ncol(mod_design)-2)))

I will take a look at the limma voomLmFit function; it looks interesting. Any suggestions on where to start? I have never used limma-voom, though I know it can be used alongside edgeR.


No outliers. Very clear hidden variable that hasn't been accounted for.

voomLmFit will give the same results here as edgeR.


Thanks! The limma voomLmFit approach fixed the problem.

3 days ago

On the other hand, it is possible to get a p-value histogram like you show if there is a massive batch effect and the batch effect is balanced between conditions but very heterogeneous within conditions.

I have been struggling to see why that is the case. In case it helps anyone, consider this example:

[figure: two conditions with a small true difference, each split across the same widely separated batches]

There is a small but real difference between conditions. If you repeat this experiment several times the difference will stay there, but the variance induced by the batches will keep overwhelming the difference between conditions, leading to high p-values. This is different from having two conditions with large within-condition variance: in that case the histogram of p-values will look flat-ish, because you still get data points around the true mean of each condition. With batch effects as in the figure, you don't see data points near the true means.
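[To make this concrete, here is a base-R toy simulation of the scenario above; all numbers are invented for illustration. A true shift of 0.5 separates the conditions, and the same two large batch offsets appear in both conditions, so the batch structure is balanced between conditions but very heterogeneous within them.]

```r
set.seed(42)

n_sim <- 2000   # simulated features
n     <- 10     # samples per condition
shift <- 0.5    # small but real difference between conditions

# Half of each condition sits at -3, the other half at +3
offset <- rep(c(-3, 3), each = n / 2)

p_batch <- replicate(n_sim, {
  a <- rnorm(n, mean = 0)     + offset
  b <- rnorm(n, mean = shift) + offset
  t.test(a, b)$p.value
})

p_nobatch <- replicate(n_sim, {
  t.test(rnorm(n, mean = 0), rnorm(n, mean = shift))$p.value
})

# The batch spread inflates the variance estimate while the balanced offsets
# cancel out of the mean difference, so the tests become very conservative:
# p_batch piles up away from 0 relative to p_nobatch.
hist(p_batch)
hist(p_nobatch)
```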

11 days ago
LChart 5.0k

Even after its removal, the weird p-value distribution didn't change.

Did you re-run PCA after its removal to look for more outliers? Outlier removal tends to be an iterative process.

On the other hand, the batch, or covariates

"or covariates" — this is likely the culprit. Plot the covariates (barplot or boxplot, depending on data type) stratified by your groups. If one of them stratifies your groups, then you have a confounder. Confounders will deflate your p-value distribution in precisely this way.
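[A quick way to do this in base R; the `metadata` column names `Type`, `Surgery` and `Edad` are borrowed from the code posted elsewhere in the thread, so treat them as assumptions.]

```r
# Numeric covariate (age) stratified by group: look for a shift between BR and GR
boxplot(Edad ~ Type, data = metadata)

# Categorical covariate (surgery type) by group: look for uneven proportions
tab <- table(metadata$Type, metadata$Surgery)
barplot(t(tab), beside = TRUE, legend.text = TRUE)

# Optional formal check of the imbalance
chisq.test(tab)
```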


Hi there! Thanks for answering.

Yes, I kept performing PCA after each removal, looking for new outliers. The result was a never-ending process: I kept removing samples that looked like outliers, with really high values for each component. When I went from 11 and 20 samples down to 6 and 6, I decided to stop. After that point the results improved somewhat (the p-value distribution was a bit more uniform), but I was not comfortable with the repeated rounds of sample removal.

Regarding covariates, the problem is that I don't have any apart from age, gender (all the samples are women) and type of surgery (which I already included in the analysis as a possible confounder). This is an analysis we are performing for someone else, so we are working with everything they gave us. I will ask them for more data, but I suspect we already have all the information they collected.
