Differential expression: High variability inter-samples

5.9 years ago

VHahaut ★ 1.2k

Hi!

We recently had to run a differential expression analysis involving RNA-seq data from ~20 tumors against several controls. Our final goal was to extract the main differences between our cases and controls. While running DESeq2 on these samples we observed relatively high variation among the cases (which was expected), whereas the controls were quite similar to one another.

The analysis revealed ~3,000 genes differentially expressed between our two conditions. However, when we looked at the read counts of these differentially expressed genes, we saw that only a subset of the samples were expressing them. In other words, for most differentially expressed genes only a subset of our cases is driving the signal. We are afraid that the genes we uncover do not reflect "cases vs controls" but are mainly due to variability among the cases. The issue seems to lie in the case group, which is too heterogeneous (time of diagnosis, drugs, ...). Unfortunately it is not possible to regenerate a more homogeneous dataset.
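One way to put a number on "only a subset of cases drives the signal" is to count, per gene, how many cases fall outside the range seen in the controls. The following is a minimal Python sketch of that idea, not something DESeq2 provides; the function name `fraction_of_cases_outside_controls`, the `margin` parameter, and the toy values are all assumptions made for illustration.

```python
from typing import Sequence


def fraction_of_cases_outside_controls(
    case_values: Sequence[float],
    control_values: Sequence[float],
    margin: float = 0.0,
) -> float:
    """Return the fraction of cases lying outside the control range.

    `margin` (hypothetical parameter) widens the control range on both sides.
    """
    low = min(control_values) - margin
    high = max(control_values) + margin
    outside = sum(1 for v in case_values if v < low or v > high)
    return outside / len(case_values)


# Toy normalized counts for a single gene (made-up numbers).
controls = [10.0, 12.0, 11.0, 9.0]
cases = [11.0, 10.0, 250.0, 300.0, 12.0, 9.5, 10.5, 280.0]

frac = fraction_of_cases_outside_controls(cases, controls)
print(f"{frac:.3f} of cases fall outside the control range")
```

Genes for which this fraction is small are candidates for being driven by a few samples rather than by the condition as a whole. With only a handful of controls the min/max range is a crude reference, so this is best treated as a screening aid.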

Our next two approaches will be:

Batch effect correction.
Run the analysis several times with a subset of the case samples and compare the results.
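The second idea, rerunning the analysis on subsets of the cases and comparing the results, can be sketched as follows. This is only an illustration under stated assumptions: Welch's t statistic on log-scale values stands in for the real DESeq2 test, and the names `welch_t`, `subsample_stability`, the cutoff of 2.0, and the toy data are hypothetical.

```python
import math
import random
import statistics
from typing import Dict, List


def welch_t(a: List[float], b: List[float]) -> float:
    """Welch's t statistic for two samples (a stand-in for a real DE test)."""
    se = math.sqrt(statistics.variance(a) / len(a)
                   + statistics.variance(b) / len(b))
    if se == 0.0:
        return 0.0
    return (statistics.mean(a) - statistics.mean(b)) / se


def subsample_stability(
    cases: Dict[str, List[float]],
    controls: Dict[str, List[float]],
    subset_size: int,
    n_rounds: int = 200,
    t_cutoff: float = 2.0,  # arbitrary illustrative threshold
    seed: int = 0,
) -> Dict[str, float]:
    """Fraction of subsampling rounds in which each gene passes the cutoff."""
    rng = random.Random(seed)
    n_cases = len(next(iter(cases.values())))
    hits = {gene: 0 for gene in cases}
    for _ in range(n_rounds):
        # Draw the same random subset of case samples for every gene.
        idx = rng.sample(range(n_cases), subset_size)
        for gene, values in cases.items():
            subset = [values[i] for i in idx]
            if abs(welch_t(subset, controls[gene])) >= t_cutoff:
                hits[gene] += 1
    return {gene: count / n_rounds for gene, count in hits.items()}


# Toy log2 expression values (made-up): geneA is shifted in every case,
# geneB is elevated in only 4 of the 10 cases.
cases = {
    "geneA": [8.0, 8.2, 7.9, 8.1, 8.3, 7.8, 8.0, 8.1, 7.9, 8.2],
    "geneB": [5.0, 5.1, 4.9, 5.2, 4.8, 5.0, 7.0, 7.2, 6.8, 7.1],
}
controls = {
    "geneA": [5.0, 5.1, 4.9, 5.2],
    "geneB": [5.0, 5.1, 4.9, 5.2],
}

stability = subsample_stability(cases, controls, subset_size=6)
for gene, frac in stability.items():
    print(f"{gene}: called in {frac:.0%} of subsamples")
```

A gene that is called in nearly every subsample reflects a difference shared across the cases, whereas a gene whose call depends on which samples were drawn is likely driven by a few of them.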

Does anyone have a comment or a solution (if one exists) for extracting the main signal without being misled by the inter-sample variability?

Thank you in advance!

DESeq2 edgeR limma • 1.3k views

Why would you do batch correction if there is only one batch? Were you going to do some latent variable discovery on the raw counts, like with SVA?

It would also be interesting to hear how you generated your raw counts and what low-count (or other) filtering you did prior to normalisation.

Also, what was your design model? What did the PCA bi-plots reveal? How did the dispersion plot look?

Just out of curiosity: if there really is a lot of variability, then I would have thought that some of the genes would have failed either the independent filtering or the Cook's distance outlier test. Both are controlled through the results() function.

ADD REPLY • link 5.9 years ago by Kevin Blighe 87k