Question

Is my data too noisy for DESeq? Should I model noise as unwanted variation?

0

11 months ago

Ben ▴ 20

I am trying to relate a factor (sensitivity) to gene expression. I have ~40 lung cancer samples, each a different cell line, from a few lung cancer subtypes.

When I model my known clinical factors by variance partition to examine the variation explained by each, I see that the vast majority of the variance in gene expression is explained by unknown factors and ends up in the residuals. It seems my data is very noisy (which is probably to be expected, as each sample comes from a different primary tumour).

My question is: should I model these residuals / noise as unwanted variation using RUV or similar, with the aim of increasing the variance explained by sensitivity and separating the samples more strongly by sensitivity on PCA, and then run DESeq with the RUV factors in my design formula?

Or do I have to accept that my data is too noisy for DESeq analysis and look elsewhere for markers of sensitivity?

Crosspost on Bioconductor: https://support.bioconductor.org/p/9155039/

rna-seq deseq • 1.7k views

ADD COMMENT • link updated 11 months ago by LauferVA 4.5k • written 11 months ago by Ben ▴ 20

0

How are you calculating these residuals?

ADD REPLY • link 11 months ago by i.sudbery 20k

0

variancePartition package in R : https://bioconductor.org/packages/release/bioc/html/variancePartition.html

ADD REPLY • link 11 months ago by Ben ▴ 20

0

do you really have 1 sensitive sample and 3 moderate? if so, you may have a bad time.

with regard to the variability, gene expression generally is ... variable

because you are looking at a somatic state, stratification by or controlling for sample genotype may help you. the idea here would be that, because tumor samples can have large-scale changes to their chromosomal complement, stratification by the most common chromosome or chromosome-arm level changes could (possibly) increase statistical power.

however, if you do truly have only one sensitive sample, you won't be able to make that info useful, even if it would otherwise be effective...

ADD REPLY • link 11 months ago by LauferVA 4.5k

0

I was considering grouping moderate with sensitive, so 4 'sensitive' and remaining resistant. Small sample sizes unfortunately.

Re: stratification by chromosome - how would I use this info to stratify my samples? Would I identify most common chromosome (or chromosome arm) for each sample, and group them in this way? And then identify the variance attributed to each chromosome arm group?

If you have an example paper where this has been used to help me understand, that'd be amazing :)

ADD REPLY • link 11 months ago by Ben ▴ 20

0

Re moderate + sensitive - yes, if that's the best you can do, do that; you could also code the response as 1 for sensitive and 0.5 for moderate (with resistant being 0).
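A minimal sketch of the two codings, in plain Python for illustration only. The sample labels are hypothetical; in R the resulting vector would simply become a column of the sample table.

```python
# Hypothetical sensitivity labels for a handful of cell lines (illustration only).
labels = ["resistant", "moderate", "sensitive", "resistant", "moderate"]

# Option 1: collapse moderate into sensitive, giving a two-level grouping.
binary = ["resistant" if lab == "resistant" else "sensitive" for lab in labels]

# Option 2: graded numeric score, resistant = 0, moderate = 0.5, sensitive = 1.
score_map = {"resistant": 0.0, "moderate": 0.5, "sensitive": 1.0}
graded = [score_map[lab] for lab in labels]

print(binary)
print(graded)
```

The graded version treats moderate as exactly halfway between the other two groups, which is itself an assumption worth checking against the underlying sensitivity measurements.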

i don't have a citation handy beyond the literature that discusses stratified versus pooled analysis generally, but the idea is the same. there are definitely good videos on performance of stratified analysis vs. pooled in a variety of contexts that will let you see that concept in action. perhaps another reader will have a domain specific example, but i don't think one is strictly necessary.

most cancers have such large-scale genomic changes. consider going to cBioPortal, selecting a large study, selecting only samples with mRNA data, then viewing the CNA track for that malignancy.

ADD REPLY • link 11 months ago by LauferVA 4.5k

0

Okay, that makes sense, I'll take a look at some of those videos - thanks for your suggestions :)

ADD REPLY • link 11 months ago by Ben ▴ 20

0

Also, I wanted to mention that I have other sets of comparable data from other cancer lineages, that have more even groups (15 sensitive and 15 resistant), if that makes a difference to your answer.

ADD REPLY • link 11 months ago by Ben ▴ 20

0

that makes a tremendous difference.

here the answer is, you process both datasets as similarly as possible, then you organize all the data you have into a meta-analysis and analyze all the data you have jointly.

the issue is that there will likely be batch effects that generate spurious (false positive and false negative) results, because those results are driven by differences between batches, not real biology.

thus, you need to employ techniques that control for this. there are many options. for instance, you could control for batch as a covariate, and see if that brings the sensitive and resistant samples in line with each other, etc.
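As a rough illustration of what "batch as a covariate" means, the sketch below builds the design matrix that an additive model with batch and sensitivity terms corresponds to (in DESeq2 this would be a design along the lines of `~ batch + sensitivity`). The sample annotations are made up, and this is plain Python for illustration, not what DESeq2 runs internally.

```python
# Hypothetical sample annotations: (dataset/batch, sensitivity group).
samples = [
    ("A", "resistant"), ("A", "resistant"), ("A", "sensitive"),
    ("B", "resistant"), ("B", "sensitive"), ("B", "sensitive"),
]

def design_row(batch, group):
    """One row of the design matrix: intercept, batch B vs A, sensitive vs resistant."""
    return [1, int(batch == "B"), int(group == "sensitive")]

design = [design_row(b, g) for b, g in samples]
for (b, g), row in zip(samples, design):
    print(b, g, row)

# Crude confounding check: if any batch contains only one group,
# the batch effect cannot be separated from the sensitivity effect.
groups_per_batch = {}
for b, g in samples:
    groups_per_batch.setdefault(b, set()).add(g)
confounded = any(len(groups) < 2 for groups in groups_per_batch.values())
print("confounded:", confounded)
```

The check at the end matters for pooling: the batch term only helps if every dataset contributes both sensitive and resistant samples.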

ADD REPLY • link 11 months ago by LauferVA 4.5k

0

to date, every analysis i've ever done (whether expression data or otherwise) has benefited from data pooling followed by application of meta-analytic techniques, but it's a lot more work.

ADD REPLY • link 11 months ago by LauferVA 4.5k

score 3 · Answer 1 · 2023-11-08

3

11 months ago

i.sudbery 20k

RNAseq count data is distributed as a negative binomial. That means that even after accounting for any possible systematic sources of variation (known or unknown), we expect the shot variance to be at least as big as the mean, simply from the act of randomly sampling fragments from the test tube. Add to this the variation from the vagaries of library prep etc., and residual variance is always going to be large.

Look at it this way: even if you sequence the same sample twice, we expect the standard deviation for any one gene to scale as the square root of the mean count for that gene.
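To put rough numbers on this: under the negative binomial parameterisation that DESeq2 uses, a count with mean μ has variance μ + αμ², where α is the dispersion. Technical replicates of the same library are close to α = 0 (Poisson), so the standard deviation is √μ; biological replicates add the αμ² term. A small sketch, with α = 0.1 chosen purely as an illustrative value:

```python
import math

def nb_variance(mean, dispersion):
    """Negative binomial variance: mean + dispersion * mean^2."""
    return mean + dispersion * mean ** 2

def cv(mean, dispersion):
    """Coefficient of variation (standard deviation divided by the mean)."""
    return math.sqrt(nb_variance(mean, dispersion)) / mean

for mean in (10, 100, 1000, 10000):
    technical = cv(mean, 0.0)   # Poisson-like shot noise only
    biological = cv(mean, 0.1)  # illustrative dispersion, an assumption
    print(f"mean={mean:>6}  CV technical={technical:.3f}  CV biological={biological:.3f}")
```

With α = 0 the relative noise keeps shrinking as counts grow; with α > 0 it levels off near √α (about 0.32 here), so deeper sequencing alone does not remove it.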

PCA isolates patterns in the covariance matrix - places where the variance isn't random. But you've only got 5 samples that aren't "resistant", so there would have to be a pretty strong variance relationship between those 5 samples to make it the most striking feature of the dataset when they are only a minority of samples.

You ask whether you should accept that this data is too noisy for DESeq, but have you tried to find out whether DESeq pulls out any genes? Given that you definitely do have some variance explained by sensitivity, I would expect you to find some DE genes.

There is also no harm in trying something like RUV or SVA to see if there are systematic sources of variance that can be removed, although I wouldn't put much stock in it actually improving things much.

ADD COMMENT • link 11 months ago by i.sudbery 20k

1

Thanks for the explanation, that really helps my intuition. I have tried to run DESeq without any adjustments, and was consistently pulling out lots of DEGs associated with particular gene sets (e.g. cell-cell adhesion molecules, ribosomes) which don't match with any literature / expectations for the samples I am studying, so I wanted to investigate potential sources of bias, batch effects, variation, etc I can adjust for.

My plan now is to run RUV, find k that shows separation on PCA by sensitivity, run DESeq with RUV vectors in design and without and see if I pull out the same DEGs / gene sets on GSEA. Sound reasonable?
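One simple way to make the "same DEGs with and without RUV" comparison concrete is an overlap score between the two result lists. A minimal sketch, assuming each DESeq run has been reduced to a set of significant gene IDs; the IDs below are placeholders, not real results:

```python
def jaccard(a, b):
    """Jaccard index: size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# Placeholder gene IDs standing in for the significant genes from each run.
degs_without_ruv = {"gene1", "gene2", "gene3", "gene4"}
degs_with_ruv = {"gene3", "gene4", "gene5"}

shared = sorted(degs_without_ruv & degs_with_ruv)
only_without = sorted(degs_without_ruv - degs_with_ruv)
only_with = sorted(degs_with_ruv - degs_without_ruv)

print("shared:", shared)
print("only without RUV:", only_without)
print("only with RUV:", only_with)
print("Jaccard:", jaccard(degs_without_ruv, degs_with_ruv))
```

The same function works on lists of enriched gene sets, which is probably the more informative comparison if individual genes shift around the significance threshold.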

ADD REPLY • link 11 months ago by Ben ▴ 20

1

It's certainly worth a try. You should also be aware that GO analysis of RNAseq results is often biased by gene length and expression level. GSEA is a bit better, but there is still some evidence that gene sets that contain longer/more highly expressed genes are more likely to be enriched. Ribosomal biogenesis is a classic example of this. Ribosomal biogenesis is also a known cancer-related pathway, as is cell-cell adhesion, so unless the enrichment is done against a proper background set, one can also see where these might come from.
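To show how much the background set can matter, here is a minimal sketch of a one-sided hypergeometric over-representation test run against two different backgrounds. All the counts are hypothetical, and this only illustrates the background issue; it does nothing about gene length or expression level bias.

```python
from math import comb

def enrichment_pvalue(k, n, K, N):
    """P(X >= k) when n DE genes are drawn from N background genes, K of which are in the gene set."""
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)

# Hypothetical counts: 200 DE genes, 15 of which fall in the gene set of interest.
k, n = 15, 200

# Background 1: every annotated gene (20000), 80 of them in the set.
p_all_genes = enrichment_pvalue(k, n, K=80, N=20000)

# Background 2: only genes expressed in these samples (12000), 70 of them in the set.
p_expressed = enrichment_pvalue(k, n, K=70, N=12000)

print(f"all annotated genes as background: p = {p_all_genes:.2e}")
print(f"expressed genes as background:     p = {p_expressed:.2e}")
```

Restricting the background to genes that could actually have been called DE makes the same overlap look less surprising, which is the sense in which an improper background inflates enrichment.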