Question: Methods for analyzing and correcting RNA-seq dataset balance
17 months ago, abe (10) wrote:

I’m working with an imbalanced set of cell line RNA-seq data with small group sizes. For example, I have four samples exposed to condition A, two samples exposed to condition B, two samples exposed to condition C, etc. Group sizes range from two to five. Each group was run as a single batch, but different groups were run in different labs. Obviously this isn’t ideal, but some reads had to be discarded due to significant quality issues and it wasn't possible to run all samples in the same lab.

I’m looking for literature that a) provides insight into best practices for balancing this type of data, and b) ideally describes a way to characterize dataset balance before and after balancing.

For context, I am ultimately seeking to conduct differential expression analysis. I do know that tools like DESeq2 are supposed to be valid for imbalanced data, but I’m wondering at what point a dataset is too imbalanced and either requires correction or is simply not usable.

I could also extend this question to batch effect correction. How do you characterize a dataset as being too imbalanced for a tool like ComBat? What’s the cutoff? Note that I am aware that including batch as a covariate in DE analysis is preferred over removing batch effects with ComBat. I’m just interested in learning the best practices for defining how balanced an RNA-seq dataset is.


I would ask this on the Bioconductor support forum: https://support.bioconductor.org/t/Latest/

When you ask it there, please mention that you first asked here, and provide the link.

written 17 months ago by Kevin Blighe (60k)
17 months ago, Charles Warden (7.7k), Duarte, CA, wrote:

I think these things can sometimes be hard to precisely determine. However, if you have a highly asymmetric gene list, then it might be worth checking if you have substantially more samples in the group that is relatively up-regulated.
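One rough way to run that check, as a sketch only: count the significant up- and down-regulated genes, then compare the direction of the skew with your group sizes. The tuple layout, the cutoffs (adjusted p < 0.05, |log2 fold change| >= 1), and the function name are illustrative assumptions, not the output format or API of any particular DE tool.

```python
def direction_summary(results, padj_cutoff=0.05, lfc_cutoff=1.0):
    """Count significant up- and down-regulated genes.

    `results` is a list of (gene, log2_fold_change, adjusted_p) tuples.
    Genes with a NaN adjusted p-value are never counted, because any
    comparison with NaN is False.
    """
    up = sum(1 for _, lfc, padj in results
             if padj < padj_cutoff and lfc >= lfc_cutoff)
    down = sum(1 for _, lfc, padj in results
               if padj < padj_cutoff and lfc <= -lfc_cutoff)
    total = up + down
    fraction_up = up / total if total else float("nan")
    return {"up": up, "down": down, "fraction_up": fraction_up}


# Toy results table (made-up numbers, for illustration only).
toy_results = [
    ("g1", 2.1, 0.001),
    ("g2", 1.5, 0.010),
    ("g3", 3.0, 0.040),
    ("g4", -1.2, 0.020),
    ("g5", 0.3, 0.500),
]
summary = direction_summary(toy_results)
print(summary)
```

If `fraction_up` is far from 0.5 and the up-regulated group is also the larger one, that is a reason to look more closely. It is not proof of a problem on its own.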

I would also recommend testing different methods for every project (so, for your particular set of samples, maybe some methods can better handle your unbalanced design than others).

That said, you need to critically assess your results: if you don't have a way to test the effect of the ComBat correction, then I would be skeptical of that result. I would also be cautious because normalization methods can sometimes over-fit, which can actually add bias to your results.

To be honest, I typically use multivariate differential expression models instead of ComBat and, if you have discrete groups, try visualizing expression that is centered within each group you want to correct for (although you could also visualize expression before and after the ComBat adjustment). The covariate-centered visualization won't make sense if you only have one representative sample per group. I would also be suspicious of any method that claims to correct for a variable that isn't randomized across whatever you want to adjust for (or where only one sample represents the interaction of two variables).
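To make those two points concrete, here is a minimal sketch, not a definitive implementation. It tabulates samples per (condition, lab) pair, flags designs where every lab contains only one condition (so a lab covariate cannot be separated from condition), and centers one gene's log-scale expression within each lab. The function names and the lab labels `lab1` to `lab3` are hypothetical; the group sizes follow the example in the question.

```python
from collections import Counter, defaultdict


def crosstab(conditions, batches):
    """Count samples for each (condition, batch) pair."""
    return Counter(zip(conditions, batches))


def condition_nested_in_batch(conditions, batches):
    """Return True if every batch contains samples from only one condition.

    In that case condition is fully determined by batch, so a model with
    both terms cannot estimate the two effects separately.
    """
    seen = defaultdict(set)
    for cond, batch in zip(conditions, batches):
        seen[batch].add(cond)
    return all(len(conds) == 1 for conds in seen.values())


def center_by_group(values, groups):
    """Subtract each group's mean from that group's values (one gene).

    `values` should already be on a log-like scale. A group with a single
    sample is centered to exactly zero, which is why this view is not
    informative for singleton groups.
    """
    sums = defaultdict(float)
    counts = Counter(groups)
    for value, group in zip(values, groups):
        sums[group] += value
    return [value - sums[group] / counts[group]
            for value, group in zip(values, groups)]


# Design resembling the question: each condition was run in its own lab.
conditions = ["A"] * 4 + ["B"] * 2 + ["C"] * 2
labs = ["lab1"] * 4 + ["lab2"] * 2 + ["lab3"] * 2

table = crosstab(conditions, labs)
confounded = condition_nested_in_batch(conditions, labs)

# Toy log-expression values for one gene in four samples from two labs.
centered = center_by_group([5.0, 7.0, 10.0, 12.0],
                           ["lab1", "lab1", "lab2", "lab2"])

print(dict(table))
print("condition nested in lab:", confounded)
print(centered)
```

For the design in the question the nesting check returns `True`, which is the situation where I would not trust any adjustment for lab. Centering by lab there would also remove the condition differences.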
