3.7 years ago
Steve Pederson
We have an RNA-seq dataset where PC1 primarily captures library size (54% of variance), even after TMM normalisation. Our proposed solution is to use RUV (k=1) to add a single term (W_1) to the design matrix. After fitting the model using RUVg, W_1 effectively captures PC1, which is pretty much library size, and which is exactly what we think we want.
Can someone please help me understand why this is a bad idea? Should I just add log10(lib.size) or PC1 to the design matrix instead? The results are near identical.
(In case you're wondering why it's a bad idea: Reviewer #2 says so.)
This sounds like you have undersequenced one group of samples. Is sequencing depth confounded with treatment groups? Have you considered just sequencing the low-depth samples deeper rather than messing with batch correction strategies here? Please give some metadata: what are the samples, how deep were they sequenced, which species, etc. Typically normalization should remove any library size differences unless you sequenced one group so shallowly that dropouts (that is, many zeros) occur. Please also share plots and relevant code.
Thanks for the reply, but fortunately there is no confounding with any treatment groups. Library sizes range from 13.5 million to 22.5 million. They are spread relatively evenly across the range, with most above 17 million and only two below.
That does not sound too bad, which means that something unusual is going on in your data. I think with the given information there is not much more to say unless you add details. It would be good to know exactly what the reviewer said and what they suggested instead. I've never seen confounding based on library sizes after normalization, especially not with total counts being relatively similar, but as said, plots and code may help here.
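For anyone wanting to put a number on the "PC1 is pretty much library size" claim before sharing plots, here is a minimal sketch of one possible check. It is not the original poster's code: it is plain Python rather than the R tooling mentioned in the thread, the function names `pearson` and `pc1_vs_library_size` are made up for illustration, and the data are synthetic, with a depth-dependent effect deliberately built in so that PC1 tracks log10 library size. With real data you would pass in your own normalised log-expression matrix and library sizes.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def pc1_vs_library_size(log_expr, lib_sizes, n_iter=500):
    """Return (|cor(PC1, log10 lib size)|, fraction of variance on PC1).

    log_expr: list of genes, each a list of per-sample log-expression values.
    lib_sizes: per-sample total read counts.
    """
    n = len(lib_sizes)
    # Centre each gene across samples, as PCA on samples requires.
    centred = [[x - sum(row) / n for x in row] for row in log_expr]
    # Sample-by-sample Gram matrix; its top eigenvector is proportional
    # to the PC1 sample scores.
    gram = [[sum(row[i] * row[j] for row in centred) for j in range(n)]
            for i in range(n)]
    # Power iteration from a non-constant start vector.
    v = [float(i) - (n - 1) / 2 for i in range(n)]
    for _ in range(n_iter):
        w = [sum(gram[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient gives the top eigenvalue (v has unit length).
    eigval = sum(v[i] * sum(gram[i][j] * v[j] for j in range(n))
                 for i in range(n))
    var_frac = eigval / sum(gram[i][i] for i in range(n))
    log_lib = [math.log10(s) for s in lib_sizes]
    return abs(pearson(v, log_lib)), var_frac


# Synthetic toy data (an assumption, not the poster's dataset): 12 samples
# with library sizes in the range quoted above, and 200 genes whose
# log-expression depends on sequencing depth through a gene-specific slope.
rng = random.Random(0)
lib_sizes = [rng.uniform(13.5e6, 22.5e6) for _ in range(12)]
mean_depth = sum(math.log10(s) for s in lib_sizes) / len(lib_sizes)
log_expr = []
for _ in range(200):
    base, slope = rng.gauss(5, 2), rng.gauss(0, 3)
    log_expr.append([base + slope * (math.log10(s) - mean_depth)
                     + rng.gauss(0, 0.1) for s in lib_sizes])

r, frac = pc1_vs_library_size(log_expr, lib_sizes)
print(f"|cor(PC1, log10 lib size)| = {r:.2f}, PC1 variance fraction = {frac:.2f}")
```

A high correlation here only confirms that PC1 and library size carry similar information in the toy data. It says nothing about why that would happen after normalisation in a real dataset, which is the open question in this thread.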