Hi everyone,
I’m analyzing a RNAseq dataset and I’ve come across something that is giving me some headaches.
First some background: I have eight cyanobacterial samples from two different strains (four vs four) that show a (rather small) phenotypic difference under the same experimental conditions. My goal is to identify differentially expressed genes that can potentially explain this phenotypic difference. Though I wasn’t expecting to find a lot of DGEs since the strains are very similar.
Now the data; After aligning and counting, I found the total number of counts differs between samples (figure 1A), as it does too the ribosmal RNA depletion efficiency (figure 1B). All normal up to here, the rRNA depletion kits are not optimized for cyanobacteria and it is not the first time I see this discrepancy in efficiency between samples. A couple more observations, the percentage of rRNA correlates with the total number of counts (figure 1C) and after removing the rRNA counts, all samples have similar number of counts (figure 1D).
I removed the rRNA counts and used DESeq2 to perform variance stabilization on the counts and perform PCA to see how the samples were clustering (figure 2). PC1 explains ~ 50% of the variance and, at first glance, it Iooked like it was separating the strains with the exception of S4. An outlier I thought.
But then I looked again at the percentage of rRNA in each sample and realized the samples’ clustering in PC1 could also be explained by the rRNA percentage, I plotted this and the correlation is there (figure 3).
So I’m wondering a few things.
- Does it make any sense that most of the variation in the data is explained by the percentage of rRNA?
- Or is this just an artifact?
- If it is not, how can this be explained?
- Does it make the data unusable for my original question?
Thank you in advance!
Note: in case it matters, the ribosomal depletion was performed with MICROBExpress kit.
Thanks for you answer.
As for your practical note, that indeed sounds wise, will definitely implement it.
On the other topic, I gave a try to your suggestion but still get a very similar score plot for the PCA. But I thought removing rRNA before normalization was a common practice, isn’t it that the case? I’ve even seen pipelines where the rRNAs are removed at the counting step. Or is this done under the assumption that all samples have equal rRNA depletion efficiency?
In any case, after trying a few more things, I just came to terms with the fact that my two experimental conditions have very similar gene expression patterns and that’s why the rRNA depletion efficiency pops up as the highest source of variation. What I still cannot understand is how this efficiency can introduce such a bias, specially when the samples have a similar number of counts after removing the rRNA…
Anyway, thanks for your help.
It is, and it's the right way to do it. My suggestion was not a good one. Best to remove either before normalization, or after normalization (e.g., exclude from the TopTable); but not "use for library size and then remove and then normalize" which is...weird.
Now that you've removed the rRNA -- what happens if you run
fGSEA
on your PC1 loadings?If I run
fGSEA
on PC1 loadings I get a few significant (adj P-value < 0.01) terms, 7 when I use KEGG ids and 2 when I use GO terms (Biological Process).I've also run
fGSEA
on the LFCs from DESeq2. I includedrrna_proportion
in the design as you suggested. Then I extracted LFCs for bothstrain
orrrna_proportion
separately. I get way more DEGs (adj P-value < 0.01) forrrna_proportion
than forstrain
(1233 vs 89 genes) and the LFCs distribution is also very different (see volcano plot).But when I run
fGSEA
on thestrain
LFCs I get a bunch of significant terms (13 from KEGG, 4 GO BP) most of them overlapping with the ones from PC1 loadings. And when I runfGSEA
on therrna_proportion
LFCs I get nothing significant. I think it starting to make sense.Cool. I was more wondering whether the GO terms for PC1 were residual "translation"-type GO terms. You would expect the LFC distribution to be very different as
rrna_proportion
is continuous andstrain
is binary.